Closed replikation closed 4 years ago
see also #6
I can take care of this. Question is: Would you like to (1) just "sourmash search" through the database (I'll use the sequence Bloom tree for faster search) OR (2) use "sourmash gather" @replikation @Stormrider935 -- the difference is that (1) returns independent pairwise similarities for all genomes in the database, while the latter will "remove" those kmers from the query contigs, that have been confidently assigned to genomes. call it the "winner takes it all" strategy.
Also: is there already a list of accessions from #6 ?
@Stormrider935 will link today all the data he has in here.
Here the downloadlink for the phage-genomes they used: Phage_genomes
Format of the Pro-Phage-genome-multi fasta
phage_genome_fromACLAME.fasta
>mge:2008.1 Rhodobacter sphaeroides ATCC 17025, complete genome.
TTACGTCACCTCGTAGAGAGACCAGCGGACGACGCCCTGGATCCGTCTCAGGTATTGGTC
ACCTCTGAATGCCCACTTGAACCGGATCGTGGCACCGCCGAACCAGGTCGTGCGGTTGAT
Format of the Phage-genome-multi fasta
phage_genome_fromNCBI.fasta
>gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD1,__complete__genome
GCTACCTGGTGCACCGTGCCGCCTTGCCAGCCCAATGAGTGCATGAGATGTAAAGGAGTATTTGTCATGG
TATCGCCCTTTCTTTGATGTGCGCCCTGGTTGGCGCGGTAGAAAGATAATAGCCGTTTCCTTTATAGATA
Update:
Here the accession number list for the phage_genome_fromNCBI.fasta
accession_number_phage_genome_fromNCBI.txt
@Stormrider935 if there are any issues or we have questions about how they generated the 'databases', I have some connection to the last author (Gianni, HKI Jena) so could also ask him
i backtracked the paper where they got the Idea from. Thank you @hoelzer for the offer ;)
so here is my proposed code for searching phages w/ sourmash -- somebody please put this into some nextflow process :)
k=21 is more sensitive than the default 31, but we expect more distance btw/ phage genomes, so less is better.
scaled=100 means one out of 100 kmers is sampled; for the average 100k nt genome size for phages this makes for a signature size of (reasonable) 1k kmers per genome.
the sbt .. sequence bloom tree is just an index for fast search.
for assembled metagenomes I'd use sourmash search
and for unassembled ones sourmash gather
(see docs for what's the delta).
conda create -y -n phages python=3.7 && conda activate phages
conda install -y sourmash=2.3.0
wget http://147.8.185.62/VirMiner/downloads/phage_genome/public_phage_genomes.fasta
DB=public_phage_genomes.fasta
SCALE=100
KSIZE=21
sourmash compute --scaled $SCALE -k $KSIZE --singleton --seed 42 -p 8 -o phages.sig $DB
# calculated 4078 signatures for 4078 sequences in public_phage_genomes.fasta
sourmash index phages phages.sig
# create test seq, but first:
# turn multiline fasta into fmt where each sequence is on a single line
# https://www.biostars.org/p/9262/
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $DB | head -n 3 | tail -n 2 > test.fasta
sourmash compute --scaled $SCALE -k $KSIZE -o test.sig test.fasta
sourmash search test.sig phages.sbt.json
# 2 matches:
# similarity match
# ---------- -----
# 100.0% gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD...
# 45.0% gi|396576808|emb|FR775895.2|__Enterobacteria__phage__phi9...
sourmash gather test.sig phages.sbt.json
# overlap p_query p_match
# --------- ------- -------
# 147.2 kbp 100.0% 100.0% gi|1001940386|gb|KU522583.1|__Enterob...
#
# found 1 matches total;
# the recovered matches hit 100.0% of the query
my proposal would be now to add a dedicated database module which creates the DB via sourmash
in this way we could also add more data to the module/DB, e.g. via efetch from NCBI
also makes sense due to the way how nextflow handles things for parallelisation (e.g. use DB once and give it each sourmash process)
we add two sourmash modules into WtP, one for identification (like the other current six tools) and one to trying to type positive hits after samtools modules extract positive hits
in this way we could later separate predicted phages contigs via sourmash into two groups "highly similar to a known phage", "phages currently not available via DB".
@replikation the multiline2singleline awk magic is just so I can create a quick test file. it is not required for sourmash, but I just used it to generate a full example.
sourmash / kmer approach implementation
database
implementation