replikation / What_the_Phage

WtP: Phage identification via nextflow and docker or singularity
https://mult1fractal.github.io/wtp-documentation/
GNU General Public License v3.0
102 stars 15 forks source link

sourmash with kmer based json DB #40

Closed replikation closed 4 years ago

replikation commented 4 years ago

sourmash / kmer approach implementation

database

implementation

replikation commented 4 years ago

see also #6

phiweger commented 4 years ago

I can take care of this. Question is: Would you like to (1) just "sourmash search" through the database (I'll use the sequence Bloom tree for faster search) OR (2) use "sourmash gather" @replikation @Stormrider935 -- the difference is that (1) returns independent pairwise similarities for all genomes in the database, while the latter will "remove" those kmers from the query contigs, that have been confidently assigned to genomes. call it the "winner takes it all" strategy.

phiweger commented 4 years ago

Also: is there already a list of accessions from #6 ?

replikation commented 4 years ago

@Stormrider935 will link today all the data he has in here.

mult1fractal commented 4 years ago

Here the downloadlink for the phage-genomes they used: Phage_genomes

Format of the Pro-Phage-genome-multi fasta phage_genome_fromACLAME.fasta

>mge:2008.1 Rhodobacter sphaeroides ATCC 17025, complete genome.
TTACGTCACCTCGTAGAGAGACCAGCGGACGACGCCCTGGATCCGTCTCAGGTATTGGTC
ACCTCTGAATGCCCACTTGAACCGGATCGTGGCACCGCCGAACCAGGTCGTGCGGTTGAT

Format of the Phage-genome-multi fasta phage_genome_fromNCBI.fasta

>gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD1,__complete__genome
GCTACCTGGTGCACCGTGCCGCCTTGCCAGCCCAATGAGTGCATGAGATGTAAAGGAGTATTTGTCATGG
TATCGCCCTTTCTTTGATGTGCGCCCTGGTTGGCGCGGTAGAAAGATAATAGCCGTTTCCTTTATAGATA

Update: Here the accession number list for the phage_genome_fromNCBI.fasta accession_number_phage_genome_fromNCBI.txt

hoelzer commented 4 years ago

@Stormrider935 if there are any issues or we have questions about how they generated the 'databases', I have some connection to the last author (Gianni, HKI Jena) so could also ask him

mult1fractal commented 4 years ago

i backtracked the paper where they got the Idea from. Thank you @hoelzer for the offer ;)

phiweger commented 4 years ago

so here is my proposed code for searching phages w/ sourmash -- somebody please put this into some nextflow process :)

k=21 is more sensitive than the default 31, but we expect more distance btw/ phage genomes, so less is better.

scaled=100 means one out of 100 kmers is sampled; for the average 100k nt genome size for phages this makes for a signature size of (reasonable) 1k kmers per genome.

the sbt .. sequence bloom tree is just an index for fast search.

for assembled metagenomes I'd use sourmash searchand for unassembled ones sourmash gather (see docs for what's the delta).

conda create -y -n phages python=3.7 && conda activate phages
conda install -y sourmash=2.3.0

wget http://147.8.185.62/VirMiner/downloads/phage_genome/public_phage_genomes.fasta
DB=public_phage_genomes.fasta
SCALE=100
KSIZE=21

sourmash compute --scaled $SCALE -k $KSIZE --singleton --seed 42 -p 8 -o phages.sig $DB
# calculated 4078 signatures for 4078 sequences in public_phage_genomes.fasta
sourmash index phages phages.sig

# create test seq, but first:
# turn multiline fasta into fmt where each sequence is on a single line
# https://www.biostars.org/p/9262/
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $DB | head -n 3 | tail -n 2 > test.fasta

sourmash compute --scaled $SCALE -k $KSIZE -o test.sig test.fasta
sourmash search test.sig phages.sbt.json
# 2 matches:
# similarity   match
# ----------   -----
# 100.0%       gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD...
#  45.0%       gi|396576808|emb|FR775895.2|__Enterobacteria__phage__phi9...
sourmash gather test.sig phages.sbt.json
# overlap     p_query p_match
# ---------   ------- -------
# 147.2 kbp    100.0%  100.0%    gi|1001940386|gb|KU522583.1|__Enterob...
# 
# found 1 matches total;
# the recovered matches hit 100.0% of the query
replikation commented 4 years ago
replikation commented 4 years ago
phiweger commented 4 years ago

@replikation the multiline2singleline awk magic is just so I can create a quick test file. it is not required for sourmash, but I just used it to generate a full example.

replikation commented 4 years ago