replikation commented 4 years ago

sourmash / kmer approach implementation

not sure about the quality but I want to test this

database

we need to pre-build a phage DB @stormrider can you link collect DBs for this and report it here?

implementation

depending on the size we could do either pre-prepare a DB our compute it live while WtP is running to be always most recent (via efetch)

replikation commented 4 years ago

see also #6

phiweger commented 4 years ago

I can take care of this. Question is: Would you like to (1) just "sourmash search" through the database (I'll use the sequence Bloom tree for faster search) OR (2) use "sourmash gather" @replikation @Stormrider935 -- the difference is that (1) returns independent pairwise similarities for all genomes in the database, while the latter will "remove" those kmers from the query contigs, that have been confidently assigned to genomes. call it the "winner takes it all" strategy.

phiweger commented 4 years ago

Also: is there already a list of accessions from #6 ?

replikation commented 4 years ago

@phiweger thank you very much
@phiweger good question, I don't have experience on kmer results for "just" phages. So I am unsure if a "winner takes all" approach fits phages, at least for the prediction part (if a contig contains a phage)
in any case, the goal is to add this approach as "another" tool to the existing 6 phage prediction tools, maybe via search - as we can control and build in this case easily the phage database
it might also be useful after the "phage predictions" to analyse the positive hits if they match against known phages (via gather)

@Stormrider935 will link today all the data he has in here.

mult1fractal commented 4 years ago

I found the "database" (Its a multi-fasta-file) in this paper

Here the downloadlink for the phage-genomes they used: Phage_genomes

they have a Pro-Phage-genome-Database (ALCAME-database) 759 complete genomes and a phage-genome-Database (acquired from NCBI nucleotide) 3319 complete genomes

Format of the Pro-Phage-genome-multi fasta phage_genome_fromACLAME.fasta

>mge:2008.1 Rhodobacter sphaeroides ATCC 17025, complete genome.
TTACGTCACCTCGTAGAGAGACCAGCGGACGACGCCCTGGATCCGTCTCAGGTATTGGTC
ACCTCTGAATGCCCACTTGAACCGGATCGTGGCACCGCCGAACCAGGTCGTGCGGTTGAT

Format of the Phage-genome-multi fasta phage_genome_fromNCBI.fasta

>gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD1,__complete__genome
GCTACCTGGTGCACCGTGCCGCCTTGCCAGCCCAATGAGTGCATGAGATGTAAAGGAGTATTTGTCATGG
TATCGCCCTTTCTTTGATGTGCGCCCTGGTTGGCGCGGTAGAAAGATAATAGCCGTTTCCTTTATAGATA

I can extract the accession numbers from the Phage-genome-multi fasta and create an accession number list like suggested in #6

Update: Here the accession number list for the phage_genome_fromNCBI.fasta accession_number_phage_genome_fromNCBI.txt

the database is from October 2016... I will try to prepare a more up-to-date database with the method they used for this database.

hoelzer commented 4 years ago

@Stormrider935 if there are any issues or we have questions about how they generated the 'databases', I have some connection to the last author (Gianni, HKI Jena) so could also ask him

mult1fractal commented 4 years ago

i backtracked the paper where they got the Idea from. Thank you @hoelzer for the offer ;)

phiweger commented 4 years ago

so here is my proposed code for searching phages w/ sourmash -- somebody please put this into some nextflow process :)

k=21 is more sensitive than the default 31, but we expect more distance btw/ phage genomes, so less is better.

scaled=100 means one out of 100 kmers is sampled; for the average 100k nt genome size for phages this makes for a signature size of (reasonable) 1k kmers per genome.

the sbt .. sequence bloom tree is just an index for fast search.

for assembled metagenomes I'd use sourmash searchand for unassembled ones sourmash gather (see docs for what's the delta).

conda create -y -n phages python=3.7 && conda activate phages
conda install -y sourmash=2.3.0

wget http://147.8.185.62/VirMiner/downloads/phage_genome/public_phage_genomes.fasta
DB=public_phage_genomes.fasta
SCALE=100
KSIZE=21

sourmash compute --scaled $SCALE -k $KSIZE --singleton --seed 42 -p 8 -o phages.sig $DB
# calculated 4078 signatures for 4078 sequences in public_phage_genomes.fasta
sourmash index phages phages.sig

# create test seq, but first:
# turn multiline fasta into fmt where each sequence is on a single line
# https://www.biostars.org/p/9262/
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $DB | head -n 3 | tail -n 2 > test.fasta

sourmash compute --scaled $SCALE -k $KSIZE -o test.sig test.fasta
sourmash search test.sig phages.sbt.json
# 2 matches:
# similarity   match
# ----------   -----
# 100.0%       gi|1001940386|gb|KU522583.1|__Enterobacteria__phage__ECGD...
#  45.0%       gi|396576808|emb|FR775895.2|__Enterobacteria__phage__phi9...
sourmash gather test.sig phages.sbt.json
# overlap     p_query p_match
# ---------   ------- -------
# 147.2 kbp    100.0%  100.0%    gi|1001940386|gb|KU522583.1|__Enterob...
# 
# found 1 matches total;
# the recovered matches hit 100.0% of the query

replikation commented 4 years ago

cool thanks for the work, but I don't get the linearisation part of the fasta to one-liner
is sourmash not able to cope with multiline fastas?

replikation commented 4 years ago

my proposal would be now to add a dedicated database module which creates the DB via sourmash
in this way we could also add more data to the module/DB, e.g. via efetch from NCBI
also makes sense due to the way how nextflow handles things for parallelisation (e.g. use DB once and give it each sourmash process)
we add two sourmash modules into WtP, one for identification (like the other current six tools) and one to trying to type positive hits after samtools modules extract positive hits
in this way we could later separate predicted phages contigs via sourmash into two groups "highly similar to a known phage", "phages currently not available via DB".

phiweger commented 4 years ago

@replikation the multiline2singleline awk magic is just so I can create a quick test file. it is not required for sourmash, but I just used it to generate a full example.

replikation commented 4 years ago

included in 6f8266a

replikation / What_the_Phage

sourmash with kmer based json DB #40

sourmash / kmer approach implementation

database

implementation