sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 79 forks source link

query was calculated with scaled, tree was not. #1273

Closed J-I-P closed 3 years ago

J-I-P commented 3 years ago

Hi, I try to build my SBT database, but I get some problems. There is my command below:

  1. I use computes to saves signatures for NCBI sequence. find . -name '*' | xargs sourmash compute --output NCBI_genome_ALL image The picture is my output describe
  2. Index to build SBT database sourmash index -k 31 NCBI.sbt.zip NCBI_genome_ALL
  3. And then, gather the sequence sourmash gather zymoLOG_after_medaka.fasta.sig ../NCBI.sbt.json When I create zymoLOG_after_medaka.fasta.sig I used --scaled=1000.

But I get the result like below: image I get this response because when I compute the NCBI sequence didn't use --scaled? I should try to add --scaled=10000?

Thanks for any advice on this!

ctb commented 3 years ago

yes, I think so!

image I get this response because when I compute the NCBI sequence didn't use --scaled? I should try to add --scaled=10000?

yes, that should fix it!

best, --titus

J-I-P commented 3 years ago

Hi, @ctb, I have a doubt about my step 1. image I use compute to saves signatures for NCBI sequence. There are 182666 .gz files in this directory, but this picture shows computing signatures for files only 2921 .gz files. Is this correct? or something I get wrong? Thank you for your reply.

ctb commented 3 years ago

hmm, you might want to use --outdir instead of --output? if I'm reading the command correctly, and if I remember how xargs works, you are running sourmash compute on batches of files, and --output will combine each batch into a single signature.

I tried this command:

find . -name '*.fa' | xargs sourmash compute --scaled=1000 --output xyz

on a smaller collection of sequences with xyz/ being a directory and it failed, because --output needs to be a filename. So I'm not sure what's happening with your command - is NCBI.genomes.k31.s1000 a directory, or a file, or what?

This command worked fine:

find . -name '*.fa' | xargs sourmash compute --scaled=1000 --outdir xyz
J-I-P commented 3 years ago

hmm, you might want to use --outdir instead of --output? if I'm reading the command correctly, and if I remember how xargs works, you are running sourmash compute on batches of files, and --output will combine each batch into a single signature.

I tried this command:

find . -name '*.fa' | xargs sourmash compute --scaled=1000 --output xyz

on a smaller collection of sequences with xyz/ being a directory and it failed, because --output needs to be a filename. So I'm not sure what's happening with your command - is NCBI.genomes.k31.s1000 a directory, or a file, or what?

This command worked fine:

find . -name '*.fa' | xargs sourmash compute --scaled=1000 --outdir xyz

My NCBI.genomes.k31.s1000 is a file. If after I use --outdir, I also can use sourmash index to build a SBT database?

ctb commented 3 years ago

My NCBI.genomes.k31.s1000 is a file. If after I use --outdir, I also can use sourmash index to build SBT database?

yes!

J-I-P commented 3 years ago

Hi, @ctb
How you bulid refseq-k31.sbt.json database? I have a wear problem with my database (SBT). image The blue word is my diectory, and I index it to .sbt.zip. I try to build all .gz of genomes in NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/), how about your refseq-k31 database?

image This picture is using your database, and the result is correct and perfect!!

But when I use my database is like below: image The overlap is less than your result, so I think there is some trouble in my database. But I have no idea how to check about my .sbt.zip or .sig file?

Thanks for any advice on this!

ctb commented 3 years ago

hi @J-I-P,

hmm, interesting. To me it looks like everything is working fine technically, which makes me wonder if the database contents are different. Can you try doing a search for that first pseudomonas match from refseq in your database, and see what it reports?

I am not 100% clear on what is in RefSeq, vs what is in GenBank.

J-I-P commented 3 years ago

hi @J-I-P,

hmm, interesting. To me it looks like everything is working fine technically, which makes me wonder if the database contents are different. Can you try doing a search for that first pseudomonas match from refseq in your database, and see what it reports?

I am not 100% clear on what is in RefSeq, vs what is in GenBank.

image This is the first pseudomonas in your database, and its assembly is GCF_000504485.1 image So I try to search this assembly in my database directory before index, and it can find the .sig file.

ctb commented 3 years ago

ok - and when you do a sourmash gather on that specific genome signature in the new database you constructed, you don’t find that match right?

could you post the text versions (not screenshots, please :) of the commands you used to index the database?

thanks, —titus

On Jan 17, 2021, at 11:52 AM, Nicole notifications@github.com wrote:

hi @J-I-P,

hmm, interesting. To me it looks like everything is working fine technically, which makes me wonder if the database contents are different. Can you try doing a search for that first pseudomonas match from refseq in your database, and see what it reports?

I am not 100% clear on what is in RefSeq, vs what is in GenBank.

This is the first pseudomonas in your database, and its assembly is GCF_000504485.1

So I try to search this assembly in my database directory before index, and it can find the .sig file.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

J-I-P commented 3 years ago

There is my command: find . -name '*.sig' | xargs sourmash index -k 31 ../NCBI_test_dir.sbt.zip thanks for your reply. :)

ctb commented 3 years ago

hi nicole,

I think you’re running into the same problem as before, and xargs is only passing in a few signatures at a time.

Try instead:

sourmash index -k31 output.sbt.zip ./

which will load in all the signature files in the current directory.

best, —titus

On Jan 17, 2021, at 12:05 PM, Nicole notifications@github.com wrote:

There is my command: find . -name '*.sig' | xargs sourmash index -k 31 ../NCBI_test_dir.sbt.zip thanks for your reply. :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

J-I-P commented 3 years ago

But this command will occur ERROR like this : OSError: Error while reading signatures from './'.

ctb commented 3 years ago

On Sun, Jan 17, 2021 at 12:59:33PM -0800, Nicole wrote:

But this command will occur ERROR like this : OSError: Error while reading signatures from './'.

hmm, what version of sourmash are you using? 3.5.x should support that.

you might need to add --traverse-directory -f for your version.

--t

ctb commented 3 years ago

sorry, I was incorrect, the new behavior is on by default in only the pre-4.0 releases. In 3.5.0, you'll need to provide --traverse-directory to sourmash index in order to use ./ as an argument successfully.

J-I-P commented 3 years ago

I will try to use --traverse-directory -f what you suggest. Thanks for your advice. :) 👍

ctb commented 3 years ago

welcome & please let me know!

J-I-P commented 3 years ago

welcome & please let me know!

It works perfectly!!!:) But the result also has false-positive like Mash screen. That let me confused, both of these two tools use modulo to sketch the scale. The biggest difference is the reference database? SBT has more expandable than .msh? ( Mash screen rather than use a bloom filter, they use an exact, streaming method to identify which sketch values are found in the sample. Because they think bloom filter will cause false-positive) Is there another advantage of sourmash is better than Mash screen? Of course, you provide python API is more convenient. 👍 Thanks for advice on this!

ctb commented 3 years ago

Hi Nicole,

I’ll address the differences and similarities between mash and sourmash in another message, but our experience is that we don’t observe many false positives. How are you benchmarking the results?

We have some workflows in development that compare hash values (as output by sourmash) with mapped reads (using minimap2). I’d be happy to point you at them if you are interested in comparing the output of sourmash gather output and the mapping rates of metagenome reads to the references. However, at the moment they are pretty focused on mapping to NCBI genomes, so it wouldn’t be immediately usable for custom databases. (see github.com/dib-lab/genome-grist for the tool).

best, —titus

J-I-P commented 3 years ago

Hi Nicole, I’ll address the differences and similarities between mash and sourmash in another message, but our experience is that we don’t observe many false positives. How are you benchmarking the results? We have some workflows in development that compare hash values (as output by sourmash) with mapped reads (using minimap2). I’d be happy to point you at them if you are interested in comparing the output of sourmash gather output and the mapping rates of metagenome reads to the references. However, at the moment they are pretty focused on mapping to NCBI genomes, so it wouldn’t be immediately usable for custom databases. (see github.com/dib-lab/genome-grist for the tool). best, —titus

You don't observe many false-positives with default --threshold-bp (50kbp)? My test data is from [zymo community].(https://files.zymoresearch.com/protocols/_d6310_zymobiomics_microbial_community_standard_ii_(log_distribution).pdf) Before I use sourmash, I assembly it. This table is the composition of my data (metagenome) image I try to use sourmash gather with --threshold-bp to figure out that eight bacteria (Listeria monocytogenes (89.1%) , Pseudomonas aeruginosa (8.9%), Bacillus subtilis (0.89%), Escherichia coli (0.089%), Salmonella enterica (0.089%), Lactobacillus fermentum (0.089%), Enterococcus faecalis (0.00089%), Staphylococcus aureus (0.000089%)). When I use --threshold-bp by default, and use your refseq-database the result is only figured out Listeria monocytogenes, Pseudomonas aeruginosa, Bacillus subtilis, Escherichia coli, Salmonella enterica. Is that because the other three bacteria are too less reads to figure out? (or your ref-seq does not have other three bacteria data?)

J-I-P commented 3 years ago

Hi Nicole, I’ll address the differences and similarities between mash and sourmash in another message, but our experience is that we don’t observe many false positives. How are you benchmarking the results? We have some workflows in development that compare hash values (as output by sourmash) with mapped reads (using minimap2). I’d be happy to point you at them if you are interested in comparing the output of sourmash gather output and the mapping rates of metagenome reads to the references. However, at the moment they are pretty focused on mapping to NCBI genomes, so it wouldn’t be immediately usable for custom databases. (see github.com/dib-lab/genome-grist for the tool). best, —titus

You don't observe many false-positives with default --threshold-bp (50kbp)? My test data is from [zymo community].(https://files.zymoresearch.com/protocols/_d6310_zymobiomics_microbial_community_standard_ii_(log_distribution).pdf) Before I use sourmash, I assembly it. This table is the composition of my data (metagenome) image I try to use sourmash gather with --threshold-bp to figure out that eight bacteria (Listeria monocytogenes (89.1%) , Pseudomonas aeruginosa (8.9%), Bacillus subtilis (0.89%), Escherichia coli (0.089%), Salmonella enterica (0.089%), Lactobacillus fermentum (0.089%), Enterococcus faecalis (0.00089%), Staphylococcus aureus (0.000089%)). When I use --threshold-bp by default, and use your refseq-database the result is only figured out Listeria monocytogenes, Pseudomonas aeruginosa, Bacillus subtilis, Escherichia coli, Salmonella enterica. Is that because the other three bacteria are too less reads to figure out? (or your ref-seq does not have other three bacteria data?)

When I use sourmash gather with --threshold-bp=0 and refseq-database, that three species is not in the result, but occur another species (except that eigth species)

J-I-P commented 3 years ago

And I'm so curious that how you reduce false-positive when you use bloom filter? (use more hash functions or some other method?)

ctb commented 3 years ago

so many questions 😆

let's see -

first, there's an excellent description of sourmash (with comparisons to mash screen and cmash) in @luizirber thesis, here. Chapters 1-3 are the most relevant. We are working on putting that into a publication, but it's still in early draft stage here.

Some specific answers -

Bloom filters do present false positives, but with SBTs we use them as an aid in searching, not as the final match. So the list of signatures you get with scores and so on are the actual hash matches, not anything to do with Bloom filters. You can try this out by using --save-matches with sourmash gather, which will output just a list of the matching signatures; you'll get the same result running sourmash gather <query.sig> <saved_matches.sig> as you do running sourmash gather <query.sig> <database.sbt.zip>. (If not, it's a bug ;).

mash uses MinHash, sourmash with scaled uses a variant of modulo hash. They are roughly equivalent for similarity calculations, but sourmash with scaled allows containment searches on already-sketched databases of genomes, while mash screen does a separate thing that requires re-sketching the raw data.

Our results with genome-grist and in the sourmash draft paper, above, confirm to us that we in general do not get false positives with sourmash. (You wouldn't expect to, statistically, either!) If anything we worry more about false negatives due to the very high specificity of k-mers.

I haven't looked at the Zymo community myself, but a few notes -

I hope that all helps :).

The biggest thing I would suggest is redoing the analysis without an assembled metagenome, i.e. by recalculating the query on the raw reads.

The second biggest thing I would suggest (by way of benchmarking...) is to compare read mapping of the metagenome to the genomes that are supposed to be there to the mapping to the genomes that sourmash is finding. I would expect the results to match w/sourmash's results, i.e. that sourmash is finding strains and genomes that are really there. (This is a lot of work tho!)

ctb commented 3 years ago

p.s. if you can give me the .sig file calculated from the raw reads with -k 31 --scaled=1000 I can run it against all 700,000 genbank genomes and give you the mapping breakdown to the sourmash matches using genome-grist. Just drop me the signature file at titus@idyll.org.

ctb commented 3 years ago

Closing for now.