sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 80 forks source link

manifest file is not consistent between `sourmash sig manifest` and `sourmash sketch fromfile` #2749

Open chunyuma opened 1 year ago

chunyuma commented 1 year ago

Hello,

This issue seems a bug but I'm not sure.

I recently utilized sourmash sketch fromfile command to generate a custom database. However, the genome set may have some duplicates. After I used sourmash sketch fromfile to generate a database.zip file, I used sourmash sig manifest to extract its manifest metadata file. I also try to unzip the database.zip file and there is a SOURMASH-MANIFEST.csv file.

One interesting thing I found is that the row in the manifest file generated from sourmash sig manifest is different from that in SOURMASH-MANIFEST.csv file. I found that those missing rows are duplicated genomes (the genomes with the same md5). I'm wondering if these duplicated genomes will be used in the downstream sourmash command. Thanks!

ctb commented 1 year ago

thanks @chunyuma I'm not sure what exactly is going on but I'd love to help debug!

It sounds like it may be the third item from https://github.com/sourmash-bio/sourmash/issues/1849, to whit:

third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is ZipFileLinearIndex, where if the manifest was generated with traverse_all_files, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out of sourmash sig fileinfo depending on whether you've asked it to regenerate the manifest or not: for example, if you're looking at tests/test-data/prot/all.zip, the included manifest does contain dna-sig.noext, but if you regenerate the manifest from an index loaded without traverse_all_files=True, you'll exclude it. See the test_fileinfo_4_zip* tests as well as the test_sig_manifest_7_allzip tests for tests that explore this behavior.

but I'm not sure why that would happen internally to sourmash!

If you can paste a few example lines in that differ b/t the manifests that might be helpful?

note that some manifest behavior is just weird and dumb - e.g. sig manifest regenerates the manifest by default !? per https://github.com/sourmash-bio/sourmash/issues/2034 - and that's all my fault 😭 .

ref also https://github.com/sourmash-bio/sourmash/issues/2033 re fromfile.

chunyuma commented 1 year ago

thanks for your help in debugging, @ctb.

Here is a simple example that might help you for debug:

In this example, I have two genomes that are quite close in taxonomy. After I built a custom database based on these two genomes via:

sourmash sketch fromfile datasets.csv -p k=31,scaled=100,dna,abund -o database.zip

I unzip the database.zip file and got a SOURMASH-MANIFEST.csv file:

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz_0,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome2,/home/xxx/genome2.fna.gz

As you can see, sourmash generates the same md5 for them.

When I looked at the files under the signatures folder that generated by unzipping database.zip, I found that there are two signature file: one's name is 331a7076f08e2b1d671aedf70cd63230.sig.gz and another is 331a7076f08e2b1d671aedf70cd63230.sig.gz_0.

However, when I did sourmash sig manifest data.zip -o manifest.csv, I had the following information:

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta

Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream sourmash analysis (e.g., sourmash compare). Thanks!

ctb commented 1 year ago

yep, this inconsistency is concerning :). Looking into it! I'm tempted to say it's going to be fixed by https://github.com/sourmash-bio/sourmash/pull/2747 but I think I may be being overoptimistic 😆

Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream sourmash analysis (e.g., sourmash compare). Thanks!

If you're using the zip file with the "correct" (dual entry) SOURMASH-MANIFEST.csv then they should both be used!

ctb commented 1 year ago

I can replicate:

cp podar-ref/47.fa dup/47a.fa
cp podar-ref/47.fa dup/47b.fa
cd dup
sourmash sketch dna -p k=31 *
sourmash sig cat *.sig -o dup.zip
sourmash sig summarize dup.zip # shows 2 sketches in zip
unzip dup.zip
cat SOURMASH-MANIFEST.csv
# shows 2 sketches in CSV

sourmash sig manifest dup.zip -o mf.csv
cat mf.csv
# shows 1 sketch
ctb commented 1 year ago

Ooh, this is a nifty bug all right. 🤯

Briefly, ZipStorage is storing the duplicate filename in a way that is incompatible with the default manifest-independent loading code use in ZipFileLinearIndex -

signatures/09a08691ce52952152f0e866a59f6261.sig.gz_0

instead of

signatures/09a08691ce52952152f0e866a59f6261_0.sig.gz

This is only triggered in unusual circumstances but is still a bug... gotta think on resolution!

ctb commented 1 year ago

OK, confirmed the underlying code - the FSStorage and ZipStorage classes in sbt_storage.py both avoid overwriting existing files by appending _{n} to filenames until there is no existing filename. This causes problems in code that relies on specific filename extensions (which is one reason we often try to avoid that, although perspectives differ ;).

Regardless, this behavior is the root cause of the discrepancy noted here! (And this means this problem should crop up with FSStorage, too.)

It doesn't matter at all for manifests generated when saving sketches, because there the correct filename is recorded (and if a filename is in the manifest, it doesn't matter what the filename is - it's loaded as a signature file).

I see a few different possible resolutions -

I'm leaning towards one of the first two now, and also the third for v5? I don't think either of the first two changes would break anything in sourmash currently, but of course... we'll have to see! I'll poke around at this as I have time.

Either way @chunyuma I understand the problem and hopefully it won't affect your work - and if you need sig manifest to work consistently, either use -f or --no-rebuild-manifest and you should get consistent manifests out. Thanks again for reporting this!!

chunyuma commented 1 year ago

Thanks @ctb for your hard working on this issue. I'm glad to figure out the issue and find the potential solutions for it. And thanks for your suggestions.