Open chunyuma opened 1 year ago
thanks @chunyuma I'm not sure what exactly is going on but I'd love to help debug!
It sounds like it may be the third item from https://github.com/sourmash-bio/sourmash/issues/1849, to whit:
third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is
ZipFileLinearIndex
, where if the manifest was generated withtraverse_all_files
, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out ofsourmash sig fileinfo
depending on whether you've asked it to regenerate the manifest or not: for example, if you're looking attests/test-data/prot/all.zip
, the included manifest does containdna-sig.noext
, but if you regenerate the manifest from an index loaded withouttraverse_all_files=True
, you'll exclude it. See thetest_fileinfo_4_zip*
tests as well as thetest_sig_manifest_7_allzip
tests for tests that explore this behavior.
but I'm not sure why that would happen internally to sourmash!
If you can paste a few example lines in that differ b/t the manifests that might be helpful?
note that some manifest behavior is just weird and dumb - e.g. sig manifest
regenerates the manifest by default !? per https://github.com/sourmash-bio/sourmash/issues/2034 - and that's all my fault 😠.
ref also https://github.com/sourmash-bio/sourmash/issues/2033 re fromfile
.
thanks for your help in debugging, @ctb.
Here is a simple example that might help you for debug:
In this example, I have two genomes that are quite close in taxonomy. After I built a custom database based on these two genomes via:
sourmash sketch fromfile datasets.csv -p k=31,scaled=100,dna,abund -o database.zip
I unzip the database.zip
file and got a SOURMASH-MANIFEST.csv
file:
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz_0,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome2,/home/xxx/genome2.fna.gz
As you can see, sourmash generates the same md5 for them.
When I looked at the files under the signatures
folder that generated by unzipping database.zip, I found that there are two signature file: one's name is 331a7076f08e2b1d671aedf70cd63230.sig.gz
and another is 331a7076f08e2b1d671aedf70cd63230.sig.gz_0
.
However, when I did sourmash sig manifest data.zip -o manifest.csv
, I had the following information:
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta
Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream sourmash
analysis (e.g., sourmash compare). Thanks!
yep, this inconsistency is concerning :). Looking into it! I'm tempted to say it's going to be fixed by https://github.com/sourmash-bio/sourmash/pull/2747 but I think I may be being overoptimistic 😆
Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream
sourmash
analysis (e.g., sourmash compare). Thanks!
If you're using the zip file with the "correct" (dual entry) SOURMASH-MANIFEST.csv
then they should both be used!
I can replicate:
cp podar-ref/47.fa dup/47a.fa
cp podar-ref/47.fa dup/47b.fa
cd dup
sourmash sketch dna -p k=31 *
sourmash sig cat *.sig -o dup.zip
sourmash sig summarize dup.zip # shows 2 sketches in zip
unzip dup.zip
cat SOURMASH-MANIFEST.csv
# shows 2 sketches in CSV
sourmash sig manifest dup.zip -o mf.csv
cat mf.csv
# shows 1 sketch
Ooh, this is a nifty bug all right. 🤯
Briefly, ZipStorage
is storing the duplicate filename in a way that is incompatible with the default manifest-independent loading code use in ZipFileLinearIndex
-
signatures/09a08691ce52952152f0e866a59f6261.sig.gz_0
instead of
signatures/09a08691ce52952152f0e866a59f6261_0.sig.gz
This is only triggered in unusual circumstances but is still a bug... gotta think on resolution!
OK, confirmed the underlying code - the FSStorage
and ZipStorage
classes in sbt_storage.py
both avoid overwriting existing files by appending _{n}
to filenames until there is no existing filename. This causes problems in code that relies on specific filename extensions (which is one reason we often try to avoid that, although perspectives differ ;).
Regardless, this behavior is the root cause of the discrepancy noted here!
(And this means this problem should crop up with FSStorage
, too.)
It doesn't matter at all for manifests generated when saving sketches, because there the correct filename is recorded (and if a filename is in the manifest, it doesn't matter what the filename is - it's loaded as a signature file).
I see a few different possible resolutions -
{n}_
as a prefix rather than _{n}
as a suffix. Suboptimal for directory listings but for big collections those aren't useful anyway. This is a general fix.{n}
internally, e.g. at end of first component of filename, or something - basically, avoid changing the extension - this is a general fix. It costs more compute than the previous fix, though, because we'll have to inspect the filename in some way.traverse_yield_all
is True, so that when creating a manifest it always tries to load the file, independent of filename. This would break semantic versioning so would need to wait 'til v5. Could be part of behavior change to sig manifest
in https://github.com/sourmash-bio/sourmash/issues/2034.I'm leaning towards one of the first two now, and also the third for v5? I don't think either of the first two changes would break anything in sourmash currently, but of course... we'll have to see! I'll poke around at this as I have time.
Either way @chunyuma I understand the problem and hopefully it won't affect your work - and if you need sig manifest
to work consistently, either use -f
or --no-rebuild-manifest
and you should get consistent manifests out. Thanks again for reporting this!!
Thanks @ctb for your hard working on this issue. I'm glad to figure out the issue and find the potential solutions for it. And thanks for your suggestions.
Hello,
This issue seems a bug but I'm not sure.
I recently utilized
sourmash sketch fromfile
command to generate a custom database. However, the genome set may have some duplicates. After I usedsourmash sketch fromfile
to generate a database.zip file, I usedsourmash sig manifest
to extract its manifest metadata file. I also try to unzip the database.zip file and there is aSOURMASH-MANIFEST.csv
file.One interesting thing I found is that the row in the manifest file generated from
sourmash sig manifest
is different from that inSOURMASH-MANIFEST.csv
file. I found that those missing rows are duplicated genomes (the genomes with the same md5). I'm wondering if these duplicated genomes will be used in the downstreamsourmash
command. Thanks!