Closed ctb closed 2 years ago
insert
was implemented in #946.
Maybe Index
objects should support remove, and then we can add command line options for dealing with that.
ref #949 maybe? and/or have a general CLI submodule for dealing with databases.
Maybe
Index
objects should support remove, and then we can add command line options for dealing with that.
Intriguing. SBTs can support remove too (might need to think how to keep them dense, tho), and for the LinearIndex it's trivial. And for future indices that don't support it, throw an error?
ref #949 maybe? and/or have a general CLI submodule for dealing with databases.
I thought about using index
as that CLI submodule, but it would complicate the current use case too much.
Other potential operations in SBTs that would fit a database CLI:
migrate
(from going from a previous SBT version to a new one)convert
(from one database format to another)prepare
(from a leaf-only SBT to a full SBT)rescale
? (although we can pass a different --scaled
during query)k
too. If it was indexed with k=21
, all the internal nodes have those k
, but the tree structure can be used for k=51
too (but need to rebuild all internal nodes)could also provide a screen or mask option, which would be useful for benchmarking / hold-one-out studies (I kinda need this for charcoal). see also #985.
for testing/evaluation purposes where performance is not a big issue, using search --containment
instead of gather
and then removing matches based on md5sum should address my needs for e.g. hold-one-out.
random thoughts from today. for testing/evaluation --
This would be very helpful for me right now. I have a really strange issue with an sbt.zip db that I built: although I am only passing unique signatures to sourmash index
(as confirmed with sort | uniq -c
run on the input file that contains the paths to all the signatures: genomes/all/GCA/gca_latest_genomic_sig_minus1.txt
), I get astronomical duplication of some signatures in my db:
With v3.4.1rc1, I set up my db as follows:
sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT/GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.sig --from-file genomes/all/GCA/gca_latest_genomic_sig_minus1.txt
After setting up my db and finding that sourmash gather
works, I run:
sourmash sig describe /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip > gca_genbank.sbt.zip.manifest #put a list of the signatures into a text file
grep "^signature:" gca_genbank.sbt.zip.manifest | awk '{print $2}' | sort | uniq -c | sort -nr > gca_accession.count #get a count of duplicated accession numbers from signature names
(I did the above, because while trying to build an lca db from the signatures in the sbt.zip, sourmash lca index
complained that I had signatures with duplicate names and failed.)
And if I head
the file with counts of the accessions (calculated by uniq -c
), I get this:
1261 MG522850.1
1231 KY084478.1
713 MF677845.1
317 KX266637.1
268 ODAX01000001.1
161 MRLE01000001.1
160 AAITDJ010000001.1
132 AAJDZY010000001.1
107 CP018035.1
98 CP013906.1
That is, in two cases - accession # MG522850.1 and KY084478.1 - more than 1,000 replicated signatures. There are 14,680 signatures that are at least duplicated, and the remaining 707,167 are unique signatures, for a total of 721,847 signatures (from the NCBI genomes GCA* db).
So, it would be really handy to be able to remove signatures, especially en masse or with reg-ex after loading the sbt.zip db into memory once. Can you please suggest an efficient way to do this with the Python API? Thanks!
first of all - wow, that's a big database 👍
second of all - where did the duplicate accessions come from!? either sbt index
is adding signatures multiple times, or the duplicate accessions are in the input signatures... you checked the filenames, but do you know if there are different files with the same accession?
third, a stopgap measure could be to do sourmash sig cat --unique
which will eliminate duplicate signatures based on their hash content.
And if I
head
the file with counts of the accessions (calculated byuniq -c
), I get this:1261 MG522850.1 1231 KY084478.1 713 MF677845.1 317 KX266637.1 268 ODAX01000001.1 161 MRLE01000001.1 160 AAITDJ010000001.1 132 AAJDZY010000001.1 107 CP018035.1 98 CP013906.1
That is, in two cases - accession # MG522850.1 and KY084478.1 - more than 1,000 replicated signatures. There are 14,680 signatures that are at least duplicated, and the remaining 707,167 are unique signatures, for a total of 721,847 signatures (from the NCBI genomes GCA* db).
Looking at these names... are you using --name-from-first
to build the signatures? I moved away from that in sourmash_databases because it generated a lot of confusion...
For debugging purposes, can you also check the head -1
for all the FASTA files, and see if they are all unique? And, if they aren't, if they match the numbers above?
Hi, so overnight I ran this:
time while read P; do zcat $P | head -1; done < gca_latest_genomic_fna.txt > gca_latest_genomic_fna_accessions.txt
where gca_latest_genomic_fna.txt
is a textfile containing >750k fna.gz files over which I ran sourmash compute
. After 3 hours that finished and I looked at the contents of the result this morning and got:
cut -c 2- gca_latest_genomic_fna_accessions.txt | awk '{print $1}' | sort | uniq -c | sort -nr | head -100
3 EU339936.1
2 X68295.1
2 MUGM01000001.1
2 MK380015.1
2 MK380014.1
2 MK072507.1
2 MK072498.1
2 MK072489.1
2 MK072437.1
2 MK072383.1
2 MK072332.1
2 MK072243.1
2 MK072199.1
2 MK072132.1
2 MK072066.1
2 MK072042.1
2 MK071998.1
2 MK071979.1
2 MH614262.1
2 MH614261.1
2 MH171300.1
2 MG747435.1
2 MF477236.1
2 LYWJ01000001.1
2 LN879430.1
2 LN850107.1
2 LN813019.1
2 KJ944830.1
2 JX181822.1
2 JX181814.1
2 DQ347950.1
2 CP009805.1
2 CP009075.1
2 CP004500.1
2 CP004495.1
2 CP004493.1
2 CP004492.1
2 CP004484.1
2 CP004479.1
2 CP004472.1
2 CP004468.1
2 CP004466.1
2 CP004431.1
2 CM000462.1
2 AP018280.1
1 Z98198.1
1 Z97873.1
1 Z86099.2
1 Z73124.1
1 Z69910.1
1 Z69620.1
1 Z66548.1
1 Z48731.1
1 Z48630.1
1 Z48506.1
1 Z48182.1
1 Z47794.1
1 Z46351.1
1 Z26920.1
1 Z25771.1
1 Z24758.1
1 Z21647.1
1 Z18946.1
1 Z11866.1
1 Z11128.1
1 Y18263.1
1 Y17873.1
1 Y17017.1
1 Y17014.1
1 Y16780.1
1 Y16627.1
1 Y16104.1
1 Y15937.2
1 Y15936.2
1 Y15934.1
1 Y15176.1
1 Y15175.1
1 Y15174.1
1 Y15173.1
1 Y15034.1
1 Y14874.1
1 Y14700.1
1 Y14570.1
1 Y14365.1
1 Y13918.1
1 Y13463.1
1 Y13184.1
1 Y13051.1
1 Y12083.1
1 Y11604.1
1 Y11530.1
1 Y11099.1
1 Y11097.1
1 Y11023.2
1 Y10973.1
1 Y10237.1
1 Y09921.1
1 Y09854.1
1 Y09762.1
1 Y09598.1
So there are a few duplicated accession numbers in the fna.gz files, but not thousands of them.
Next step is running sourmash sig describe
over the individual sigs computed by sourmash compute
on each of the individual fna.gz files. That will take a few hours and I may be away from my local server for a week before I can report results from that check.
Re:
Looking at these names... are you using --name-from-first to build the signatures? I moved away from that in sourmash_databases because it generated a lot of confusion...
Yes, I ran with --name-from-first.
Here is my sourmash compute
command:
cat /data/ncbi/genomes/all/GCA/gca_latest_genomic_fna.txt | parallel -j 60 'sourmash compute -k 21,31,51 -f --name-from-first --track-abundance --scaled 1000 --output "{= s:\.[^.]+$::;s:\.[^.]+$::; =}".sig {}'
Well, one thing seems clear; I'm not using current best practices. A lot of what I've done is based on what I read in the docs here: https://sourmash.readthedocs.io/en/latest/
and the walkthrough here: https://github.com/dib-lab/2018-ncbi-lineages
but it seems like I should be following: https://github.com/dib-lab/sourmash_databases
Does that seem accurate?
This is not your fault... We didn't move the new way from sourmash_databases into the docs, and the PR isn't even merged there...
But most of the changes in the PR are because --name-from-first
makes it hard to figure out taxonomy later. I modified sourmash_databases to set a custom name based on "Accession, name, strain, assembly" because then we have consistent accessions to map to taxonomy IDs later (and the other info is useful for display). And because it is now based on the assembly_summary.txt
it is also easier to rebuild/update later (because you just need an assembly_summary.txt
to get all the info/download all the genomes)
Re:
And because it is now based on the assembly_summary.txt it is also easier to rebuild/update later (because you just need an assembly_summary.txt to get all the info/download all the genomes)
This is an excellent idea. I started using that file myself to build my own mapping file rather than the 2018-ncbi... repository walkthrough, because all the data is right there in a single file, separate columns: accession, taxid, etc
Re:
This is not your fault... We didn't move the new way from sourmash_databases into the docs, and the PR isn't even merged there...
I don't mind one way or the other if it's my fault; I make lots of mistakes and I'm happy to throw more onto the pile, haha. I just wanted you to have some idea of what "fresh eyes" see when they come to sourmash. It might help with docs, etc. I will start working from sourmash_databases and see what I can manage...
Thanks for your help!!
Hello, I have the results from doing the following (from above):
Next step is running sourmash sig describe over the individual sigs computed by sourmash compute on each of the individual fna.gz files. That will take a few hours and I may be away from my local server for a week before I can report results from that check.
3 EU339936.1 Rhynchosia golden mosaic virus strain 1045 segment DNA-A, complete sequence
2 X68295.1 Mycobacteriophage I3 gene for structural protein (70 KD)
2 MUGM01000001.1 Calypte anna isolate BGI_N300 000239F, whole genome shotgun sequence
2 MK380015.1 Klebsiella phage Kund-ULIP47, complete genome
2 MK380014.1 Klebsiella phage K1-ULIP33, complete genome
2 MK072507.1 Sylvanvirus sp. clone Sylvanvirus_1 genomic sequence
2 MK072498.1 Solumvirus sp. clone Solumvirus_1 genomic sequence
2 MK072489.1 Solivirus sp. clone Solivirus_1 genomic sequence
2 MK072437.1 Satyrvirus sp. clone Satyrvirus_1 genomic sequence
2 MK072383.1 Hyperionvirus sp. clone Hyperionvirus_1 genomic sequence
2 MK072332.1 Homavirus sp. clone Homavirus_1 genomic sequence
2 MK072243.1 Harvfovirus sp. clone Harvfovirus_1 genomic sequence
2 MK072199.1 Gaeavirus sp. clone Gaeavirus_1 genomic sequence
2 MK072132.1 Faunusvirus sp. clone Faunusvirus_1 genomic sequence
2 MK072066.1 Edafosvirus sp. clone Edafosvirus_1 genomic sequence
2 MK072042.1 Dasosvirus sp. clone Dasosvirus_1 genomic sequence
2 MK071998.1 Barrevirus sp. clone Barrevirus_1 genomic sequence
2 MK071979.1 Terrestrivirus sp. clone Terrestrivirus_1 genomic sequence
2 MH614262.1 Bird's-foot trefoil nucleorhabdovirus isolate LC, complete genome
2 MH614261.1 Bird's-foot trefoil enamovirus 1 isolate LC, complete genome
2 MH171300.1 Marine RNA virus BC-4 non-structural polyprotein and structural polyprotein genes, complete cds
2 MF477236.1 Nocardia phage NTR1, complete genome
2 LYWJ01000001.1 Homo sapiens isolate AK1 A_00590001_001, whole genome shotgun sequence
2 LN879430.1 Herbinix sp. SD1D genome assembly SD1D, chromosome : I
2 LN850107.1 Alloactinosynnema sp. L-07 genome assembly Alloactinosynnema sp. L-07, chromosome : I
2 LN813019.1 Halomonas sp. R57-5 genome assembly HalomonasR57-5, chromosome : I
2 KJ944830.1 Pseudoalteromonas phage B8b, partial genome
2 JX181822.1 Salmonella phage SPT-1, partial genome
2 JX181814.1 Salmonella phage SBA-1781, partial genome
2 DQ347950.1 Rhynchosia golden mosaic virus isolate Soybean 1068 segment DNA-A, complete sequence
2 CP009805.1 Botrytis cinerea B05.10 chromosome BCIN01, complete sequence
2 CP009075.1 Verticillium dahliae JR2 chromosome 1, complete sequence
2 CP004500.1 Saccharomyces cerevisiae YJM1592 chromosome I genomic sequence
2 CP004495.1 Saccharomyces cerevisiae YJM1526 chromosome I genomic sequence
2 CP004493.1 Saccharomyces cerevisiae YJM1478 chromosome I genomic sequence
2 CP004492.1 Saccharomyces cerevisiae YJM1477 chromosome I genomic sequence
2 CP004484.1 Saccharomyces cerevisiae YJM1434 chromosome I genomic sequence
2 CP004479.1 Saccharomyces cerevisiae YJM1415 chromosome I genomic sequence
2 CP004472.1 Saccharomyces cerevisiae YJM1387 chromosome I genomic sequence
2 CP004468.1 Saccharomyces cerevisiae YJM1381 chromosome I genomic sequence
2 CP004466.1 Saccharomyces cerevisiae YJM1355 chromosome I genomic sequence
2 CP004431.1 Saccharomyces cerevisiae YJM683 chromosome I genomic sequence
2 CM000462.1 Homo sapiens chromosome 1, whole genome shotgun sequence
2 AP018280.1 Calothrix sp. NIES-4101 DNA, complete genome
1 Z98198.1 Peanut stunt virus satellite RNA (P4-satRNA)
1 Z97873.1 Beet soil-borne virus RNA1 for 145 kDa protein, and 204 kDa protein
1 Z86099.2 Herpes simplex virus type 2 (strain HG52), complete genome
1 Z73124.1 Sweetpotato mild mottle virus mRNA for polyprotein
1 Z69910.1 Groundnut rosette virus complete genome, strain MC1
1 Z69620.1 European brown hare syndrome virus RNA
1 Z66548.1 Puumala virus segment L, genomic RNA, strain Sotkamo
1 Z48731.1 Human immunodeficiency virus type 2 gag, pol, vif, vpx, vpr, tat, rev, nef and env genes
1 Z48630.1 Cocksfoot Mottle Virus genes for polyprotein, RNA dependent RNA polymerase and coat protein
1 Z48506.1 Brome streak mosaic rymovirus polyprotein RNA
1 Z48182.1 Tomato leaf curl virus - Bangalore I V1, V2, C1, C2, C3 and C4 genes
1 Z47794.1 Streptococcus phage Cp-1 DNA, complete genome
1 Z46351.1 Lychnis ringspot virus RNA for beta-A, beta-B, beta-C, beta-D proteins
1 Z26920.1 Johnson grass mosaic virus gene for protease 1 and 3, helper component 6K protein, coat protein, nuclear inclusion proteins
1 Z25771.1 Human astrovirus type 1 genes for capsid protein and nonstructural protein
1 Z24758.1 Indian cassava mosaic virus encoding AR0 complete CDS
1 Z21647.1 P.asiatica mosaic virus genomic RNA
1 Z18946.1 Mycobacterium phage L5 complete genome
1 Z11866.1 Cladosporium fulvum T-1 virus LTR-retrotransposon encoding homologues to retroviral gag, pol, and env genes
1 Z11128.1 Friend murine leukemia virus FB29 complete genome
1 Y18263.1 Viral hemorrhagic septicemia virus Fil3 complete genome
1 Y17873.1 Thogoto virus genomic RNA for PB2 polymerase subunit
1 Y17017.1 Human T-cell lymphotropic virus type 1 DNA, long terminal repeat (patient Lib2)
1 Y17014.1 Human T-cell lymphotropic virus type 1 DNA, long terminal repeat (patient Efe1)
1 Y16780.1 variola minor virus complete genome
1 Y16627.1 Ovine enzootic nasal tumour virus complete sequence
1 Y16104.1 Physalis mottle tymovirus OP, RP, CP genes
1 Y15937.2 Sheep astrovirus, complete genome, genomic RNA
1 Y15936.2 Avastrovirus 1 complete genome, genomic RNA
The signature counts from the individual signatures computed with sourmash compute
match the counts of the fna files from which they were computed. So it's when I run sourmash index that the signatures are massively replicated:
sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT/GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.sig --from-file genomes/all/GCA/gca_latest_genomic_sig_minus1.txt
Thanks again for helping to solve this. Regarding using sourmash sig cat --unique
, that is a great idea (and a handy cli tool), but I'm afraid that the process is killed before it completes. I'll try another approach if you can suggest one, try to figure something out with the Python API, or I'll wait to see what happens with the sourmash index
command in future releases.
I actually decided to try running sourmash compute
without --name-from-first to see if @luizirber's instinct is correct. I will try to index the signatures named for the fna files from which they were computed next.
Thanks for digging down! I will check the DBs I built in https://github.com/dib-lab/sourmash_databases and see if they have duplicates too... If they do, there is a pretty large bug somewhere :fearful:
Hi, @luizirber, I'm happy to help where I can. So, I've gotten the results from running sourmash compute
without the --name-from-first
option over the original fasta files for all genbank assemblies using v3.4.1rc1 as in all above comments.
Here is that command:
cat /data/ncbi/genomes/all/GCA/gca_latest_genomic_fna.txt | parallel -j 60 'sourmash compute -k 21,31,51 --track-abundance --scaled 1000 --output "{= s:\.[^.]+$::;s:\.[^.]+$::; =}"_acc.sig {}'
Then I indexed those signatures (each named for the fna.gz filename from which the signature was computed) as follows:
time sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/013/183/675/GCA_013183675.1_ASM1318367v1/GCA_013183675.1_ASM1318367v1_genomic_acc.sig --from-file /data/ncbi/genomes/all/GCA/filename_sig_list_minus1.txt
And 9 hours after the indexing finishes, I run sourmash sig describe
to get a manifest of the signatures in the db and then run a shell pipeline to count the number of signatures:
sourmash sig describe /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip > filename_sig.sbt.zip.manifest
grep "^signature:" filename_sig.sbt.zip.manifest | awk '{print $2}' | sort | uniq -c | sort -nr > new_gca_accession.count #get counts of signatures
And the result (according to the counts, anyhow) are identical to what I saw when I ran sourmash compute
with the --name-from-first option and then indexed those signatures. So I don't believe that the --name-from-first option is responsible for this behavior. Here are the first few lines from the signature count file, new_gca_accession.count
:
1261 /data/ncbi/genomes/all/GCA/008/800/355/GCA_008800355.1_ASM880035v1/GCA_008800355.1_ASM880035v1_genomic.fna.gz
1231 /data/ncbi/genomes/all/GCA/004/066/835/GCA_004066835.1_ASM406683v1/GCA_004066835.1_ASM406683v1_genomic.fna.gz
713 /data/ncbi/genomes/all/GCA/004/066/735/GCA_004066735.1_ASM406673v1/GCA_004066735.1_ASM406673v1_genomic.fna.gz
317 /data/ncbi/genomes/all/GCA/002/536/245/GCA_002536245.1_ASM253624v1/GCA_002536245.1_ASM253624v1_genomic.fna.gz
268 /data/ncbi/genomes/all/GCA/900/219/735/GCA_900219735.1_SRS053214/GCA_900219735.1_SRS053214_genomic.fna.gz
161 /data/ncbi/genomes/all/GCA/010/078/015/GCA_010078015.1_ASM1007801v1/GCA_010078015.1_ASM1007801v1_genomic.fna.gz
160 /data/ncbi/genomes/all/GCA/008/759/475/GCA_008759475.1_PDT000593948.1/GCA_008759475.1_PDT000593948.1_genomic.fna.gz
132 /data/ncbi/genomes/all/GCA/006/946/905/GCA_006946905.1_PDT000192888.2/GCA_006946905.1_PDT000192888.2_genomic.fna.gz
107 /data/ncbi/genomes/all/GCA/008/821/865/GCA_008821865.1_ASM882186v1/GCA_008821865.1_ASM882186v1_genomic.fna.gz
98 /data/ncbi/genomes/all/GCA/008/821/745/GCA_008821745.1_ASM882174v1/GCA_008821745.1_ASM882174v1_genomic.fna.gz
98 /data/ncbi/genomes/all/GCA/008/378/965/GCA_008378965.1_PDT000583170.1/GCA_008378965.1_PDT000583170.1_genomic.fna.gz
72 /data/ncbi/genomes/all/GCA/006/441/735/GCA_006441735.1_ASM644173v1/GCA_006441735.1_ASM644173v1_genomic.fna.gz
64 /data/ncbi/genomes/all/GCA/010/406/965/GCA_010406965.1_PDT000256503.2/GCA_010406965.1_PDT000256503.2_genomic.fna.gz
60 /data/ncbi/genomes/all/GCA/902/719/495/GCA_902719495.1_12083_7_46/GCA_902719495.1_12083_7_46_genomic.fna.gz
59 /data/ncbi/genomes/all/GCA/009/948/425/GCA_009948425.1_ASM994842v1/GCA_009948425.1_ASM994842v1_genomic.fna.gz
57 /data/ncbi/genomes/all/GCA/006/164/165/GCA_006164165.1_PDT000241615.2/GCA_006164165.1_PDT000241615.2_genomic.fna.gz
55 /data/ncbi/genomes/all/GCA/005/536/435/GCA_005536435.1_PDT000333183.1/GCA_005536435.1_PDT000333183.1_genomic.fna.gz
53 /data/ncbi/genomes/all/GCA/006/701/085/GCA_006701085.1_PDT000317479.1/GCA_006701085.1_PDT000317479.1_genomic.fna.gz
52 /data/ncbi/genomes/all/GCA/008/896/965/GCA_008896965.1_PDT000600819.1/GCA_008896965.1_PDT000600819.1_genomic.fna.gz
52 /data/ncbi/genomes/all/GCA/008/845/205/GCA_008845205.1_PDT000599973.1/GCA_008845205.1_PDT000599973.1_genomic.fna.gz
52 /data/ncbi/genomes/all/GCA/004/036/575/GCA_004036575.1_ASM403657v1/GCA_004036575.1_ASM403657v1_genomic.fna.gz
46 /data/ncbi/genomes/all/GCA/001/166/405/GCA_001166405.1_8525_3_42/GCA_001166405.1_8525_3_42_genomic.fna.gz
44 /data/ncbi/genomes/all/GCA/008/354/005/GCA_008354005.1_PDT000581113.1/GCA_008354005.1_PDT000581113.1_genomic.fna.gz
43 /data/ncbi/genomes/all/GCA/004/084/815/GCA_004084815.1_ASM408481v1/GCA_004084815.1_ASM408481v1_genomic.fna.gz
42 /data/ncbi/genomes/all/GCA/009/903/655/GCA_009903655.1_ASM990365v1/GCA_009903655.1_ASM990365v1_genomic.fna.gz
42 /data/ncbi/genomes/all/GCA/004/066/535/GCA_004066535.1_ASM406653v1/GCA_004066535.1_ASM406653v1_genomic.fna.gz
41 /data/ncbi/genomes/all/GCA/008/721/655/GCA_008721655.1_PDT000592674.1/GCA_008721655.1_PDT000592674.1_genomic.fna.gz
39 /data/ncbi/genomes/all/GCA/012/004/665/GCA_012004665.1_PDT000717166.1/GCA_012004665.1_PDT000717166.1_genomic.fna.gz
38 /data/ncbi/genomes/all/GCA/900/052/195/GCA_900052195.1_10678_7_85/GCA_900052195.1_10678_7_85_genomic.fna.gz
38 /data/ncbi/genomes/all/GCA/008/612/285/GCA_008612285.1_PDT000070042.2/GCA_008612285.1_PDT000070042.2_genomic.fna.gz
38 /data/ncbi/genomes/all/GCA/008/001/925/GCA_008001925.1_PDT000049072.2/GCA_008001925.1_PDT000049072.2_genomic.fna.gz
38 /data/ncbi/genomes/all/GCA/006/051/815/GCA_006051815.1_PDT000207254.2/GCA_006051815.1_PDT000207254.2_genomic.fna.gz
38 /data/ncbi/genomes/all/GCA/001/117/225/GCA_001117225.1_8447_8_32/GCA_001117225.1_8447_8_32_genomic.fna.gz
37 /data/ncbi/genomes/all/GCA/004/647/945/GCA_004647945.1_PDT000102292.2/GCA_004647945.1_PDT000102292.2_genomic.fna.gz
37 /data/ncbi/genomes/all/GCA/002/653/475/GCA_002653475.1_ASM265347v1/GCA_002653475.1_ASM265347v1_genomic.fna.gz
36 /data/ncbi/genomes/all/GCA/900/052/335/GCA_900052335.1_10625_3_19/GCA_900052335.1_10625_3_19_genomic.fna.gz
36 /data/ncbi/genomes/all/GCA/006/309/005/GCA_006309005.1_PDT000274133.2/GCA_006309005.1_PDT000274133.2_genomic.fna.gz
35 /data/ncbi/genomes/all/GCA/006/441/215/GCA_006441215.1_ASM644121v1/GCA_006441215.1_ASM644121v1_genomic.fna.gz
34 /data/ncbi/genomes/all/GCA/005/696/255/GCA_005696255.1_PDT000392782.1/GCA_005696255.1_PDT000392782.1_genomic.fna.gz
34 /data/ncbi/genomes/all/GCA/002/596/765/GCA_002596765.1_ASM259676v1/GCA_002596765.1_ASM259676v1_genomic.fna.gz
32 /data/ncbi/genomes/all/GCA/001/364/375/GCA_001364375.1_10608_2_26/GCA_001364375.1_10608_2_26_genomic.fna.gz
31 /data/ncbi/genomes/all/GCA/010/236/905/GCA_010236905.1_PDT000445952.1/GCA_010236905.1_PDT000445952.1_genomic.fna.gz
31 /data/ncbi/genomes/all/GCA/009/628/015/GCA_009628015.1_PDT000030926.2/GCA_009628015.1_PDT000030926.2_genomic.fna.gz
31 /data/ncbi/genomes/all/GCA/008/845/845/GCA_008845845.1_PDT000114462.2/GCA_008845845.1_PDT000114462.2_genomic.fna.gz
31 /data/ncbi/genomes/all/GCA/008/685/245/GCA_008685245.1_ASM868524v1/GCA_008685245.1_ASM868524v1_genomic.fna.gz
31 /data/ncbi/genomes/all/GCA/008/258/605/GCA_008258605.1_PDT000576072.1/GCA_008258605.1_PDT000576072.1_genomic.fna.gz
30 /data/ncbi/genomes/all/GCA/010/400/065/GCA_010400065.1_PDT000256434.2/GCA_010400065.1_PDT000256434.2_genomic.fna.gz
30 /data/ncbi/genomes/all/GCA/008/942/165/GCA_008942165.1_PDT000144708.2/GCA_008942165.1_PDT000144708.2_genomic.fna.gz
30 /data/ncbi/genomes/all/GCA/008/605/865/GCA_008605865.1_ASM860586v1/GCA_008605865.1_ASM860586v1_genomic.fna.gz
30 /data/ncbi/genomes/all/GCA/005/567/835/GCA_005567835.1_ASM556783v1/GCA_005567835.1_ASM556783v1_genomic.fna.gz
30 /data/ncbi/genomes/all/GCA/004/391/645/GCA_004391645.1_PDT000004314.4/GCA_004391645.1_PDT000004314.4_genomic.fna.gz
29 /data/ncbi/genomes/all/GCA/902/719/935/GCA_902719935.1_12083_7_86/GCA_902719935.1_12083_7_86_genomic.fna.gz
29 /data/ncbi/genomes/all/GCA/900/052/495/GCA_900052495.1_10678_6_12/GCA_900052495.1_10678_6_12_genomic.fna.gz
29 /data/ncbi/genomes/all/GCA/008/513/255/GCA_008513255.1_PDT000586050.1/GCA_008513255.1_PDT000586050.1_genomic.fna.gz
29 /data/ncbi/genomes/all/GCA/005/408/385/GCA_005408385.1_ASM540838v1/GCA_005408385.1_ASM540838v1_genomic.fna.gz
29 /data/ncbi/genomes/all/GCA/003/109/665/GCA_003109665.1_ASM310966v1/GCA_003109665.1_ASM310966v1_genomic.fna.gz
28 /data/ncbi/genomes/all/GCA/006/498/755/GCA_006498755.1_PDT000161244.2/GCA_006498755.1_PDT000161244.2_genomic.fna.gz
27 /data/ncbi/genomes/all/GCA/010/105/735/GCA_010105735.1_PDT000153679.2/GCA_010105735.1_PDT000153679.2_genomic.fna.gz
27 /data/ncbi/genomes/all/GCA/008/032/415/GCA_008032415.1_PDT000041808.3/GCA_008032415.1_PDT000041808.3_genomic.fna.gz
27 /data/ncbi/genomes/all/GCA/006/874/325/GCA_006874325.1_PDT000065880.2/GCA_006874325.1_PDT000065880.2_genomic.fna.gz
27 /data/ncbi/genomes/all/GCA/006/179/705/GCA_006179705.1_PDT000267758.2/GCA_006179705.1_PDT000267758.2_genomic.fna.gz
27 /data/ncbi/genomes/all/GCA/004/647/905/GCA_004647905.1_PDT000102281.2/GCA_004647905.1_PDT000102281.2_genomic.fna.gz
26 /data/ncbi/genomes/all/GCA/010/055/505/GCA_010055505.1_PDT000658019.1/GCA_010055505.1_PDT000658019.1_genomic.fna.gz
26 /data/ncbi/genomes/all/GCA/008/832/605/GCA_008832605.1_PDT000119810.2/GCA_008832605.1_PDT000119810.2_genomic.fna.gz
26 /data/ncbi/genomes/all/GCA/005/163/945/GCA_005163945.1_PDT000181854.2/GCA_005163945.1_PDT000181854.2_genomic.fna.gz
26 /data/ncbi/genomes/all/GCA/003/109/885/GCA_003109885.1_ASM310988v1/GCA_003109885.1_ASM310988v1_genomic.fna.gz
26 /data/ncbi/genomes/all/GCA/002/650/675/GCA_002650675.1_ASM265067v1/GCA_002650675.1_ASM265067v1_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/900/052/795/GCA_900052795.1_10678_6_38/GCA_900052795.1_10678_6_38_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/009/937/945/GCA_009937945.1_ASM993794v1/GCA_009937945.1_ASM993794v1_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/008/746/075/GCA_008746075.1_PDT000593690.1/GCA_008746075.1_PDT000593690.1_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/006/843/125/GCA_006843125.1_PDT000324448.1/GCA_006843125.1_PDT000324448.1_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/006/498/895/GCA_006498895.1_PDT000161392.2/GCA_006498895.1_PDT000161392.2_genomic.fna.gz
25 /data/ncbi/genomes/all/GCA/002/648/635/GCA_002648635.1_ASM264863v1/GCA_002648635.1_ASM264863v1_genomic.fna.gz
24 /data/ncbi/genomes/all/GCA/008/911/545/GCA_008911545.1_PDT000134773.2/GCA_008911545.1_PDT000134773.2_genomic.fna.gz
24 /data/ncbi/genomes/all/GCA/008/821/265/GCA_008821265.1_ASM882126v1/GCA_008821265.1_ASM882126v1_genomic.fna.gz
24 /data/ncbi/genomes/all/GCA/008/456/745/GCA_008456745.1_PDT000359439.1/GCA_008456745.1_PDT000359439.1_genomic.fna.gz
24 /data/ncbi/genomes/all/GCA/006/282/275/GCA_006282275.1_PDT000309175.2/GCA_006282275.1_PDT000309175.2_genomic.fna.gz
24 /data/ncbi/genomes/all/GCA/006/203/745/GCA_006203745.1_PDT000297560.2/GCA_006203745.1_PDT000297560.2_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/902/724/675/GCA_902724675.1_TRF.01.04_HeklaHavn_GL_AD1895/GCA_902724675.1_TRF.01.04_HeklaHavn_GL_AD1895_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/900/005/085/GCA_900005085.1_A_England_690_2010/GCA_900005085.1_A_England_690_2010_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/010/036/025/GCA_010036025.1_PDT000657209.1/GCA_010036025.1_PDT000657209.1_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/008/822/925/GCA_008822925.1_ASM882292v1/GCA_008822925.1_ASM882292v1_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/008/721/475/GCA_008721475.1_PDT000592656.1/GCA_008721475.1_PDT000592656.1_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/008/441/765/GCA_008441765.1_PDT000313113.2/GCA_008441765.1_PDT000313113.2_genomic.fna.gz
23 /data/ncbi/genomes/all/GCA/006/766/685/GCA_006766685.1_PDT000332507.1/GCA_006766685.1_PDT000332507.1_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/902/719/535/GCA_902719535.1_12083_7_43/GCA_902719535.1_12083_7_43_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/902/719/445/GCA_902719445.1_12083_7_27/GCA_902719445.1_12083_7_27_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/010/476/085/GCA_010476085.1_PDT000646810.1/GCA_010476085.1_PDT000646810.1_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/008/685/755/GCA_008685755.1_ASM868575v1/GCA_008685755.1_ASM868575v1_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/006/864/315/GCA_006864315.1_ASM686431v1/GCA_006864315.1_ASM686431v1_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/006/426/015/GCA_006426015.1_ASM642601v1/GCA_006426015.1_ASM642601v1_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/006/051/595/GCA_006051595.1_PDT000211028.2/GCA_006051595.1_PDT000211028.2_genomic.fna.gz
22 /data/ncbi/genomes/all/GCA/004/089/355/GCA_004089355.1_ASM408935v1/GCA_004089355.1_ASM408935v1_genomic.fna.gz
21 /data/ncbi/genomes/all/GCA/008/855/825/GCA_008855825.1_PDT000126325.2/GCA_008855825.1_PDT000126325.2_genomic.fna.gz
21 /data/ncbi/genomes/all/GCA/002/633/125/GCA_002633125.1_ASM263312v1/GCA_002633125.1_ASM263312v1_genomic.fna.gz
21 /data/ncbi/genomes/all/GCA/002/536/055/GCA_002536055.1_ASM253605v1/GCA_002536055.1_ASM253605v1_genomic.fna.gz
21 /data/ncbi/genomes/all/GCA/001/177/465/GCA_001177465.1_6903_5_90/GCA_001177465.1_6903_5_90_genomic.fna.gz
20 /data/ncbi/genomes/all/GCA/900/197/415/GCA_900197415.1_PEDV_GER_L01059-K07_15-01_2015/GCA_900197415.1_PEDV_GER_L01059-K07_15-01_2015_genomic.fna.gz
20 /data/ncbi/genomes/all/GCA/900/006/035/GCA_900006035.1_A_England_283_2010/GCA_900006035.1_A_England_283_2010_genomic.fna.gz
20 /data/ncbi/genomes/all/GCA/008/596/585/GCA_008596585.1_PDT000587798.1/GCA_008596585.1_PDT000587798.1_genomic.fna.gz
To speed up my troubleshooting, I subset the 14,000+ individual signature files that were giving replicated signatures in the sbt db and then ran sourmash index
and sourmash sig describe
and counted the signatures as above, to see if I could replicate the behavior with a more manageable dataset, but I could not. The 14,000+ signatures that were replicated when I ran sourmash index
on the 750,000+ signatures from the genbank assembly database were not replicated when I ran sourmash index
on just that subset. So, something about running sourmash index
on the full 750,000+ signatures is causing this replication.
Any recommendations? Should I properly script up a test to subset the signatures @ 375,000 signatures etc, to see if there is a threshold number of signatures that causes this behavior?
Yup, seems like we have a huge bug =(
Any recommendations?
Thanks for posting the commands, I'm also counting the number of signatures in the largest SBT I have at the moment. While that's running, some potential debugging ideas...
creating some very small signatures and internal nodes (like, one hash per sig, -x 1
for internal nodes). The insertion process doesn't really care about the content of the signatures, or the size of the internal nodes. (of course, we can't debug search
or gather
, but since we are only checking the SBT structure...)
Write a Python script that creates an SBT and insert the signatures, avoiding the CLI/all the other processing. (this one is for figuring out if the problem is in the SBT code, or the 'supporting' code around it)
Should I properly script up a test to subset the signatures @ 375,000 signatures etc, to see if there is a threshold number of signatures that causes this behavior?
That might be the case, but I don't remember any conditions that would trigger after a specific number of sigs is inserted... But as we saw in this issue, my gut feeling is probably wrong =P
UPDATE: yup, also seeing duplicated sigs in the DB I built. sigh.
I started https://github.com/luizirber/2020-08-14-debug-sbt for tracking this. Now building a mock SBT with sig names from genbank-bacteria, with 652k genomes.
Re:
UPDATE: yup, also seeing duplicated sigs in the DB I built. sigh.
and
I started https://github.com/luizirber/2020-08-14-debug-sbt for tracking this. Now building a mock SBT with sig names from genbank-bacteria, with 652k genomes.
Thanks for tackling this. It's probably not your favorite task :/
Re:
creating some very small signatures and internal nodes (like, one hash per sig, -x 1 for internal nodes). The insertion process doesn't really care about the content of the signatures, or the size of the internal nodes. (of course, we can't debug search or gather, but since we are only checking the SBT structure...)
I am generating single-hash signatures for my whole dataset now. Will try running sourmash index
probably by this evening and get results tomorrow morning.
Will move future comments to https://github.com/luizirber/2020-08-14-debug-sbt
Thanks for tackling this. It's probably not your favorite task :/
More like "not the task I should be doing" :grimacing: (it is actually quite fun, :detective: work!)
I am generating single-hash signatures for my whole dataset now. Will try running
sourmash index
probably by this evening and get results tomorrow morning.
It is probably easier to generate with the API (like I did in https://github.com/luizirber/2020-08-14-debug-sbt), but the result of that is... no duplicates :eyes:
So! There is something weird going on with the --from-file
parsing, what is being fed to the SBT construction...
More like "not the task I should be doing" :grimacing:
It is probably easier to generate with the API (like I did in https://github.com/luizirber/2020-08-14-debug-sbt), but the result of that is... no duplicates :eyes:
Okay, I'll use the Python API to build my SBT, following your example.
So! There is something weird going on with the --from-file parsing, what is being fed to the SBT construction...
Yes, the parsing for --from-file
is ... awkward ... besides that: #1066
hi folks, could we move this to a new issue, pls? :)
please file updates over in #1171!
Righto :blush:
On Fri, Aug 14, 2020 at 02:38:29PM -0700, Nathan Brown wrote:
Righto :blush:
;) no worries, but it's always nice to have an issue to close with a PR!
https://github.com/dib-lab/sourmash/pull/1477 could add support for "masking" arbitrary signatures from search and gather.
masking of signatures was added at CLI in https://github.com/sourmash-bio/sourmash/pull/1871.
Most of the other things we discuss in here have also been resolved elsewhere 🎉
for now, you have to rebuild sourmash lca databases from scratch if you want to change their content, but it should be straightforward to implement both an add/append and a remove.
then, separately, we would have to provide command line options for doing so.
ref https://github.com/dib-lab/sourmash/issues/555#issuecomment-429974408.