sourmash-bio / sourmash_plugin_directsketch

BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

Discontinuity between `--failed` reports and final database #25

Open ccbaumler opened 1 month ago

ccbaumler commented 1 month ago

I noticed some occasional messages while running version 0.2.2 that loads inbetween Starting accession #/# (%), such as:

  1. Error: Failed to send request
  2. Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/593/135/GCA_001593135.1_ASM159313v1/md5checksums.txt: <head>
  3. Error: Error processing signature

comparing the report files

While these messages were infrequent, there seem to be a lot in the failures category:

The report for the genomes requiring an updated version from the old fungi genbank to the current fungi genbank:

$ wc -l data/update.20240509-fungi.csv
244 data/update.20240509-fungi.csv

I expect to see ~240 genomes removed and added back into the fungi DBs

$ wc -l data/update.20240509-fungi.failures.csv
189 data/update.20240509-fungi.failures.csv

Failures are reported to be 189 out of the 244! I now expect to only see 55 genomes sketches in the DBs.

comparing the dbs

The OG db that I started with:

$ sourmash sig summarize /group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-fungi-k21.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-fungi-k21.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-fungi-k21.zip
is database? yes
has manifest? yes
num signatures: 10285
** examining manifest...
total hashes: 327547787
summary of sketches:
   10285 sketches with DNA, k=21, scaled=1000, abund  327547787 total hashes

I see 10285 sketches are present in the original DB

This is the database that has been cleansed of the genome sketches requiring new versions (and any suspended/removed genomes):

sourmash sig summarize ../dbs/genbank-20240509-fungi-k21.clean.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '../dbs/genbank-20240509-fungi-k21.clean.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/2024-database-creation/dbs/genbank-20240509-fungi-k21.clean.zip
is database? yes
has manifest? yes
num signatures: 10036
** examining manifest...
total hashes: 321575010
summary of sketches:
   10036 sketches with DNA, k=21, scaled=1000, abund  321575010 total hashes

This is the "clean" DB I am adding new sketches to with directsketch and sig cat.

This is the database containing only the genome sketches requiring updated versions:

$ sourmash sig summarize ../dbs/genbank-20240509-fungi.rever.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '../dbs/genbank-20240509-fungi.rever.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/2024-database-creation/dbs/genbank-20240509-fungi.rever.zip
is database? yes
has manifest? yes
num signatures: 714
** examining manifest...
total hashes: 17989561
summary of sketches:
   238 sketches with dna, k=21, scaled=1000, abund    5672874 total hashes
   238 sketches with dna, k=51, scaled=1000, abund    6340464 total hashes
   238 sketches with dna, k=31, scaled=1000, abund    5976223 total hashes

Looks like 238 genomes were sketched and that does not jive with the failure report.

Here is the final "updated" fungi database that has all "bad genomes" removed and has only the reversioned genome sketches added:

$ sourmash sig summarize /group/ctbrowngrp/sourmash-db/genbank-20240509/genbank-20240509-fungi-k21.update.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '/group/ctbrowngrp/sourmash-db/genbank-20240509/genbank-20240509-fungi-k21.update.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp/sourmash-db/genbank-20240509/genbank-20240509-fungi-k21.update.zip
is database? yes
has manifest? yes
num signatures: 10274
** examining manifest...
total hashes: 327247884
summary of sketches:
   10274 sketches with DNA, k=21, scaled=1000, abund  327247884 total hashes

After removing all versioned genomes and removed/suspended genomes, I am starting with 10038 sketches and adding back 238 for a grand total of 10274 sketches. The databases add up and make sense while doing so. yay! With this, I would have expected only 11 lines in the failure.

When looking closer at the failure CSV, there are many protein files included:

$  awk -F',' '{print $3}' data/update.20240509-fungi.failures.csv | grep DNA | wc -l
4

awk -F',' '{print $3}' data/update.20240509-fungi.failures.csv | grep protein | wc -l
184
/home/baumlerc/2024-database-creation/workflow/data/update.20240509-fungi.failures.csv

The report file from update-sourmash-dbs.py states:

From 10285 in 'data/mf.20240509-fungi-k21.csv':
Kept 10036 in 'data/mf-clean.20240509-fungi-k21.csv.
Removed 249 total.
Removed 6 identifiers because of suspected suspension of the genome.
Removed 243 because of changed version.

That changed version state is almost correct (could be an indexing error) but the 4 failed dna in failures and the 238 sketched genomes make 242.

/home/baumlerc/2024-database-creation/workflow/data/report.20240509-fungi-k21.txt

The full updated DB is still running will update when finished...

The report for the missing genomes to update the old fungi genbank to the current fungi genbank:

wc -l data/missing.20240509-fungi.csv  
7856 data/missing.20240509-fungi.csv

I expect to update the database with

wc -l data/missing.20240509-fungi.failures.csv
5890 data/missing.20240509-fungi.failures.csv

$ sourmash sig summarize ../dbs/genbank-20240509-fungi.miss.zip

== This is sourmash version 4.8.5. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading from '../dbs/genbank-20240509-fungi.miss.zip' path filetype: ZipFileLinearIndex location: /home/baumlerc/2024-database-creation/dbs/genbank-20240509-fungi.miss.zip is database? yes has manifest? yes num signatures: 23514 examining manifest... total hashes: 668404237 summary of sketches: 7838 sketches with dna, k=51, scaled=1000, abund 228448257 total hashes 7838 sketches with dna, k=31, scaled=1000, abund 222722670 total hashes 7838 sketches with dna, k=21, scaled=1000, abund 217233310 total hashes

bluegenes commented 1 month ago

thanks @ccbaumler! Can you add an example of the command you were running with?

I'm guessing you expected only genome downloads b/c that's what you specified for signatures, but didn't use --genomes-only? I'll make an issue to avoid trying to download protein if we're not building protein sketches.

more later!

bluegenes commented 1 month ago
  • Error: Failed to send request
  • Error: Error processing signature

I haven't gotten these yet in my testing :)!

They are, specifically:

ccbaumler commented 1 month ago

Can you add an example of the command you were running with?

Sure, here is an example of the command in my workflow:

sourmash scripts gbsketch data/update.20240509-fungi.csv -o ../dbs/genbank-20240509-fungi.rever.zip --failed data/update.20240509-fungi.failures.csv --param-str "dna,k=21,k=31,k=51,scaled=1000,abund" -r 1

I'm guessing you expected only genome downloads b/c that's what you specified for signatures, but didn't use --genomes-only? >I'll make an issue to avoid trying to download protein if we're not building protein sketches.

Yup, completely missed that command! Thought that the --params-str was the filter for filetype. (i.e. if moltype == DNA; get fna)

I am sketching the dna DBs first to get a feel for the workflow. I will then incorporate a protein rule set. With proper k and scaled values.

I think I'll update the command to include -g and increase the -r to 5.