Open ccbaumler opened 1 month ago
thanks @ccbaumler! Can you add an example of the command you were running with?
I'm guessing you expected only genome downloads b/c that's what you specified for signatures, but didn't use --genomes-only? I'll make an issue to avoid trying to download protein if we're not building protein sketches.
more later!
Error: Failed to send request
Error: Error processing signature
I haven't gotten these yet in my testing :)!
They are, specifically:
Error: Failed to send request
- error sending the md5sum download request to the server (= error downloading md5sum file)
Error: Error processing signature
- error writing sigs
> Can you add an example of the command you were running with?
Sure, here is an example of the command in my workflow:
sourmash scripts gbsketch data/update.20240509-fungi.csv -o ../dbs/genbank-20240509-fungi.rever.zip --failed data/update.20240509-fungi.failures.csv --param-str "dna,k=21,k=31,k=51,scaled=1000,abund" -r 1
> I'm guessing you expected only genome downloads b/c that's what you specified for signatures, but didn't use --genomes-only? I'll make an issue to avoid trying to download protein if we're not building protein sketches.
Yup, completely missed that option! I thought --param-str was the filter for file type (i.e. if moltype == DNA, get fna).
I am sketching the DNA DBs first to get a feel for the workflow. I will then incorporate a protein rule set with proper k and scaled values.
I think I'll update the command to include -g and increase -r to 5.
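A minimal sketch of how the adjusted call could be assembled from a Python workflow script. It only uses flags that appear in this thread (-o, --failed, --param-str, -r, and --genomes-only, which is referred to above as -g); the helper name gbsketch_command and its keyword arguments are hypothetical and are not part of sourmash or the plugin.

```python
import shlex


def gbsketch_command(csv_path, output_zip, failed_csv, param_str,
                     retries=5, genomes_only=True):
    """Build the argument list for a gbsketch run (hypothetical helper)."""
    cmd = [
        "sourmash", "scripts", "gbsketch", csv_path,
        "-o", output_zip,
        "--failed", failed_csv,
        "--param-str", param_str,
        "-r", str(retries),
    ]
    if genomes_only:
        # Skip protein downloads when only DNA sketches are being built.
        cmd.append("--genomes-only")
    return cmd


cmd = gbsketch_command(
    "data/update.20240509-fungi.csv",
    "../dbs/genbank-20240509-fungi.rever.zip",
    "data/update.20240509-fungi.failures.csv",
    "dna,k=21,k=31,k=51,scaled=1000,abund",
)
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids quoting problems if the command is later run through subprocess rather than pasted into a shell.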
I noticed occasional messages while running version 0.2.2 that appear in between the
Starting accession #/# (%)
lines, such as:
Error: Failed to send request
Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/593/135/GCA_001593135.1_ASM159313v1/md5checksums.txt: <head>
Error: Error processing signature
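The "Invalid checksum line format ... <head>" message suggests the server returned an HTML page instead of the checksum file. The sketch below shows one way such a check could look. It is not the plugin's actual code, and it assumes each md5checksums.txt line is a 32-character hex digest, whitespace, then a relative path; the function name parse_checksum_line is hypothetical.

```python
import re

# Assumed layout of one md5checksums.txt entry: "<32 hex chars>  ./<file name>"
CHECKSUM_LINE = re.compile(r"^([0-9a-fA-F]{32})\s+(\S.*)$")


def parse_checksum_line(line: str) -> tuple[str, str]:
    """Return (md5, file name), or raise ValueError for a malformed line."""
    stripped = line.strip()
    match = CHECKSUM_LINE.match(stripped)
    if match is None:
        # An HTML error page (e.g. a line reading "<head>") ends up here.
        raise ValueError(f"Invalid checksum line format: {stripped!r}")
    md5, path = match.groups()
    return md5.lower(), path.removeprefix("./")


good = "d41d8cd98f00b204e9800998ecf8427e  ./GCA_001593135.1_ASM159313v1_genomic.fna.gz"
md5, name = parse_checksum_line(good)

try:
    parse_checksum_line("<head>")
except ValueError as err:
    html_error = str(err)
```

A failure of this kind is probably worth retrying, since the same URL may return the real checksum file on a later attempt.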
comparing the report files
While these messages were infrequent, there seem to be a lot in the failures category:
The report for the genomes requiring an updated version from the old fungi genbank to the current fungi genbank:
comparing the dbs
The OG db that I started with:
This is the database that has been cleansed of the genome sketches requiring new versions (and any suspended/removed genomes):
This is the database containing only the genome sketches requiring updated versions:
Here is the final "updated" fungi database that has all "bad genomes" removed and has only the reversioned genome sketches added:
When looking closer at the failure CSV, there are many protein files included:
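To quantify how many of the failures are protein files, a rough tally along these lines may help. The column layout of the failure CSV is not shown in this thread, so the sketch makes no assumption about it and classifies each row by the standard NCBI file-name suffixes (_genomic.fna.gz, _protein.faa.gz) found in any field. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# NCBI file-name suffixes mapped to a label for the tally.
SUFFIXES = {"_genomic.fna.gz": "genome", "_protein.faa.gz": "protein"}


def count_failures_by_type(handle) -> Counter:
    """Count CSV rows by file type, based on suffixes found in any field."""
    counts = Counter()
    for row in csv.reader(handle):
        kind = "other"  # header rows and unrecognized rows land here
        for field in row:
            for suffix, label in SUFFIXES.items():
                if field.strip().endswith(suffix):
                    kind = label
        counts[kind] += 1
    return counts


# Made-up rows for illustration only; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "accession,name,filename\n"
    "GCA_1,x,GCA_1_genomic.fna.gz\n"
    "GCA_1,x,GCA_1_protein.faa.gz\n"
    "GCA_2,y,GCA_2_protein.faa.gz\n"
)
counts = count_failures_by_type(sample)
```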
The report file from update-sourmash-dbs.py states:
The full updated DB is still running; I will update when finished...
The report for the missing genomes to update the old fungi genbank to the current fungi genbank:
$ sourmash sig summarize ../dbs/genbank-20240509-fungi.miss.zip
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading from '../dbs/genbank-20240509-fungi.miss.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/2024-database-creation/dbs/genbank-20240509-fungi.miss.zip
is database? yes
has manifest? yes
num signatures: 23514
examining manifest...
total hashes: 668404237
summary of sketches:
7838 sketches with dna, k=51, scaled=1000, abund 228448257 total hashes
7838 sketches with dna, k=31, scaled=1000, abund 222722670 total hashes
7838 sketches with dna, k=21, scaled=1000, abund 217233310 total hashes
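When comparing several databases, it can be handy to sanity-check a summary like the one above by adding up the per-k rows. The following is a small sketch that parses the "N sketches with ..." rows from pasted sig summarize text; the regular expression is an assumption based only on the output shown here, and summarize_rows is a hypothetical helper, not a sourmash function.

```python
import re

# Matches rows such as:
# "7838 sketches with dna, k=51, scaled=1000, abund 228448257 total hashes"
SKETCH_ROW = re.compile(
    r"(\d+) sketches with (\w+), k=(\d+), scaled=(\d+)(?:, abund)?\s+(\d+) total hashes"
)


def summarize_rows(text: str) -> dict:
    """Sum sketch and hash counts over all per-k rows found in the text."""
    rows = [
        {"sketches": int(m.group(1)), "ksize": int(m.group(3)), "hashes": int(m.group(5))}
        for m in SKETCH_ROW.finditer(text)
    ]
    return {
        "ksizes": sorted(r["ksize"] for r in rows),
        "sketches": sum(r["sketches"] for r in rows),
        "hashes": sum(r["hashes"] for r in rows),
    }


summary_text = """
7838 sketches with dna, k=51, scaled=1000, abund 228448257 total hashes
7838 sketches with dna, k=31, scaled=1000, abund 222722670 total hashes
7838 sketches with dna, k=21, scaled=1000, abund 217233310 total hashes
"""
totals = summarize_rows(summary_text)
# The sums agree with "num signatures: 23514" and "total hashes: 668404237".
```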