Closed: sbresnahan closed this issue 1 year ago.
Yeah, I can replicate that. It looks like the `nucl_wgs.accession2taxid.gz` currently provided by NCBI (uploaded 2022-11-28 03:14) does not match the md5 checksum they provide (also uploaded 2022-11-28 03:14). The current `nucl_wgs.accession2taxid.gz` is apparently corrupt, e.g. `zcat nucl_wgs.accession2taxid.gz | tail` generates a bunch of numeric gibberish instead of the nicely formatted rows expected. So I guess the function is catching a problem as intended, and the problem is upstream of taxonomizr.
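For anyone wanting to reproduce this check themselves, NCBI publishes a matching `.md5` file alongside each archive, and `md5sum -c` compares the two. The snippet below is a sketch that fabricates a small stand-in file so it runs offline; with the real download you would fetch `nucl_wgs.accession2taxid.gz.md5` from the same directory and verify against that.

```shell
# Offline sketch of the checksum comparison: create a stand-in file,
# record its md5, then verify it. A corrupted file would make the
# final `md5sum -c` report FAILED and exit nonzero.
printf 'accession\taccession.version\ttaxid\tgi\n' > demo.accession2taxid.gz
md5sum demo.accession2taxid.gz > demo.accession2taxid.gz.md5
md5sum -c demo.accession2taxid.gz.md5
```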
I don't really have any contacts at NCBI (I will email now to see if there's anyone I should ping in these cases), so my advice would be to wait a day or two (watching https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ for "Last modified" to change, if motivated) and then try to redownload.
Let me know if it's an emergency and I could try to send you an archived older version of the database.
Hmm, `wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz` does return the correct file though, so I guess it is something internal to taxonomizr/R. Let me check on that.
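A quick way to tell an intact archive from a truncated one, without needing the checksum file, is `gzip -t`, which validates the compressed stream. An offline sketch (the data row is invented for illustration):

```shell
# Build a tiny valid gzip archive, then a deliberately truncated copy,
# and show that `gzip -t` distinguishes the two.
printf 'A00001\tA00001.1\t10641\t58418\n' | gzip > good.accession2taxid.gz
head -c 10 good.accession2taxid.gz > truncated.accession2taxid.gz
gzip -t good.accession2taxid.gz && echo "good archive: intact"
gzip -t truncated.accession2taxid.gz 2>/dev/null || echo "truncated archive: corrupt"
```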
And I just noticed that my failure was on the first ~2 GB download (`nucl_gb.accession2taxid.gz`):

```
taxonomizr::prepareDatabase("accessionTaxa.sql")
Downloading names and nodes with getNamesAndNodes()
[100%] Downloaded 59698478 bytes...
[100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
[100%] Downloaded 2237753217 bytes...
[100%] Downloaded 61 bytes...
Error in (function (xx, yy) :
Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz File corrupted or download ended early?
```

while yours was on the second ~4 GB download (`nucl_wgs.accession2taxid.gz`).
And I repeated it in R again and got a clean run:

```
taxonomizr::prepareDatabase("accessionTaxa.sql")
Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
[100%] Downloaded 2237753217 bytes...
[100%] Downloaded 61 bytes...
[100%] Downloaded 4152831461 bytes...
[100%] Downloaded 62 bytes...
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.
```
So I'm going to suggest there's something a bit funny going on with NCBI's servers and downloads are occasionally silently failing (or at least I hope that's it, since debugging stochastic download errors doesn't sound particularly fun). My advice would be to repeat a few times, deleting the file if corrupted (if the first file downloads cleanly, you can leave it in the working folder so taxonomizr doesn't have to redownload that one).
Let me know if that fixes things when you get the chance.
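That retry advice can be sketched as a small loop: re-fetch, validate the gzip stream, and delete the file on failure. To keep the sketch runnable offline, `fake_download` below is a hypothetical stand-in for `wget` that produces a corrupt file on the first attempt and a good one on the second; in real use you would call `wget` on the NCBI URL instead.

```shell
# fake_download: hypothetical stand-in for wget, for offline illustration.
# Attempt 1 writes garbage; later attempts write a valid gzip archive.
fake_download() {
  if [ "$1" -eq 1 ]; then
    printf 'not a gzip stream' > nucl_gb.accession2taxid.gz
  else
    printf 'A00001\tA00001.1\t10641\t58418\n' | gzip > nucl_gb.accession2taxid.gz
  fi
}

# Retry up to three times, deleting the file whenever gzip -t says
# it is corrupt or truncated.
for attempt in 1 2 3; do
  fake_download "$attempt"
  if gzip -t nucl_gb.accession2taxid.gz 2>/dev/null; then
    echo "attempt $attempt: download OK"
    break
  fi
  echo "attempt $attempt: corrupt, deleting and retrying"
  rm -f nucl_gb.accession2taxid.gz
done
```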
@sherrillmix - The file is now downloading correctly. However (maybe this is a distinct issue), I'm now seeing that even if I set `tmpDir` to an external drive with 2 TB of free disk space in the call to `prepareDatabase()`, the internal disk space on my device drops steadily until I begin to get "low disk space" notifications, and the call crashes at "Reading ./nucl_gb.accession2taxid.gz". I don't know in which directory the underlying process is working, but shouldn't it be within `tmpDir`?
That's a common problem with SQLite, since SQLite uses its own temp directory settings, independent of R's. So you have to do a bit of a workaround to get it to use the correct directory, e.g. `prepareDatabase(extraSqlCommand="PRAGMA temp_store_directory = '/MY/TMP/DIR'")`, or try the other suggestions in #41.
Setting `extraSqlCommand="PRAGMA temp_store_directory = '/path/to/my/external2Tb'"` does not fix this issue for me; disk space on the wrong partition is still being used up.
I assume you did, but just in case: you'd probably want both `tmpDir="some_directory"` for R's temp files and `extraSqlCommand="PRAGMA temp_store_directory = '/path/to/my/external2Tb'"` for SQLite's. Otherwise, maybe try some of the other suggestions from #41, and if none work, report `sessionInfo()` so we can start digging into things.
Perhaps this is a version-incompatibility issue - `temp_store_directory` is a deprecated PRAGMA (https://www.sqlite.org/pragma.html#pragma_temp_store_directory).
Setting the `SQLITE_TMPDIR` environment variable to the desired directory and running `prepareDatabase()` via command-line R instead of RStudio resolved this issue for me.
Yeah, it's annoying that there's not a consistent simple fix within R/RStudio. Setting `SQLITE_TMPDIR` or `TMPDIR` ahead of time, or starting R with the variable definition prepended (`SQLITE_TMPDIR=/location/of/drive R`), seem like good options. Thanks for the bug report.
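Those options amount to exporting the variables in the shell before launching R. A sketch, using the example path from this thread (the `R` invocation is shown as a comment so the snippet runs anywhere; set `TMPDIR` too if you also want R's own temp files on the big drive):

```shell
# Point SQLite's (and optionally R's) temp files at the external drive
# before starting R, so the big sort during read.accession2taxid()
# does not fill the internal disk.
export SQLITE_TMPDIR=/path/to/my/external2Tb
export TMPDIR=/path/to/my/external2Tb
echo "SQLITE_TMPDIR=$SQLITE_TMPDIR"
# Then start R from this same shell, or prepend the variable instead:
#   SQLITE_TMPDIR=/path/to/my/external2Tb R
```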
Running `prepareDatabase("accessionTaxa.sql", tmpDir="some_directory")` yields:

```
Downloading names and nodes with getNamesAndNodes()
[100%] Downloaded 59698472 bytes...
[100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
[100%] Downloaded 2237753217 bytes...
[100%] Downloaded 61 bytes...
[100%] Downloaded 4152831461 bytes...
[100%] Downloaded 62 bytes...
Error in (function (xx, yy) :
Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz File corrupted or download ended early?
Calls: prepareDatabase -> do.call -> -> mapply ->
Execution halted
```