sherrillmix / taxonomizr

Parse NCBI taxonomy and accessions to find taxonomic assignments
GNU General Public License v2.0

Error in preparing accessionTaxa.sql with prepareDatabase() #52

Closed sbresnahan closed 1 year ago

sbresnahan commented 1 year ago

Running prepareDatabase("accessionTaxa.sql",tmpDir="some_directory") yields:

Downloading names and nodes with getNamesAndNodes()
 [100%] Downloaded 59698472 bytes...
 [100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
 [100%] Downloaded 2237753217 bytes...
 [100%] Downloaded 61 bytes...
 [100%] Downloaded 4152831461 bytes...
 [100%] Downloaded 62 bytes...
Error in (function (xx, yy)  : 
  Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz File corrupted or download ended early?
Calls: prepareDatabase -> do.call -> -> mapply ->
Execution halted

sherrillmix commented 1 year ago

Yeah, I can replicate that. It looks like the nucl_wgs.accession2taxid.gz currently provided by NCBI (uploaded 2022-11-28 03:14) does not match the md5 checksum that they provide (also uploaded 2022-11-28 03:14). The current nucl_wgs.accession2taxid.gz appears to be corrupt, e.g. zcat nucl_wgs.accession2taxid.gz|tail prints a bunch of numeric gibberish instead of the nicely formatted rows I'd expect. So I guess the function is catching a problem as intended and the problem is upstream of taxonomizr.
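The checksum comparison described above can be reproduced by hand. This is a minimal sketch, not taxonomizr's internal check: md5Matches is a hypothetical helper name, and it assumes the .md5 file holds the hash as its first whitespace-separated field (the usual md5sum layout, which is consistent with the 61- and 62-byte downloads in the log). tools::md5sum ships with base R.

```r
# Hypothetical helper: compare a local file's MD5 with the hash stored
# in an md5sum-style file ("<hash>  <filename>"). The file layout is an
# assumption, not something confirmed by NCBI documentation here.
md5Matches <- function(dataFile, md5File) {
  firstLine <- trimws(readLines(md5File, n = 1, warn = FALSE))
  expected <- strsplit(firstLine, "[[:space:]]+")[[1]][1]
  observed <- unname(tools::md5sum(dataFile))
  identical(tolower(expected), tolower(observed))
}

# Small self-contained demonstration with temporary files
demoFile <- tempfile(fileext = ".gz")
writeLines("accession\taccession.version\ttaxid\tgi", demoFile)
demoMd5 <- tempfile(fileext = ".md5")
writeLines(paste0(tools::md5sum(demoFile), "  demo.gz"), demoMd5)
md5Matches(demoFile, demoMd5)
```

For the real files, the same call would take the downloaded nucl_wgs.accession2taxid.gz and its .md5 companion from the same NCBI directory.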

I don't really have any contacts at NCBI (I will email now to see if there's anyone I should ping in these cases), so my advice would be to wait a day or two and then try to redownload. If you're motivated, you can watch https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ for the Last modified date to change.

Let me know if it's an emergency and I could try and send you an archived older version of the database.

sherrillmix commented 1 year ago

Hmm, wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz does return the correct file though, so I guess it is something internal to taxonomizr/R. Let me check on that.

sherrillmix commented 1 year ago

And I just noticed that my failure was on the first ~2 GB download (nucl_gb.accession2taxid.gz):

taxonomizr::prepareDatabase("accessionTaxa.sql")
Downloading names and nodes with getNamesAndNodes()
 [100%] Downloaded 59698478 bytes...
 [100%] Downloaded 49 bytes...
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
 [100%] Downloaded 2237753217 bytes...
 [100%] Downloaded 61 bytes...
Error in (function (xx, yy)  : 
  Downloaded file does not match ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz File corrupted or download ended early?

while yours was on the second ~4 GB download (nucl_wgs.accession2taxid.gz).

And I ran it again in R and got a clean result:

taxonomizr::prepareDatabase("accessionTaxa.sql")
Downloading names and nodes with getNamesAndNodes()
./names.dmp, ./nodes.dmp already exist. Delete to redownload
Preprocessing names with read.names.sql()
Preprocessing nodes with read.nodes.sql()
Downloading accession2taxid with getAccession2taxid()
This can be a big (several gigabytes) download. Please be patient and use a fast connection.
 [100%] Downloaded 2237753217 bytes...
 [100%] Downloaded 61 bytes...
 [100%] Downloaded 4152831461 bytes...
 [100%] Downloaded 62 bytes...
Preprocessing accession2taxid with read.accession2taxid()
Reading ./nucl_gb.accession2taxid.gz.

So I guess I'm going to suggest there's something a bit funny going on with NCBI's servers and downloads are occasionally failing silently (or at least I hope that's it, since debugging stochastic download errors doesn't sound particularly fun). My advice would be to retry a few times, deleting the file each time it is corrupted. If the first file downloads cleanly, you can leave it in the working folder so taxonomizr doesn't have to redownload it.
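The retry advice above can be sketched as a small loop. This is an illustration under stated assumptions, not part of taxonomizr: retryUntilValid, download, and isValid are hypothetical names, with download standing in for whatever fetches the file and isValid for whatever check you trust (for example an md5 comparison).

```r
# Hypothetical helper: retry a download until the file passes a check.
# `download` is any function(destFile) that writes the file;
# `isValid` is any function(destFile) returning TRUE or FALSE.
retryUntilValid <- function(destFile, download, isValid, maxTries = 3) {
  for (attempt in seq_len(maxTries)) {
    # Keep a file that already passes so it is not downloaded again
    if (file.exists(destFile) && isValid(destFile)) return(TRUE)
    # Delete a corrupted or partial file before trying again
    if (file.exists(destFile)) file.remove(destFile)
    try(download(destFile), silent = TRUE)
  }
  file.exists(destFile) && isValid(destFile)
}

# Demonstration with a fake download that fails once, then succeeds
nCalls <- 0
fakeDownload <- function(f) {
  nCalls <<- nCalls + 1
  writeLines(if (nCalls < 2) "truncated" else "complete", f)
}
fakeCheck <- function(f) identical(readLines(f), "complete")
target <- tempfile()
succeeded <- retryUntilValid(target, fakeDownload, fakeCheck)
```

Checking the file before each attempt is what lets a cleanly downloaded first file stay in place while only the corrupted one is fetched again.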

Let me know if that fixes things if you get the chance.

sbresnahan commented 1 year ago

@sherrillmix - The file is now downloading correctly. However (maybe this is a distinct issue), I'm now seeing that even if I set tmpDir to an external drive with 2 TB of free disk space in the call to prepareDatabase(), the free space on my device's internal disk drops steadily until I begin to get "low disk space" notifications and the call crashes at "Reading ./nucl_gb.accession2taxid.gz". I do not know in which directory the underlying process is running, but shouldn't it be within tmpDir?

sherrillmix commented 1 year ago

That's a common problem with sqlite, since sqlite uses its own temp directory settings independent of R. So you have to use a bit of a workaround to get it to use the correct directory, e.g. prepareDatabase(extraSqlCommand="PRAGMA temp_store_directory = '/MY/TMP/DIR'"), or try the other suggestions in #41.

sbresnahan commented 1 year ago

Setting extraSqlCommand="PRAGMA temp_store_directory = '/path/to/my/external2Tb'" does not fix this issue for me; disk space in the wrong partition is still being used up.

sherrillmix commented 1 year ago

I assume you did, but just in case: you'd probably want both tmpDir="some_directory" for R's temp files and extraSqlCommand="PRAGMA temp_store_directory = '/path/to/my/external2Tb'" for sqlite's. Otherwise, maybe try some of the other suggestions from #41 and, if none work, report sessionInfo() so we can start digging into things.
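Passing both settings together might look like the sketch below. tempSettings is a hypothetical helper, not part of taxonomizr; it only builds the two argument values from a single directory. The sketch assumes the path contains no single quotes, and the pragma may have no effect in SQLite builds where temp_store_directory is unavailable.

```r
# Hypothetical helper: derive both temp-directory settings from one path.
# Assumes the directory exists and its path contains no single quotes.
tempSettings <- function(dir) {
  dir <- normalizePath(dir, mustWork = TRUE)
  list(
    tmpDir = dir,  # used by R for its own temporary files
    extraSqlCommand = sprintf("PRAGMA temp_store_directory = '%s'", dir)
  )
}

# tempdir() is only a stand-in for the large external drive
settings <- tempSettings(tempdir())
settings$extraSqlCommand

# Not run here because it triggers a multi-gigabyte download:
# taxonomizr::prepareDatabase(
#   "accessionTaxa.sql",
#   tmpDir = settings$tmpDir,
#   extraSqlCommand = settings$extraSqlCommand
# )
```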

sbresnahan commented 1 year ago

Perhaps this is an issue of version incompatibility: temp_store_directory is a deprecated PRAGMA (https://www.sqlite.org/pragma.html#pragma_temp_store_directory).

sbresnahan commented 1 year ago

Setting the SQLITE_TMPDIR environment variable to the desired directory and running prepareDatabase() via command line R instead of RStudio resolved this issue for me.

sherrillmix commented 1 year ago

Yeah, it's annoying that there's no consistent, simple fix within R/RStudio. Setting SQLITE_TMPDIR or TMPDIR ahead of time, or starting R with the variable definition prepended (SQLITE_TMPDIR=/location/of/drive R), both seem like good options. Thanks for the bug report.
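For completeness, the variable can also be set from inside a fresh R session with base R's Sys.setenv. This is a sketch built on an assumption: the thread only confirms that setting SQLITE_TMPDIR before starting command-line R worked, so setting it in-session (before any SQLite connection is opened) may or may not behave the same, particularly in RStudio. useSqliteTmpDir is a hypothetical helper name.

```r
# Hypothetical helper: point SQLITE_TMPDIR at a directory for this session.
# Assumption: SQLite reads the variable when it creates temp files, so this
# must run before prepareDatabase() opens the database.
useSqliteTmpDir <- function(dir) {
  dir <- normalizePath(dir, mustWork = TRUE)
  previous <- Sys.getenv("SQLITE_TMPDIR", unset = NA)
  Sys.setenv(SQLITE_TMPDIR = dir)
  invisible(previous)  # returned so the caller can restore it later
}

# tempdir() is only a stand-in for the large external drive
useSqliteTmpDir(tempdir())
Sys.getenv("SQLITE_TMPDIR")
```

If disk space on the wrong partition still drops after this, starting R with the variable prepended on the command line remains the option the thread reports as working.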