change db and cache path

bergalu commented 10 months ago

Good afternoon to everybody,

I need to change the location of the cache folder, and I am wondering whether there is a way to do that.

Moreover, I have to place the different databases in folders outside the cache folder. How can I do that? I downloaded the database on another computer and copied it to the workstation where it will be used, but I am at a loss as to how to tell taxizedb where to pick up the database when needed.

I'm a newbie to R and taxizedb, so I apologise if my questions sound trivial.

Many thanks in advance, Luca

stitam commented 10 months ago

Hi @bergalu, thanks for raising this issue; it is not trivial.

In taxizedb, caching is managed through the hoardr package (https://github.com/ropensci/hoardr). In short, you can get the current cache path using tdb_cache$cache_path_get() and set it using tdb_cache$cache_path_set(). You can access the help page with ?tdb_cache or visit the GitHub page for hoardr for more information. Does this help?
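
As a minimal sketch of those two calls: the directory below is only a placeholder, and full_path is the hoardr argument for giving an absolute path rather than one built from the default cache location.

```r
library(taxizedb)

# Where taxizedb currently looks for its databases.
tdb_cache$cache_path_get()

# Placeholder directory (an assumption for illustration); replace it with your own.
cache_dir <- "/path/to/cache"

# Point the cache at that directory; the new path is returned.
tdb_cache$cache_path_set(full_path = cache_dir)
```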

bergalu commented 10 months ago

Hi @stitam , many thanks for your prompt response.

By default, the cache path in the workstation where I need to run taxizedb is:

tdb_cache$cache_path_get()
[1] "~/.cache/R/taxizedb"

I want the cache folder to be: /gscratch/databases/

I had a look at the hoardr documentation and I succeeded in changing the absolute path by executing:

tdb_cache$cache_path_set(full_path = '/gscratch/databases')
[1] "/gscratch/databases"

However, when I exit R, start it again later, and reload the taxizedb package, the cache path is reset to the default one. 1) Is there a way to make R retain the wanted cache path (/gscratch/databases)?

Moreover, I tried to download the ncbi database with the db_download_ncbi command, but it fails:

db_download_ncbi(verbose = TRUE, overwrite = FALSE)
downloading...
Error in curl::curl_download(db_url, db_path_file, quiet = TRUE) :
  Timeout was reached: [] Failed to connect to ftp.ncbi.nih.gov port 21 after 7984 ms: Connection timed out

I verified with wget that I can connect to and download from https://ftp.ncbi.nih.gov/pub/taxonomy/ but not from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, so maybe this is the problem in my case.

Anyway, I tried to work around it by downloading the file myself: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip

but then, 2) which steps should I follow in order to build the sql database and link it properly to taxizedb?

Many thanks in advance, Luca

arendsee commented 7 months ago

@bergalu @stitam I've run into the same issue with the curl command timing out. So, like Luca, I downloaded the NCBI taxonomy dump myself and hit the same problem of figuring out how to make taxizedb process the zip file.

To solve the problem, I forked the repo and added a path option to each of the db_download_* functions (these include the db_download_ncbi function we are both using). With path we can specify our own input file and it will be passed into the same setup code as the file retrieved by default through curl.

So you can do the following:

taxizedb::db_download_ncbi(path="taxdmp.zip")

Where "taxdmp.zip" is your locally downloaded file. The zip file will be processed into an sqlite database and managed under hoardr.
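
If you want the call to fail early with a clear message when the local file is missing, it could be wrapped roughly as below. This is only a sketch: build_ncbi_from_local is a hypothetical helper name, not part of taxizedb, and the path argument is assumed to be available only once the fork is installed.

```r
# Hypothetical helper (not part of taxizedb): build the NCBI database
# from a taxonomy dump that was downloaded by hand.
build_ncbi_from_local <- function(zip_file = "taxdmp.zip") {
  # Stop before calling taxizedb if the local file is not there.
  if (!file.exists(zip_file)) {
    stop("Local taxonomy dump not found: ", zip_file)
  }
  # `path` is the option added in the fork; this assumes the fork is installed.
  taxizedb::db_download_ncbi(path = zip_file)
}
```

After fetching taxdmp.zip over HTTPS (for example with wget), calling build_ncbi_from_local("taxdmp.zip") would then hand the file to the same setup code.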

My fork is at https://github.com/arendsee/taxizedb. If this looks good, I can make a PR.

stitam commented 7 months ago

Thanks @arendsee for working on this. I looked at your commit and it looks good. I was wondering whether it is good practice to (optionally) eliminate the "download" part from functions that have "download" in their names, but it's probably fine. This is also good for reproducibility: if someone wants to store the downloaded raw files as well, they can.

Can you please open the PR?

ropensci / taxizedb

change db and cache path #71