ncbi / sra-tools


Force local file only usage for dockerized fasterq-dump on cloud-delivered data #635

Closed olgabot closed 2 years ago

olgabot commented 2 years ago

Hello, Hope you are well. I am using the ncbi/sra-tools docker image (thank you for providing it!) to run fasterq-dump on cloud delivered dbGap controlled access data. Whenever I try to run fasterq-dump, it always tries to fetch the data remotely, even though the file exists locally. How can I force fasterq-dump to ONLY use the local file?

I tried using the vdb-config -s/repository/remote/disabled=true as mentioned in this issue https://github.com/ncbi/sra-tools/issues/500, but I get the complaint that this command must be run with sudo, and when I run with sudo, it doesn't work at all.

Here's the fasterq-dump version information:

(base)
 ✘ ⚙  Thu 28 Apr - 18:41  /data/fasterq-dump-test 
  docker run -it ncbi/sra-tools fasterq-dump  --version

"fasterq-dump" version 3.0.0
`docker run -it ncbi/sra-tools vdb-config` output

```
(base) ✘ ⚙  Thu 28 Apr - 18:48  /data/fasterq-dump-test
  docker run -it ncbi/sra-tools vdb-config -s/repository/remote/disabled=true
2022-04-28T18:48:25 vdb-config.3.0.0 err: condition violated while updating node - Warning: normally this application should not be run as root/superuser
(base) ✘ ⚙  Thu 28 Apr - 18:48  /data/fasterq-dump-test
  docker run -it ncbi/sra-tools sudo vdb-config -s/repository/remote/disabled=true
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "sudo": executable file not found in $PATH: unknown.
ERRO[0000] error waiting for container: context canceled
```
Folder structure of local SRR files

```
(base) ✘ ⚙  Thu 28 Apr - 18:49  /data/fasterq-dump-test
  ll
Permissions Size User    Date Modified Name
drwxrwxr-x     - olgabot 28 Apr 18:49  SRR1070986
(base) ⚙  Thu 28 Apr - 18:49  /data/fasterq-dump-test
  ll SRR1070986
Permissions Size User    Date Modified Name
.rw-rw-r--  3.8G olgabot 20 Apr 20:19  SRR1070986
.rw-rw-r--  3.8G olgabot 28 Apr 00:17  SRR1070986.sra
.rw-rw-r--  704M olgabot 11 Apr 17:57  SRR1070986.vdbcache
```
`docker run -it ncbi/sra-tools fasterq-dump` output

### Using `SRR1070986/` folder

```
(base) ✘ ⚙  Thu 28 Apr - 18:53  /data/fasterq-dump-test
  docker run -it ncbi/sra-tools fasterq-dump --threads 2 --progress -vvv --log-level info SRR1070986
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to locate.ncbi.nlm.nih.gov (130.14.29.113)
2022-04-28T18:53:44 fasterq-dump.3.0.0: Setting up SSL/TLS structure
2022-04-28T18:53:44 fasterq-dump.3.0.0: Performing SSL/TLS handshake...
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - verifying CA cert
2022-04-28T18:53:44 fasterq-dump.3.0.0: Verifying peer X.509 certificate...
2022-04-28T18:53:44 fasterq-dump.3.0.0: Reading from server...
2022-04-28T18:53:44 fasterq-dump.3.0.0 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1070986' - Access denied - please request permission to access phs000424 / GRU in dbGaP. ( 403 )
Query SRR1070986: Error 403 Access denied - please request permission to access phs000424 / GRU in dbGaP.
2022-04-28T18:53:44 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-28T18:53:44 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-28T18:53:44 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-28T18:53:44 fasterq-dump.3.0.0: Configuring SSl defaults
2022-04-28T18:53:44 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:53:45 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to locate.ncbi.nlm.nih.gov (130.14.29.113)
2022-04-28T18:53:45 fasterq-dump.3.0.0: Setting up SSL/TLS structure
2022-04-28T18:53:45 fasterq-dump.3.0.0: Performing SSL/TLS handshake...
2022-04-28T18:53:45 fasterq-dump.3.0.0: KClientHttpOpen - verifying CA cert
2022-04-28T18:53:45 fasterq-dump.3.0.0: Verifying peer X.509 certificate...
2022-04-28T18:53:45 fasterq-dump.3.0.0: Reading from server...
2022-04-28T18:53:45 fasterq-dump.3.0.0 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1070986' - Access denied - please request permission to access phs000424 / GRU in dbGaP. ( 403 )
fasterq-dump quit with error code 3
```

### Using `SRR1070986/SRR1070986` filename

```
(base) ✘ ⚙  Thu 28 Apr - 00:15  /data/fasterq-dump-test
  docker run -it ncbi/sra-tools fasterq-dump --threads 2 --progress -vvv --log-level info SRR1070986/SRR1070986
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to locate.ncbi.nlm.nih.gov (130.14.29.113)
2022-04-28T00:15:35 fasterq-dump.3.0.0: Setting up SSL/TLS structure
2022-04-28T00:15:35 fasterq-dump.3.0.0: Performing SSL/TLS handshake...
2022-04-28T00:15:35 fasterq-dump.3.0.0: KClientHttpOpen - verifying CA cert
2022-04-28T00:15:35 fasterq-dump.3.0.0: Verifying peer X.509 certificate...
2022-04-28T00:15:35 fasterq-dump.3.0.0: Reading from server...
2022-04-28T00:15:35 fasterq-dump.3.0.0 err: name not found while resolving query within virtual file system module - failed to resolve accession 'SRR1070986/SRR1070986' - no data ( 404 )
2022-04-28T00:15:35 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-28T00:15:35 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-28T00:15:35 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-28T00:15:35 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
```

### Using `SRR1070986/SRR1070986.sra` filename

```
(base) ⚙  Thu 28 Apr - 00:17  /data/fasterq-dump-test
  docker run -it ncbi/sra-tools fasterq-dump --threads 2 --progress -vvv --log-level info SRR1070986/SRR1070986.sra
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-28T18:38:55 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:38:55 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:38:55 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:38:55 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:38:55 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-28T18:38:56 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to locate.ncbi.nlm.nih.gov (130.14.29.113)
2022-04-28T18:38:56 fasterq-dump.3.0.0: Setting up SSL/TLS structure
2022-04-28T18:38:56 fasterq-dump.3.0.0: Performing SSL/TLS handshake...
2022-04-28T18:38:56 fasterq-dump.3.0.0: KClientHttpOpen - verifying CA cert
2022-04-28T18:38:56 fasterq-dump.3.0.0: Verifying peer X.509 certificate...
2022-04-28T18:38:56 fasterq-dump.3.0.0: Reading from server...
2022-04-28T18:38:57 fasterq-dump.3.0.0: Reading from server...
2022-04-28T18:38:57 fasterq-dump.3.0.0: Reading from server...
2022-04-28T18:38:57 fasterq-dump.3.0.0 err: name not found while resolving query within virtual file system module - failed to resolve accession 'SRR1070986/SRR1070986.sra' - no data ( 404 )
SRR1070986/SRR1070986.sra is an SRA Normalized Format file with full base quality scores.
2022-04-28T18:38:57 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-28T18:38:57 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-28T18:38:57 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-28T18:38:57 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
```

I would greatly appreciate your help! Thank you so much. Warmest, Olga

klymenko commented 2 years ago

SRR1070986 is a protected run. You have to use --ngc option to access it.

olgabot commented 2 years ago

> SRR1070986 is a protected run. You have to use --ngc option to access it.

Hi @klymenko, thank you -- I realize it is a protected run, and I already have downloaded the SRR files via the "NCBI Cloud Delivery" option into our bucket:

(base)
 ⚙  Thu 28 Apr - 21:45  /data/fasterq-dump-test 
  ll */*
Permissions Size User    Date Modified Name
.rw-rw-r--  3.8G olgabot 20 Apr 20:19  SRR1070986/SRR1070986
.rw-rw-r--  3.8G olgabot 28 Apr 00:17  SRR1070986/SRR1070986.sra
.rw-rw-r--  704M olgabot 11 Apr 17:57  SRR1070986/SRR1070986.vdbcache

Since I have already downloaded the files, I don't want fasterq-dump to even try to download them, but instead use the local file. Is this possible?

klymenko commented 2 years ago

How did you download it? Did you use prefetch? In that case you had to supply the ngc file via the --ngc option.

If not - send your request to sra-tools@ncbi.nlm.nih.gov.

olgabot commented 2 years ago

Hi Andrew, The SRR files were downloaded using the NCBI Cloud Delivery https://www.ncbi.nlm.nih.gov/sra/docs/data-delivery/ from the NCBI bucket directly into our AWS bucket. We were not able to download the Fastqs directly because they exceeded the 5TB limit for the runs we selected. How do you recommend converting SRR files obtained through the NCBI Cloud Delivery service into fastqs? Thank you! Warmest, Olga


Olga Botvinnik, PhD olgabotvinnik.com http://www.olgabotvinnik.com


durbrow commented 2 years ago

Each invocation of docker run creates a new instance of the image that is unrelated to any previous instances of the image. Your vdb-config command and your subsequent fasterq-dump happened in different instances of the image, and thus the first command had no effect on the second one.

Here is a way to do what you want. First, in an empty directory, create a Dockerfile containing:

FROM ncbi/sra-tools
RUN vdb-config <... the rest of your configuration command ...>

Then run docker build --tag my-sra-tools . to create your own customized image. When you docker run your image, your configuration will be active.
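The steps above could be scripted roughly as follows. This is a minimal sketch, not an official recipe: the function names, the temporary build directory, and the `DOCKER` override variable are illustrative only, and the `vdb-config` arguments (including `--root`) are simply the ones that appear in the Dockerfile posted elsewhere in this thread.

```shell
# Sketch: bake a vdb-config setting into a derived image so that it
# persists across separate `docker run` invocations.
# DOCKER is a hypothetical override (e.g. DOCKER=echo previews the command).

write_dockerfile() {
  # $1: directory that will hold the Dockerfile. Keeping it otherwise
  #     empty keeps the build context small.
  mkdir -p "$1" || return 1
  cat > "$1/Dockerfile" <<'EOF'
FROM ncbi/sra-tools
RUN vdb-config --root -s/repository/remote/disabled=true
EOF
}

build_configured_image() {
  # $1: tag for the derived image (the default name is just an example)
  local tag="${1:-my-sra-tools}"
  local ctx
  ctx="$(mktemp -d)" || return 1
  write_dockerfile "$ctx" || return 1
  "${DOCKER:-docker}" build --tag "$tag" "$ctx"
}
```

Calling `build_configured_image my-sra-tools` would then build the image from a directory that contains only the Dockerfile, which avoids sending large run files to the Docker daemon as build context.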

olgabot commented 2 years ago

Thank you for those suggestions! I'm still not able to get fasterq-dump to recognize the local file and stop trying to fetch from the server, even after running vdb-config -s/repository/remote/disabled=true. Here is the Dockerfile:

(base)
 Fri 29 Apr - 18:30  /data/fasterq-dump-test 
  cat Dockerfile
FROM ncbi/sra-tools
RUN vdb-config --root -s/repository/remote/disabled=true

Here is the docker build log:

(base)
 Fri 29 Apr - 18:31  /data/fasterq-dump-test 
  docker build -t bridgebio/sra-tools .
Sending build context to Docker daemon  8.387GB
Step 1/2 : FROM ncbi/sra-tools
 ---> 4c0b31b98aec
Step 2/2 : RUN vdb-config --root -s/repository/remote/disabled=true
 ---> Using cache
 ---> dea8d96d988b
Successfully built dea8d96d988b
Successfully tagged bridgebio/sra-tools:latest

No matter whether I use the SRR../ as a folder, or the bare SRR file, fasterq-dump keeps erroring out. It seems to recognize the local file, because I see `SRR1070986/SRR1070986.sra is an SRA Normalized Format file with full base quality scores`, but no fastq files are generated.

Here is the local file structure:

(base)
 Fri 29 Apr - 18:38  /data/fasterq-dump-test 
  ll
Permissions Size User    Date Modified Name
.rw-rw-r--    76 olgabot 29 Apr 18:21  Dockerfile
.rw-rw-r--    81 olgabot 29 Apr 18:20  Dockerfile~
drwxrwxr-x     - olgabot 28 Apr 18:49  SRR1070986
(base)
 Fri 29 Apr - 18:38  /data/fasterq-dump-test 
  ll SRR1070986
Permissions Size User    Date Modified Name
.rw-rw-r--  3.8G olgabot 20 Apr 20:19  SRR1070986
.rw-rw-r--  3.8G olgabot 28 Apr 00:17  SRR1070986.sra
.rw-rw-r--  704M olgabot 11 Apr 17:57  SRR1070986.vdbcache
`fasterq-dump` error logs

```
(base) ✘  Fri 29 Apr - 18:34  /data/fasterq-dump-test
  docker run -it bridgebio/sra-tools fasterq-dump --threads 2 --progress -vvv SRR1070986
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-29T18:35:25 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:25 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:25 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:25 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-29T18:35:25 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-29T18:35:25 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-29T18:35:25 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
(base) ✘  Fri 29 Apr - 18:35  /data/fasterq-dump-test
  docker run -it bridgebio/sra-tools fasterq-dump --threads 2 --progress -vvv SRR1070986/
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-29T18:35:29 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:29 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:29 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:29 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-29T18:35:29 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-29T18:35:29 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-29T18:35:29 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
(base) ✘  Fri 29 Apr - 18:35  /data/fasterq-dump-test
  docker run -it bridgebio/sra-tools fasterq-dump --threads 2 --progress -vvv SRR1070986/SRR1070986
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-29T18:35:33 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:33 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:33 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:33 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-29T18:35:33 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-29T18:35:33 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-29T18:35:33 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
(base) ✘  Fri 29 Apr - 18:35  /data/fasterq-dump-test
  docker run -it bridgebio/sra-tools fasterq-dump --threads 2 --progress -vvv SRR1070986/SRR1070986.sra
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-04-29T18:35:37 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:37 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-04-29T18:35:37 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
SRR1070986/SRR1070986.sra is an SRA Normalized Format file with full base quality scores.
2022-04-29T18:35:37 fasterq-dump.3.0.0: Seeding the random number generator
2022-04-29T18:35:37 fasterq-dump.3.0.0: Loading CA root certificates
2022-04-29T18:35:37 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-04-29T18:35:37 fasterq-dump.3.0.0: Configuring SSl defaults
fasterq-dump quit with error code 3
```

The VDB config seems correct because I see:

  <repository>
    <remote>
      <disabled>true</disabled>

Here is the full config output:

`vdb-config --all` output

```
(base) ✘  Fri 29 Apr - 18:30  /data/fasterq-dump-test
  docker run -it bridgebio/sra-tools vdb-config --all
vdb-config / RELEASE /root ed62e7c0-5f27-4de3-b3d2-96b2565afb17 ed62e7c0-5f27-4de3-b3d2-776250d04266 /root/.ncbi /root/.ncbi/user-settings.mkfg linux true 64 96b2565afb17 /root/.ncbi user-settings.mkfg true true
https://trace.ncbi.nlm.nih.gov/Traces/names/names.fcgi https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve
https://trace.ncbi.nlm.nih.gov/Traces/names/names.fcgi https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve
. . . . . . .
files nannot nannot refseq sra sra sra wgs
raw_scores https://trace.ncbi.nlm.nih.gov/Traces/names/names.fcgi https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve 450m /usr/local/bin
/root/.ncbi/user-settings.mkfg
```

Am I missing something with running fasterq-dump on local files?

durbrow commented 2 years ago

When you docker run, are you mounting the directory into the container? Like so:

docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools ls -l

If that lists the directory with the run file, then

docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools fasterq-dump <...>

should find it.
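A small wrapper along these lines can make the mount hard to forget. It is only a sketch: `sra_run`, `dump_local_run`, and the `DOCKER` override are hypothetical helper names, and `-it` is left out on the assumption that the wrapper may also run without a terminal (add it back for interactive use).

```shell
# Sketch: always mount the current directory at /work and run there, so
# paths relative to $PWD on the host resolve inside the container.

sra_run() {
  # $1: image name; remaining arguments: command to run in the container
  local image="$1"
  shift
  "${DOCKER:-docker}" run --rm -v "$PWD":/work:rw -w /work "$image" "$@"
}

dump_local_run() {
  # $1: image name; $2: path to a local .sra file, relative to $PWD
  local image="$1" sra="$2"
  if [ ! -f "$sra" ]; then
    # Fail early on the host instead of letting the tool look elsewhere.
    echo "missing local file: $sra" >&2
    return 1
  fi
  sra_run "$image" fasterq-dump --threads 2 --progress "$sra"
}
```

For example, `dump_local_run bridgebio/sra-tools SRR1070986/SRR1070986.sra` checks on the host that the file exists before starting the container.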

olgabot commented 2 years ago

@durbrow Thank you so much, the docker mounted directory was the issue! I now get a new error (yay!) related to the references, which is related to https://github.com/ncbi/sra-tools/issues/202, https://github.com/ncbi/sra-tools/issues/447, https://github.com/ncbi/sra-tools/issues/318:

`fasterq-dump quit with error code 3` log

```
(base) Mon 2 May - 16:37  /data/fasterq-dump-test
  docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools fasterq-dump --threads 2 --progress -vvv SRR1070986/SRR1070986.sra
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-05-02T16:37:23 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-05-02T16:37:23 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-05-02T16:37:23 fasterq-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
SRR1070986/SRR1070986.sra is an SRA Normalized Format file with full base quality scores.
2022-05-02T16:37:23 fasterq-dump.3.0.0: Seeding the random number generator
2022-05-02T16:37:23 fasterq-dump.3.0.0: Loading CA root certificates
2022-05-02T16:37:23 fasterq-dump.3.0.0: Parsing text for default CA root certificates
2022-05-02T16:37:23 fasterq-dump.3.0.0: Configuring SSl defaults
lookup :2022-05-02T16:37:24 fasterq-dump.3.0.0: starting background thread loop
2022-05-02T16:37:24 fasterq-dump.3.0.0: collecting batch
| 0.00%2022-05-02T16:37:24 fasterq-dump.3.0.0 warn: directory not found while opening manager within virtual file system module - can't open NC_000001.10 as a RefSeq or as a WGS
2022-05-02T16:37:24 fasterq-dump.3.0.0 err: cmn_iter.c cmn_read_String( #1 ).VCursorCellDataDirect() -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2022-05-02T16:37:24 fasterq-dump.3.0.0 err: sorter.c get_from_raw_read_iter( 1 ) -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2022-05-02T16:37:24 fasterq-dump.3.0.0 warn: directory not found while opening manager within virtual file system module - can't open NC_000014.8 as a RefSeq or as a WGS
2022-05-02T16:37:24 fasterq-dump.3.0.0 err: cmn_iter.c cmn_read_String( #45383109 ).VCursorCellDataDirect() -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2022-05-02T16:37:24 fasterq-dump.3.0.0 err: sorter.c run_producer_pool().join_and_release_threads -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2022-05-02T16:37:24 fasterq-dump.3.0.0: KQueuePop() : RC(rcPS,rcSemaphore,rcWaiting,rcTimeout,rcExhausted), store = 0
2022-05-02T16:37:24 fasterq-dump.3.0.0 err: sorter.c execute_lookup_production() -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
merge : 2022-05-02T16:37:24 fasterq-dump.3.0.0 err: fasterq-dump.c produce_lookup_files() -> RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)
2022-05-02T16:37:24 fasterq-dump.3.0.0: KQueuePop() : RC(rcPS,rcSemaphore,rcWaiting,rcTimeout,rcExhausted), store = 0
fasterq-dump quit with error code 3
```
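With many runs, it may help to pull the unresolved reference names out of a saved log instead of reading it by eye. The sketch below assumes the warning wording shown in the log above ("can't open ... as a RefSeq"); `missing_refs` is a hypothetical helper name, and the match would need adjusting if another sra-tools version phrases the warning differently.

```shell
# Sketch: list each reference that fasterq-dump reported it could not open.
# Reads a log on stdin and prints one accession per line, deduplicated.

missing_refs() {
  # grep -o emits every match, even when several messages share one line;
  # the accession is the third whitespace-separated word of each match.
  grep -o "can't open [^ ]* as a RefSeq" | awk '{ print $3 }' | sort -u
}
```

A call such as `missing_refs < fasterq-dump.log` (the log file name is an assumption) would print one accession per line for the run above, including `NC_000001.10` and `NC_000014.8`.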

Here is what I see for the alignment info, which shows the reference information for my accession of interest:

`align-info` output

```
(base) ✘  Mon 2 May - 16:57  /data/fasterq-dump-test
  docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools align-info -vvv SRR1070986
2022-05-02T16:57:52 align-info.3.0.0: Seeding the random number generator
2022-05-02T16:57:52 align-info.3.0.0: Loading CA root certificates
2022-05-02T16:57:52 align-info.3.0.0: Parsing text for default CA root certificates
2022-05-02T16:57:52 align-info.3.0.0: Configuring SSl defaults
GL000191.1,GL000191.1,false,remote
GL000192.1,GL000192.1,false,remote
GL000193.1,GL000193.1,false,remote
GL000194.1,GL000194.1,false,remote
GL000195.1,GL000195.1,false,remote
GL000196.1,GL000196.1,false,remote
GL000197.1,GL000197.1,false,remote
GL000198.1,GL000198.1,false,remote
GL000199.1,GL000199.1,false,remote
GL000200.1,GL000200.1,false,remote
GL000201.1,GL000201.1,false,remote
GL000204.1,GL000204.1,false,remote
GL000205.1,GL000205.1,false,remote
GL000206.1,GL000206.1,false,remote
GL000208.1,GL000208.1,false,remote
GL000211.1,GL000211.1,false,remote
GL000212.1,GL000212.1,false,remote
GL000213.1,GL000213.1,false,remote
GL000214.1,GL000214.1,false,remote
GL000215.1,GL000215.1,false,remote
GL000216.1,GL000216.1,false,remote
GL000217.1,GL000217.1,false,remote
GL000218.1,GL000218.1,false,remote
GL000219.1,GL000219.1,false,remote
GL000220.1,GL000220.1,false,remote
GL000221.1,GL000221.1,false,remote
GL000222.1,GL000222.1,false,remote
GL000223.1,GL000223.1,false,remote
GL000224.1,GL000224.1,false,remote
GL000225.1,GL000225.1,false,remote
GL000227.1,GL000227.1,false,remote
GL000228.1,GL000228.1,false,remote
GL000229.1,GL000229.1,false,remote
GL000230.1,GL000230.1,false,remote
GL000231.1,GL000231.1,false,remote
GL000232.1,GL000232.1,false,remote
GL000233.1,GL000233.1,false,remote
GL000235.1,GL000235.1,false,remote
GL000236.1,GL000236.1,false,remote
GL000237.1,GL000237.1,false,remote
GL000238.1,GL000238.1,false,remote
GL000239.1,GL000239.1,false,remote
GL000240.1,GL000240.1,false,remote
GL000241.1,GL000241.1,false,remote
GL000242.1,GL000242.1,false,remote
GL000243.1,GL000243.1,false,remote
GL000247.1,GL000247.1,false,remote
GL000248.1,GL000248.1,false,remote
GL000249.1,GL000249.1,false,remote
NC_000001.10,1,false,remote
NC_000002.11,2,false,remote
NC_000003.11,3,false,remote
NC_000004.11,4,false,remote
NC_000005.9,5,false,remote
NC_000006.11,6,false,remote
NC_000007.13,7,false,remote
NC_000008.10,8,false,remote
NC_000009.11,9,false,remote
NC_000010.10,10,false,remote
NC_000011.9,11,false,remote
NC_000012.11,12,false,remote
NC_000013.10,13,false,remote
NC_000014.8,14,false,remote
NC_000015.9,15,false,remote
NC_000016.9,16,false,remote
NC_000017.10,17,false,remote
NC_000018.9,18,false,remote
NC_000019.9,19,false,remote
NC_000020.10,20,false,remote
NC_000021.8,21,false,remote
NC_000022.10,22,false,remote
NC_000023.10,X,false,remote
NC_000024.9,Y,false,remote
NC_012920.1,MT,true,remote
```
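To turn this listing into a plain list of references that still need to be fetched, something like the following could work. It is a sketch under two assumptions: that each data row has exactly four comma-separated fields, and that the fourth field says where the reference resolves from (`remote` here). `remote_refs` is a hypothetical helper name.

```shell
# Sketch: print the first field of every align-info row whose fourth
# field is "remote". Log lines without four comma-separated fields are
# ignored.

remote_refs() {
  # `docker run -t` emits CRLF line endings, so drop carriage returns
  # before comparing the last field.
  tr -d '\r' | awk -F, 'NF == 4 && $4 == "remote" { print $1 }'
}
```

Piping the `align-info` output through `remote_refs` would give one reference accession per line for this run.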

And here is the vdb-dump output

`vdb-dump` output

```
(base) Mon 2 May - 16:57  /data/fasterq-dump-test
  docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools vdb-dump --info -vvv SRR1070986
Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
2022-05-02T16:58:46 vdb-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-05-02T16:58:46 vdb-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-05-02T16:58:46 vdb-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
SRR1070986 is an SRA Normalized Format file with full base quality scores.
2022-05-02T16:58:46 vdb-dump.3.0.0: Seeding the random number generator
2022-05-02T16:58:46 vdb-dump.3.0.0: Loading CA root certificates
2022-05-02T16:58:46 vdb-dump.3.0.0: Parsing text for default CA root certificates
2022-05-02T16:58:46 vdb-dump.3.0.0: Configuring SSl defaults
2022-05-02T16:58:46 vdb-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to 169.254.169.254 (169.254.169.254)
2022-05-02T16:58:47 vdb-dump.3.0.0: KClientHttpOpen - connected from '172.17.0.2' to locate.ncbi.nlm.nih.gov (130.14.29.113)
2022-05-02T16:58:47 vdb-dump.3.0.0: Setting up SSL/TLS structure
2022-05-02T16:58:47 vdb-dump.3.0.0: Performing SSL/TLS handshake...
2022-05-02T16:58:47 vdb-dump.3.0.0: KClientHttpOpen - verifying CA cert
2022-05-02T16:58:47 vdb-dump.3.0.0: Verifying peer X.509 certificate...
2022-05-02T16:58:47 vdb-dump.3.0.0: Reading from server...
2022-05-02T16:58:47 vdb-dump.3.0.0 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1070986' - Access denied - please request permission to access phs000424 / GRU in dbGaP. ( 403 )
acc : SRR1070986
path : /work/SRR1070986/SRR1070986.sra
size : 3,841,741,442
type : Database
platf : SRA_PLATFORM_ILLUMINA
SEQ : 51,448,415
REF : 620,308
PRIM : 90,766,215
SEC : 4,192,942
SCHEMA : NCBI:align:db:alignment_sorted#1.3
TIME : 0x00000000542ed882 (10/03/2014 17:10)
FMT : BAM
FMTVER : 2.4.1
LDR : bam-load.2.4.1
LDRVER : 2.4.1
LDRDATE: Sep 23 2014 (9/23/2014 0:0)
BAMHDR : 142292 bytes / 187 lines
BAMHDR : 1 HD-lines
BAMHDR : 85 SQ-lines
BAMHDR : 25 RG-lines
BAMHDR : 75 PG-lines
```

I need to run this on ~1500 files. I see that I can create a /etc/ncbi/user-settings.kfg file with pre-specified paths of RefSeq files here: https://github.com/ncbi/sra-tools/issues/416#issuecomment-802946574.

Is it possible to prefetch the references only? I want to make sure I'm downloading the exact version of the genome, with the correct folder/file structure that is necessary for fasterq-dump. I don't see the option to fetch only references in prefetch --help, but maybe I'm missing something.

`prefetch --help` output

```
(base) Mon 2 May - 16:58  /data/fasterq-dump-test
  docker run -it -v $PWD:/work:rw -w /work --rm bridgebio/sra-tools prefetch --help
Usage: prefetch [ options ] [ accessions(s)... ]
Parameters:
  accessions(s) list of accessions to process
Options:
  -T|--type Specify file type to download. Default: sra
  -N|--min-size Minimum file size to download in KB (inclusive).
  -X|--max-size Maximum file size to download in KB (exclusive). Default: 20G
  -f|--force Force object download - one of: no, yes, all, ALL. no [default]: skip download if the object if found and complete; yes: download it even if it is found and is complete; all: ignore lock files (stale locks or it is being downloaded by another process - use at your own risk!); ALL: ignore lock files, restart download from beginning
  -p|--progress Show progress
  -r|--resume Resume partial downloads - one of: no, yes [default]
  -C|--verify Verify after download - one of: no, yes [default]
  -c|--check-all Double-check all refseqs
  -S|--check-rs Check for refseqs in downloaded files: one of : no, yes, smart[default]. Smart: skip check for large encrypted non - sra files
  -o|--output-file Write file to when downloading single file
  -O|--output-directory Save files to /
  --ngc to ngc file
  --perm to permission file
  --location location in cloud
  --cart to cart file
  -V|--version Display the version of the program
  -v|--verbose Increase the verbosity of the program status messages. Use multiple times for more verbosity.
  -L|--log-level Logging level as number or enum string. One of (fatal|sys|int|err|warn|info|debug) or (0-6) Current/default is warn
  --option-file file Read more options and parameters from the file.
  -h|--help print this message
"prefetch" version 3.0.0
```

Thank you so much!!

klymenko commented 2 years ago

> Is it possible to prefetch the references only?

You can prefetch the references for an already prefetched run. If you have just SRR1070986.sra in SRR1070986 and no refseqs, run prefetch SRR1070986/SRR1070986.sra
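For a large batch of runs, the same command could be looped over every run directory. This is a sketch only: `prefetch_refs` and the `DOCKER` override are hypothetical names, the `ACC/ACC.sra` layout is taken from the directory listings earlier in the thread, and it presumably needs an image in which remote access has not been disabled, since the references have to be downloaded.

```shell
# Sketch: run `prefetch <dir>/<acc>.sra` for each run directory so that
# the references for already-downloaded runs are fetched.

prefetch_refs() {
  # $1: image name; remaining arguments: run directories (e.g. SRR1070986/)
  local image="$1"
  shift
  local dir acc
  for dir in "$@"; do
    dir="${dir%/}"               # tolerate a trailing slash
    acc="$(basename "$dir")"
    if [ -f "$dir/$acc.sra" ]; then
      "${DOCKER:-docker}" run --rm -v "$PWD":/work:rw -w /work "$image" \
        prefetch "$dir/$acc.sra"
    else
      echo "skipping $dir: no $acc.sra found" >&2
    fi
  done
}
```

For example, `prefetch_refs ncbi/sra-tools SRR*/` would visit each run directory under the current one. Protected data may additionally need the `--ngc` option mentioned above; that is not shown here.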

klymenko commented 2 years ago

@olgabot, did you resolve your issue?

olgabot commented 2 years ago

Yes, I needed to have this user-settings.mkfg file:

/repository/remote/disabled = "true"

# This forces usage of a local refseq folder instead of pulling from NCBI every time
/repository/site/main/archive/apps/refseq/volumes/refseq = "refseq"
/repository/site/main/archive/root = "PWD"

And do some fun file gymnastics in my pipeline code to make it work:

    # Combine pipeline-provided plus vdb-configured ncbi settings into one
    # This forces usage of a local refseq folder instead of pulling from NCBI every time
    sed "s:PWD:\$PWD:" ${ncbi_settings} | cat - \$NCBI_SETTINGS >> new_ncbi_settings.mkfg
    echo "\n--- cat NCBI_SETTINGS ---"
    cat \$NCBI_SETTINGS
    mv ${ncbi_settings} old_ncbi_settings.txt
    echo '\n--- cat new_ncbi_settings.mkfg ---'
    cat new_ncbi_settings.mkfg

klymenko commented 2 years ago

You generate the path to site repository every time. Don't you have a permanent value?

You can create a directory with configuration files (*.kfg) and export VDB_CONFIG to point to this directory instead of creating ~/.ncbi/user-settings.mkfg.
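A rough sketch of that suggestion, generating the file at run time so the site root can differ per job. The three settings are copied from the user-settings.mkfg shown earlier in this thread; the function name `write_local_kfg` and the file name `local-only.kfg` are made up for illustration, and whether these keys behave identically when loaded through `VDB_CONFIG` has not been verified here.

```shell
# Sketch: write a .kfg file into its own directory, which VDB_CONFIG can
# then point at instead of relying on ~/.ncbi/user-settings.mkfg.

write_local_kfg() {
  # $1: directory that will hold the .kfg file
  # $2: absolute path that contains the local "refseq" folder
  local cfg_dir="$1" root="$2"
  mkdir -p "$cfg_dir" || return 1
  cat > "$cfg_dir/local-only.kfg" <<EOF
/repository/remote/disabled = "true"
/repository/site/main/archive/apps/refseq/volumes/refseq = "refseq"
/repository/site/main/archive/root = "$root"
EOF
}
```

Inside a sandboxed task this might be used as `write_local_kfg "$PWD/ncbi-config" "$PWD"` followed by `export VDB_CONFIG="$PWD/ncbi-config"`. When the tools run in a container, both paths would need to be the paths as seen inside the container (for example under `/work`), with the variable passed via `docker run -e VDB_CONFIG=...`.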

olgabot commented 2 years ago

Hello, Unfortunately, I do not have a permanent value. This is for a Nextflow pipeline on AWS Batch, so each individual pipeline run is run in its own sandboxed environment with a docker container and custom path, so I cannot reference an absolute path. This is the best workaround I've found, as sra-tools uses only filesystems, while Nextflow can use both filesystems and blob stores, and I couldn't use e.g. an s3:// path in the user-settings.mkfg file as it must be on a filesystem. The best I can do is to create a local .mkfg file. Warmest, Olga