I've tried playing around with AWS, but I can't make an account (I don't own a credit card). From what I've seen, it should be doable to implement as an alternative, though. The only immediate obstacle that comes to mind is finding the S3 or https link from the metadata.
You don't need an account to download.
NCBI download link: https://sra-download.ncbi.nlm.nih.gov/traces/era20/ERR/ERR3638/ERR3638806
AWS download link: https://sra-pub-run-odp.s3.amazonaws.com/sra/ERR3638806/ERR3638806
Please check the "Data access" tab: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3638806
Ah, but that's via browser. I assumed I needed the AWS shell to download, but `wget` works just fine. This should make things manageable.
I tested out a couple of SRA IDs, and it looks like the Amazon download link is always `https://sra-pub-run-odp.s3.amazonaws.com/sra/SRA-ID/SRA-ID`, and the file has to be processed by `fastq-dump` afterwards. I can have amalgkit construct the download link, download with `wget` & `fastq-dump`, and feed the result into post-processing with the existing pipeline. This should be fairly straightforward. Maybe I can have `amalgkit getfastq` fall back to the current pipeline if this download fails.
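As a minimal sketch of that idea (the helper name `aws_sra_url` is illustrative, not amalgkit's actual code), constructing and fetching the AWS URL could look like this:

```python
import shutil
import urllib.request

def aws_sra_url(sra_id):
    # URL pattern observed above: the bucket path repeats the run accession
    return 'https://sra-pub-run-odp.s3.amazonaws.com/sra/{0}/{0}'.format(sra_id)

sra_id = 'ERR3638806'
with urllib.request.urlopen(aws_sra_url(sra_id)) as response, open(sra_id + '.sra', 'wb') as out:
    shutil.copyfileobj(response, out)  # stream to disk instead of loading into memory
# the downloaded .sra file would then go through fastq-dump as usual
```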
The NCBI link is a different issue, since the link can be constructed differently between different SRA IDs. For example:
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR10303736/SRR10303736.1
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/DRR002341/DRR002341.1
I will try to find a way to query SRA for a download link from inside amalgkit. It's not part of the metadata, but it must be retrievable somehow.
GCP links are also not as rigid in structure. Then there is the issue of free egress, which may require an account, may not be free at all, or may only be free in the US; this is especially the case with GCP. For the five IDs I manually checked, AWS was always free worldwide.
OK. With the latest version of sra-tools, I can retrieve the AWS, NCBI or GCP links with `srapath SRR10303736`. This will require the user to configure their sra-tools via `vdb-config`, like this: https://edwards.sdsu.edu/research/accessing-sra-in-the-cloud/
I can only retrieve one of the three links, depending on how `vdb-config` is set up.
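Calling it from Python would just wrap the CLI (a sketch, assuming sra-tools is on the PATH and `vdb-config` has been set up as in the linked guide):

```python
import subprocess

# srapath prints a single URL, chosen by whichever provider vdb-config selects
result = subprocess.run(['srapath', 'SRR10303736'], capture_output=True, text=True, check=True)
print(result.stdout.strip())
```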
I want to bypass `vdb-config` (...and hence `prefetch` and `srapath`) because it doesn't currently work in Docker/Singularity containers. We still need `fastq-dump`, but I assume that `vdb-config`'s authentication isn't necessary for it (need to make sure).
The download links seem to be included in SRA's XML file. You can modify `Metadata.from_xml()` to add the links to the metadata table in `amalgkit metadata`, and then sequentially try those links to download the .sra file with wget in `amalgkit getfastq`.
Thanks for the tip! I could successfully add all three links to the metadata.tsv with this code:

```python
items.append(["NCBI_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="NCBI"]/@url')])
items.append(["AWS_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="AWS"]/@url')])
items.append(["GCP_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="GCP"]/@url')])
```
For handling the download from within Python, I'm thinking about two options:

- the `wget` module is very simple to use, but requires an additional dependency to be installed
- `urllib.request` is similarly simple in usage and comes with Python, but may be deprecated in the future

I'd like to keep dependencies at a minimum, but there is a chance that `urllib.request` will break in the future. Do you have any thoughts on this?
`urllib.request` itself will be OK as long as we avoid the legacy interface. `urllib` is Python's standard library, but some OS distributions don't have `wget`, so why don't we prioritize `urllib.request`. The code chunk may look like this:
```python
import itertools

urls = [amazon, gcp, ncbi]      # download URLs taken from the metadata table
methods = ['urllib', 'wget']    # try urllib for all URLs first, then wget
for method, url in itertools.product(methods, urls):
    try:
        new_download_fun(url, method)   # placeholder download function
        print('happily downloaded!')
        break
    except Exception:
        print('gosh!')
```
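One possible shape for the `new_download_fun` placeholder (a sketch under my own assumptions; the `wget` branch uses the third-party wget package):

```python
import shutil
import urllib.request

def new_download_fun(url, method, outpath='run.sra'):
    if method == 'urllib':
        # non-legacy interface: urlopen instead of the legacy urlretrieve
        with urllib.request.urlopen(url, timeout=60) as response, open(outpath, 'wb') as out:
            shutil.copyfileobj(response, out)
    elif method == 'wget':
        import wget  # third-party package, imported only when this method is tried
        wget.download(url, out=outpath)
```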
Amalgkit version 0.5.2.0
-- Download of .sra files is now possible via AWS, NCBI or GCP, which indeed seems much faster than `prefetch`
-- This can be done by setting any of the three flags `--aws`, `--ncbi`, `--gcp` to `yes`
https://github.com/kfuku52/amalgkit/commit/be63870a54d7acf47db87fa8216f585231c54d16
Please use `strtobool`, not `str`, for the yes/no options, and the default should be `yes` for all of them.
Also, can you follow my pseudocode above? In your code, amalgkit will not try to reach NCBI and GCP when AWS is unavailable. Is there any reason why you implemented it that way?
Of course! This is a work-in-progress build and fairly bare-bones. The `wget` method is also not implemented yet. I need to do some testing to see which errors are thrown due to unavailability and which are actual issues with the download method.
I'm not quite sure if we are on the same page. With the current implementation, could you tell me what you expect to happen if `--aws`, `--ncbi`, and `--gcp` are all `yes` but AWS is unreachable?
In the current, simple implementation, `getfastq` will say, "You set multiple sources to yes, we will only try the first one (AWS)"; when AWS is unavailable, it'll say, "this service may be unavailable, please try a different one" and just terminate.
The next step would be to actually try all sources, and if everything fails, either terminate or switch to standard `prefetch`, as sketched below.
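That fallback could look roughly like this (a sketch; `download_all_sources` is a hypothetical wrapper around the loop above):

```python
import subprocess

try:
    download_all_sources(sra_id)   # hypothetical: raises if AWS, NCBI and GCP all fail
except Exception:
    # every cloud source failed; fall back to standard sra-tools prefetch
    subprocess.run(['prefetch', sra_id], check=True)
```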
When this works with `urllib.request`, I'll additionally try all three sources with `wget` as well, just in case it's a different issue.
Sounds good, thank you for clarifying.
Amalgkit version 0.5.2.1
-- changed `--aws`, `--ncbi` and `--gcp` to `strtobool` type (default `no`)
-- will now try all sources with the `yes` option and falls back to `prefetch` if none of them work (line 200)
-- adjusted formatting of `getfastq`
https://github.com/kfuku52/amalgkit/commit/932fb286bd511e2f3b88e23924c580824d59c52c and https://github.com/kfuku52/amalgkit/commit/8f3fb6ce8740001aa215acba5965703e5a0e7994#diff-3c0a8df92992cf9efdc5d528c66dfd60a7c42499c72901d81a4f83346f76191c
I will now try to add the `wget` download method as well.
Amalgkit version 0.5.2.2
-- `wget` is now also supported
-- `wget` is not required and will only be tried if `urllib.request` has already failed
https://github.com/kfuku52/amalgkit/commit/73feb7e1825e4df423b71eb470c0d746eff0164e
Since NCBI migrated to AWS/GCP, the ASPERA/FTP downloads are not working. We should deprecate the related options.
I'm not sure if `prefetch` tries downloading from NCBI's server only, but if so, we should add new options for https download from AWS and GCP. The AWS download was significantly faster than NCBI, at least for this data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3638806