kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

getfastq download option #21

Closed kfuku52 closed 3 years ago

kfuku52 commented 3 years ago

Since NCBI migrated to AWS/GCP, the ASPERA/FTP downloads are no longer working. We should deprecate the related options.

I'm not sure whether prefetch tries downloading from NCBI's server only, but if so, we should add new options for https downloads from AWS and GCP. The AWS download was significantly faster than NCBI's, at least for this dataset: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3638806

Hego-CCTB commented 3 years ago

I've tried playing around with AWS, but I can't make an account (I don't own a credit card). From what I've seen, it should be doable to implement as an alternative, though. The only immediate obstacle that comes to mind is finding the S3 or https link from the metadata.

kfuku52 commented 3 years ago

You don't need an account for downloading.

NCBI download link: https://sra-download.ncbi.nlm.nih.gov/traces/era20/ERR/ERR3638/ERR3638806
AWS download link: https://sra-pub-run-odp.s3.amazonaws.com/sra/ERR3638806/ERR3638806

kfuku52 commented 3 years ago

Please check the "Data access" tab. https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3638806

Hego-CCTB commented 3 years ago

Ah, but that's via the browser. I assumed I needed the AWS CLI to download, but wget works just fine. This should make things manageable.

Hego-CCTB commented 3 years ago

I tested a couple of SRA IDs, and it looks like the Amazon download link is always https://sra-pub-run-odp.s3.amazonaws.com/sra/SRA-ID/SRA-ID, and the file has to be processed by fastq-dump afterwards. I can have amalgkit construct the download link, download with wget, run fastq-dump, and feed the result into post-processing with the existing pipeline. This should be fairly straightforward. Maybe I can have amalgkit getfastq fall back to the current download pipeline if this download fails.
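
Something like this, as a minimal sketch (aws_sra_url is an illustrative name, not an amalgkit function, and wget is assumed to be on PATH):

    import subprocess

    def aws_sra_url(run_id):
        # Observed pattern: https://sra-pub-run-odp.s3.amazonaws.com/sra/<ID>/<ID>
        return 'https://sra-pub-run-odp.s3.amazonaws.com/sra/{0}/{0}'.format(run_id)

    # Download the .sra file with wget; fastq-dump then processes it as usual.
    subprocess.run(['wget', '-O', 'ERR3638806.sra', aws_sra_url('ERR3638806')], check=True)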

The NCBI link is a different issue, since the link is constructed differently for different SRA IDs. For example:

https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR10303736/SRR10303736.1
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/DRR002341/DRR002341.1

I will try to find a way to query SRA for a download link from inside amalgkit. It's not part of the metadata, but it must be retrievable somehow.

GCP links are also less rigid in structure. Then there is the issue of egress fees: downloads may require an account, may not be free at all, or may be free only within the US. This is especially the case with GCP. For the five IDs I checked manually, AWS was always free worldwide.

Hego-CCTB commented 3 years ago

OK. With the latest version of sra-tools, I can retrieve the AWS, NCBI, or GCP links with srapath SRR10303736. This will require the user to configure their sra-tools via vdb-config as described here: https://edwards.sdsu.edu/research/accessing-sra-in-the-cloud/

I can only retrieve one of the three links, depending on how vdb-config is set up.
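
From inside Python, that boils down to wrapping srapath with subprocess (a sketch, assuming sra-tools is on PATH and vdb-config is set up as above):

    import subprocess

    # srapath resolves a run accession to a single URL; which source you get
    # (NCBI, AWS, or GCP) depends on the vdb-config settings.
    result = subprocess.run(['srapath', 'SRR10303736'],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())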

kfuku52 commented 3 years ago

I want to bypass vdb-config (...and hence prefetch and srapath) because it currently doesn't work in Docker/Singularity containers. We still need fastq-dump, but I assume that vdb-config's authentication isn't necessary for it (we need to make sure).

The download links seem to be included in SRA's XML file. You can modify Metadata.from_xml() to add the links to the metadata table in amalgkit metadata, and then sequentially try those links to download .sra with wget in amalgkit getfastq.

SRA_Apis_mellifera.xml.zip

(screenshot of the SRA XML showing the download links)

Hego-CCTB commented 3 years ago

Thanks for the tip! I could successfully add all three links to the metadata.tsv with this code:

    # Pull the per-organization URLs out of each run's SRAFile/Alternatives entries.
    items.append(["NCBI_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="NCBI"]/@url')])
    items.append(["AWS_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="AWS"]/@url')])
    items.append(["GCP_Link", entry.xpath('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]/Alternatives[@org="GCP"]/@url')])
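
For reference, a self-contained version of the same extraction (a sketch assuming an XML dump structured like the attached SRA_Apis_mellifera.xml; the file name and the EXPERIMENT_PACKAGE element are taken from that attachment):

    from lxml import etree

    # Collect the per-organization download URLs for every run in the dump.
    tree = etree.parse('SRA_Apis_mellifera.xml')
    for entry in tree.xpath('//EXPERIMENT_PACKAGE'):
        for org in ['NCBI', 'AWS', 'GCP']:
            xpath = ('./RUN_SET/RUN/SRAFiles/SRAFile[@supertype="Primary ETL"]'
                     '/Alternatives[@org="{}"]/@url').format(org)
            print(org, entry.xpath(xpath))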

For handling the download from within Python, I'm thinking about two options: urllib.request from the standard library, or calling out to wget.

I'd like to keep dependencies to a minimum, but there is a chance that urllib.request will break in the future. Do you have any thoughts on this?

kfuku52 commented 3 years ago

urllib.request itself will be OK as long as we avoid the legacy interface. urllib is part of Python's standard library, whereas some OS distributions don't ship wget, so why don't we prioritize urllib.request? The code chunk may look like this:

    import itertools

    # amazon, gcp, ncbi: per-run download links taken from the metadata table.
    urls = [amazon, gcp, ncbi]
    # Try every link with urllib first, then retry all of them with wget.
    methods = ['urllib', 'wget']
    for method, url in itertools.product(methods, urls):
        try:
            new_download_fun(url, method)
            print('happily downloaded!')
            break
        except Exception:
            print('gosh!')
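
new_download_fun does not exist yet; a minimal sketch of what it might look like (the output name is derived from the URL's last path component, which is the run ID for the AWS links above):

    import subprocess
    import urllib.request

    def new_download_fun(url, method):
        # Hypothetical helper for the loop above: fetch one .sra file with the
        # requested method. wget is only reached after the urllib attempt failed.
        out_path = url.rstrip('/').split('/')[-1] + '.sra'
        if method == 'urllib':
            urllib.request.urlretrieve(url, out_path)
        elif method == 'wget':
            subprocess.run(['wget', '-O', out_path, url], check=True)
        else:
            raise ValueError('unknown download method: {}'.format(method))
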
Hego-CCTB commented 3 years ago

Amalgkit version 0.5.2.0

- Download of .sra files is now possible via AWS, NCBI, or GCP, which indeed seems much faster than prefetch
- This can be done by setting any of the three flags --aws, --ncbi, --gcp to yes

https://github.com/kfuku52/amalgkit/commit/be63870a54d7acf47db87fa8216f585231c54d16

kfuku52 commented 3 years ago

Please use strtobool rather than str for the yes/no options, and the default should be yes for all three.
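
For reference, the usual argparse pattern would be something like this (a sketch; amalgkit's actual option wiring may differ):

    import argparse
    from distutils.util import strtobool

    parser = argparse.ArgumentParser()
    # strtobool maps 'yes'/'no' (and similar) to 1/0 and raises on anything else;
    # default=1 keeps all three sources enabled unless the user opts out.
    parser.add_argument('--aws', type=strtobool, default=1, help='try AWS download (yes/no)')
    parser.add_argument('--ncbi', type=strtobool, default=1, help='try NCBI download (yes/no)')
    parser.add_argument('--gcp', type=strtobool, default=1, help='try GCP download (yes/no)')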

Also, can you follow my pseudocode above? In your code, amalgkit will not try to reach NCBI or GCP when AWS is unavailable. Is there a reason you implemented it that way?

Hego-CCTB commented 3 years ago

Of course! This is a work-in-progress build and fairly bare-bones. The wget method is also not implemented yet. I need to do some testing to see which errors are thrown due to unavailability and which indicate actual issues with the download method.

kfuku52 commented 3 years ago

I'm not quite sure if we are on the same page. With the current implementation, could you tell me what you expect to happen if --aws, --ncbi, and --gcp are all yes but AWS is unreachable?

Hego-CCTB commented 3 years ago

In the current, simple implementation, getfastq will say, "You set multiple sources to yes, we will only try the first one (AWS)." When AWS is unavailable, it will say, "This service may be unavailable, please try a different one," and just terminate.

The next step would be to actually try all sources and, if everything fails, either terminate or fall back to standard prefetch.
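
The prefetch fallback itself can stay a plain subprocess call (a sketch, assuming prefetch is on PATH and reusing a run ID from above):

    import subprocess

    # Last resort after all direct links fail: let sra-tools handle the download.
    subprocess.run(['prefetch', 'SRR10303736'], check=True)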

Once this works with urllib.request, I'll additionally try all three sources with wget as well, just in case the failure is specific to the download method.

kfuku52 commented 3 years ago

Sounds good, thank you for clarifying.

Hego-CCTB commented 3 years ago

Amalgkit Version 0.5.2.1

- changed --aws, --ncbi, and --gcp to strtobool type (default no)
- now tries all sources set to yes and falls back to prefetch if none of them work (line 200)
- adjusted formatting of getfastq

https://github.com/kfuku52/amalgkit/commit/932fb286bd511e2f3b88e23924c580824d59c52c and https://github.com/kfuku52/amalgkit/commit/8f3fb6ce8740001aa215acba5965703e5a0e7994#diff-3c0a8df92992cf9efdc5d528c66dfd60a7c42499c72901d81a4f83346f76191c

I will now try to add wget download method as well.

Hego-CCTB commented 3 years ago

Amalgkit Version 0.5.2.2

- wget is now also supported
- wget is not required and will only be tried if urllib.request has already failed

https://github.com/kfuku52/amalgkit/commit/73feb7e1825e4df423b71eb470c0d746eff0164e