Closed by tsackton 2 years ago
Update: the Java downloader is not maintained, and the Python scripts throw a malformed URL error that has been an open GitHub issue for over a month without being addressed. So we are going to construct our own wget statement.
Please report use cases where the current URL construction breaks, as it appears that the current ENA FTP structure may not be entirely predictable based on accession.
Tentatively have a working solution in dev. Some issues/questions:
For speed/efficiency/code simplicity, I have refactored the get_fastq_pe rule to:
There are a few quirks to this procedure, however:
I will leave this up for comment/discussion/further testing for a bit, and will submit pull request once we are all satisfied and I have had a chance to test a few species runs with this new code.
Adding another note: so far I have found a few cases where ENA does not have .sra files but does appear to have fastq.gz files. We could add an option to try those as well...
ERR2697483 is an example of this case.
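For the fastq.gz case, here is a minimal sketch of building download URLs from a run accession. It assumes ENA's commonly documented `vol1/fastq` layout: the first six characters of the accession, then (for accessions longer than nine characters) a zero-padded subdirectory built from the trailing characters, then the accession itself. The helper name and the paired-end `_1`/`_2` suffixes are assumptions for illustration, not the code in dev, and the layout may not hold for every accession.

```python
from __future__ import annotations


def ena_fastq_urls(accession: str, paired: bool = True) -> list[str]:
    """Guess ENA fastq.gz URLs for a run accession (hypothetical helper)."""
    base = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    prefix = accession[:6]
    # Accessions longer than 9 characters get an extra directory built from
    # the characters after position 9, left-padded with zeros to width 3.
    extra = accession[9:]
    subdir = f"{extra.zfill(3)}/" if extra else ""
    run_dir = f"{base}/{prefix}/{subdir}{accession}"
    if paired:
        # Assumed naming for paired-end runs: <accession>_1 and <accession>_2.
        return [f"{run_dir}/{accession}_{mate}.fastq.gz" for mate in (1, 2)]
    return [f"{run_dir}/{accession}.fastq.gz"]


urls = ena_fastq_urls("ERR2697483")
```

Whether the files actually exist at the guessed location still has to be checked at download time, which is why a non-zero wget exit code should be treated as "try another source" rather than as a hard failure.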
Editing to add: the issue with my current solution is that gzip and wget are not parallelized, so we waste CPU resources during these phases of the job. Only fasterq-dump takes advantage of the 10 CPUs. One option would be to replace gzip with pigz.
The wget command to download an .sra file is not parallelizable, as it is just one file, but downloading fastq.gz files could be parallelized in various ways, probably most simply with GNU parallel or xargs.
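As an alternative to GNU parallel or xargs, the same idea can be sketched with a Python thread pool, which might be convenient if the download ends up in a `run:` block or helper script. This is only a sketch: `wget_fetch` and `fetch_all` are hypothetical names, and the `fetch` argument is injectable so the scheduling logic can be exercised without touching the network.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def wget_fetch(url, outdir):
    """Download one URL with wget; raises CalledProcessError on failure."""
    dest = str(Path(outdir) / url.rsplit("/", 1)[-1])
    # -q: quiet, -O: write to the given output file.
    subprocess.run(["wget", "-q", "-O", dest, url], check=True)
    return dest


def fetch_all(urls, outdir=".", fetch=wget_fetch, max_workers=2):
    """Call fetch(url, outdir) for every URL concurrently.

    Results come back in the same order as the input URLs; the first
    exception raised by any download propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch(u, outdir), urls))
```

For a paired-end run there are only two files, so two workers is the natural ceiling; the remaining CPUs would still sit idle during the download itself.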
Some datasets are not available from SRA, but at least based on preliminary spot checking, are available from ENA. While this could be handled with manual downloading and local fastqs, to improve robustness and reliability the best solution is to add an option to download from ENA.
To do this, we will use the Java ENA downloader command line application, and modify the get_fastq_pe rule to try ENA if the NCBI download returns a non-zero exit code.
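The fallback logic described here (try NCBI, then try ENA on a non-zero exit code) can be sketched as follows. The two commands below are placeholders that simply exit with a fixed status; the real NCBI and ENA download commands are not shown in this thread, so none are assumed. `run_with_fallback` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_fallback(commands):
    """Run each command in order; return the index of the first that exits 0."""
    for i, cmd in enumerate(commands):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return i
    raise RuntimeError("all download sources failed")


# Placeholder commands standing in for the real downloads (assumptions):
# the "NCBI" step fails with exit code 1, the "ENA" step succeeds.
ncbi_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
ena_cmd = [sys.executable, "-c", "raise SystemExit(0)"]

source = run_with_fallback([ncbi_cmd, ena_cmd])  # index 1: the ENA placeholder
```

Returning the index of the source that succeeded makes it easy to log which archive each accession actually came from.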
Currently working on this issue in the dev branch.