harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Add option to download from ENA #48

Closed tsackton closed 2 years ago

tsackton commented 2 years ago

Some datasets are not available from SRA, but at least based on preliminary spot checking, are available from ENA. While this could be handled with manual downloading and local fastqs, to improve robustness and reliability the best solution is to add an option to download from ENA.

To do this, we will use the Java ENA downloader command line application, and modify the get_fastq_pe rule to try ENA if the NCBI download returns a non-zero exit code.

Currently working on this issue in the dev branch.

tsackton commented 2 years ago

Update: the Java downloader is not maintained, and the Python scripts throw a malformed URL error that has been an open GitHub issue for over a month without being addressed. So we are going to construct our own wget statement instead.

Please report use cases where the current URL construction breaks, as it appears that the current ENA FTP structure may not be entirely predictable based on accession.

tsackton commented 2 years ago

Tentatively have a working solution in dev. Some issues/questions:

For speed/efficiency/code simplicity, I have refactored the get_fastq_pe rule to:

  1. Attempt to download the .sra file from NCBI using prefetch
  2. If that fails, attempt to download the .sra file from ENA using wget
  3. Convert the .sra to fastq using fasterq-dump
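The three steps above can be sketched as a shell function (a rough sketch only: the actual snpArcher rule may differ, and the ENA URL layout for mirrored .sra files shown here is an assumption):

```shell
# fetch_run: download one SRA run, falling back to ENA if the NCBI
# download fails, then convert to paired fastq. Sketch only; the ENA
# .sra mirror URL below is an assumed layout, not a documented one.
fetch_run() {
    local acc="$1" threads="${2:-4}"
    # 1. try NCBI first; prefetch writes $acc/$acc.sra under -O DIR
    if ! prefetch -O . "$acc"; then
        # 2. fall back to ENA's mirror of the .sra file
        mkdir -p "$acc"
        wget -O "$acc/$acc.sra" \
            "https://ftp.sra.ebi.ac.uk/vol1/srr/${acc:0:6}/$acc"
    fi
    # 3. convert; fasterq-dump is the only multi-threaded step
    fasterq-dump --threads "$threads" --split-files "$acc"
}
```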

There are a few quirks to this procedure, however:

  1. Although prefetch has an option to download to a directory other than the current working directory, in my initial testing fasterq-dump does not handle .sra files that are outside the current working directory. In theory there should be a way to do this, but I have not attempted to solve it. As a consequence, the pipeline will now leave a number of temporary .sra files in the working directory while snakemake is running; these are deleted once fasterq-dump completes successfully. Thoughts on how annoying this is?
  2. This may be marginally less CPU efficient, as prefetch/wget are not multi-threaded, while fasterq-dump is. But I suspect that fasterq-dump is just doing prefetch under the hood anyway, so it shouldn't matter.
  3. On the flip side, this may be marginally more robust to network issues, as prefetch can restart failed downloads.
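If quirk 1 proves annoying, one possible workaround (untested; the function name and paths are hypothetical, not snpArcher code) is to run fasterq-dump from inside the prefetch output directory, so the temporary .sra files stay out of the workflow's top-level working directory:

```shell
# Hypothetical workaround for quirk 1: keep .sra downloads in a scratch
# directory and run fasterq-dump from inside it via a subshell, so the
# workflow's working directory stays clean.
convert_in_place() {
    local acc="$1" dir="$2"
    prefetch -O "$dir" "$acc"                 # .sra lands under $dir
    # the subshell's cd confines fasterq-dump's temp files to $dir
    ( cd "$dir" && fasterq-dump --split-files "$acc" )
}
```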

I will leave this up for comment/discussion/further testing for a bit, and will submit a pull request once we are all satisfied and I have had a chance to test a few species runs with this new code.

tsackton commented 2 years ago

Adding another note: so far I have found a few cases where ENA does not have .sra files, but does appear to have fastq.gz files. Could add an option to try there...

ERR2697483 is an example of this case.

Editing to add: the issue with my current solution to this is that gzip and wget are not parallelized, so we waste CPU resources during those phases of the job; only fasterq-dump takes advantage of the 10 allocated CPUs. One option here would be to replace gzip with pigz.

The wget command to download an .sra file is not parallelizable, as it is just one file, but downloading the fastq.gz pair could be parallelized in various ways, probably most simply with GNU parallel or xargs.
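A hedged sketch of what that fastq.gz fallback could look like: the directory layout below is inferred from examples like ERR2697483 and may not hold for every accession, and the helper name is made up for illustration. The commented xargs line shows how both mates could be fetched concurrently; pigz could similarly stand in for gzip wherever recompression is needed.

```shell
# Build candidate ENA fastq.gz URLs for a run accession. Layout inferred
# from ENA examples (e.g. ERR2697483 -> vol1/fastq/ERR269/003/...);
# not guaranteed for every accession.
ena_fastq_urls() {
    local acc="$1" base="https://ftp.sra.ebi.ac.uk/vol1/fastq"
    local prefix="${acc:0:6}" sub=""
    case "${#acc}" in
        10) sub="/00${acc:9:1}" ;;   # 1 trailing digit, zero-padded
        11) sub="/0${acc:9:2}" ;;    # 2 trailing digits
        12) sub="/${acc:9:3}" ;;     # 3 trailing digits
    esac                              # 9-char accessions: no subdir
    printf '%s\n' \
        "$base/$prefix$sub/$acc/${acc}_1.fastq.gz" \
        "$base/$prefix$sub/$acc/${acc}_2.fastq.gz"
}

# Fetch both mates concurrently, one wget per URL:
#   ena_fastq_urls ERR2697483 | xargs -n1 -P2 wget -q
```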