rpetit3 / fastq-dl

Download FASTQ files from SRA or ENA repositories.
MIT License
268 stars 24 forks

Perl version control #25

Open pooser opened 4 months ago

pooser commented 4 months ago

The current version of Perl associated with fastq-dl is 5.22.0, which conflicts with the Perl version of the current Bactopia release (v3.0.0), which is 5.26.0. I ran into this conflict when loading Bactopia and fastq-dl as module files. Food for thought. TIA!

rpetit3 commented 4 months ago

Hi @pooser

Thank you for reporting! Quick question: were Bactopia and fastq-dl installed into the same Conda environment?

Haha, I also might need to look at the dependencies again, because it's a little funny that Perl is causing issues with two mostly Python-based tools!

EDIT

fastq-dl -> sra-tools brings in Perl

bactopia -> v3 doesn't seem to pull in Perl

pooser commented 4 months ago

Greetings @rpetit3 !

As per the no docker/singularity install instructions for Bactopia (yeah, yeah, I know...I'm a glutton for punishment, ha!) one should install Miniforge3 and then leverage its conda environment to then install Bactopia. I did exactly that and then realized I needed your nifty tool to download thousands of sequence data sets.

To manage this environment, I utilized module files, and my default was to have Miniforge3, Bactopia, and fastq-dl loaded simultaneously. Using fastq-dl to fetch the SRA data went fine; however, once I pointed Bactopia at the data, it crashed with `Perl v5.26.0 required--this is only v5.22.0`.

With all three modules loaded, I find:

    which perl -> Miniforge3/envs/bactopia/envs/fastq-dl/bin/perl

With only Miniforge and Bactopia loaded, I find the expected:

    which perl -> /usr/bin/perl

Ergo, I concur that fastq-dl brings in Perl while Bactopia does not.

This is not a big issue, of course; it's just that, as a total Bactopia newb, I thought others may have run into this as well.
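The `which perl` check above only shows the first match. A minimal sketch that lists every `perl` on `PATH` in lookup order, with the version each one reports, can make this kind of shadowing easier to spot. The `list_perls` helper is hypothetical and not part of fastq-dl or Bactopia:

```shell
# Hypothetical helper: print every perl found on PATH, in lookup order,
# followed by the version it reports. The first line is the interpreter
# that a bare `perl` call will run.
list_perls() (
    # The function body is a subshell, so these settings do not leak out.
    IFS=:      # split PATH on colons
    set -f     # do not glob-expand PATH entries
    for dir in $PATH; do
        [ -x "$dir/perl" ] || continue
        printf '%s\t%s\n' "$dir/perl" "$("$dir/perl" -e 'print $^V')"
    done
)

list_perls
```

If the first line points into a Conda environment rather than `/usr/bin/perl`, that environment's Perl is the one other tools will pick up.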

rpetit3 commented 4 months ago

Ah, I see now haha you are a glutton for punishment.

A few options here.

  1. Use two separate environments

    conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
    conda activate fastq-dl
    ... download your samples ...

    conda deactivate
    conda create -n bactopia -c conda-forge -c bioconda bactopia
    conda activate bactopia
    ... process your samples ...

  2. Use the `--accessions` parameter in Bactopia.
With this you can provide all the Experiment accessions for your samples and Bactopia will handle the downloading (via fastq-dl). You can make use of `bactopia search` to help with this.

Here are some links:

https://bactopia.github.io/latest/tutorial/#multiple-samples
https://bactopia.github.io/latest/beginners-guide/#accessions

Let me know if this helps. If not, please don't hesitate to let me know! Haha, we'll get this figured out.
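
For option 1, switching environments by hand can also be scripted with `conda run`, which executes a command inside a named environment without activating it in the calling shell. A minimal sketch, assuming the environments are named `fastq-dl` and `bactopia` as above; the `run_in_env` helper and the `CONDA` override are hypothetical conveniences, not part of either tool:

```shell
# Hypothetical helper: run a command inside a named Conda environment
# without modifying the PATH of the calling shell.
# Set CONDA=echo to preview the command instead of running it.
run_in_env() {
    env_name=$1
    shift
    "${CONDA:-conda}" run -n "$env_name" "$@"
}

# Example usage (needs conda and both environments to exist):
#   run_in_env fastq-dl fastq-dl --help
#   run_in_env bactopia bactopia --help
```

Because each tool only sees its own environment while it runs, the Perl from the fastq-dl environment should never end up ahead of the system Perl when Bactopia starts.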

pooser commented 4 months ago

  1. Thank you for the suggestions. I was effectively doing the same thing by adding and removing fastq-dl relative to the respective stage of the pipeline.

  2. What I am particularly interested in is obtaining large amounts of sequence data files, not to analyze them with Bactopia but instead to train AI models to generate synthetic sequence data specific to genus and species. To do this, I have been using `bactopia search` to generate the accession list, which I then parse and feed to fastq-dl to fetch the data.

Does Bactopia have a built-in mechanism to both search and retrieve the data whilst not executing the analysis? I am open to any suggestions you might have here. FWIW, I am not a biologist/bioinformatician and am instead simply treating this as an advanced data processing problem.
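
The search-then-download workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the accession list is a plain text file with one accession in the first column of each line, and fastq-dl accepts `--accession` and `--outdir` (worth confirming with `fastq-dl --help` for the installed version). The `fetch_accessions` helper and the `FASTQ_DL` override are hypothetical:

```shell
# Hypothetical helper: download every accession listed in a file.
# Assumptions: one accession per line in the first whitespace-separated
# column; blank lines and lines starting with '#' are skipped.
# Set FASTQ_DL=echo for a dry run that only prints the arguments.
fetch_accessions() {
    list=$1
    outdir=$2
    # The `|| [ -n "$acc" ]` part keeps a final line without a newline.
    while read -r acc _rest || [ -n "$acc" ]; do
        case $acc in
            ''|'#'*) continue ;;
        esac
        # </dev/null stops the download command from consuming the list.
        "${FASTQ_DL:-fastq-dl}" --accession "$acc" --outdir "$outdir" </dev/null \
            || printf 'failed: %s\n' "$acc" >&2
    done < "$list"
}

# Example usage (accessions.txt and fastqs/ are placeholder names):
#   fetch_accessions accessions.txt fastqs
```

Failures are reported on stderr and the loop moves on, so one bad accession does not stop a run over thousands of samples.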

rpetit3 commented 4 months ago

Oh, this is very interesting.

There isn't a mechanism to directly do this, but I imagine you could indirectly do it by setting `--min_basepairs` or `--min_reads` to something unrealistically high. This would cause each sample to fail the gather step in Bactopia, after the download but before any analysis runs. You could test with a single accession to see if it works as expected.