gjeunen / reference_database_creator

creating reference databases for amplicon sequencing

insilico_pcr and pga functions fail when running multiple jobs at once: an intermediate file conflict? #61

Open ThomasLuypaert opened 2 weeks ago

ThomasLuypaert commented 2 weeks ago

Hi @gjeunen,

I would like to use the insilico_pcr and pga functions to extract the amplicon regions for a large number of primers from the same database. I tried setting this up as SLURM jobs, but the jobs were failing.

This is the error message for one of the jobs that failed:

```
(base) [thlu@login ref_db_downloads]$ cat extract_ampseq_verts_deagle_error
100%|██████████| 1644979368/1644979368 [00:14<00:00, 115647037.50it/s]
100%|██████████| 29726034/29726034 [00:00<00:00, 32807761.06it/s]
100%|██████████| 1134958850/1134958850 [00:03<00:00, 340319973.96it/s]
100%|██████████| 232056/232056 [00:00<00:00, 16543664.42it/s]
 86%|████████▌ | 25805888/29958090 [00:04<00:00, 5813128.26it/s]
 99%|█████████▊| 1623994165/1644979368 [00:36<00:00, 43926387.21it/s]
100%|██████████| 1132228635/1132228635 [00:16<00:00, 70604909.90it/s]
vsearch v2.21.1_linux_x86_64, 188.7GB RAM, 32 cores
https://github.com/torognes/vsearch

Reading file Vertebrates_16S_deagle_db_merged.fasta 100%
25758905 nt in 374592 seqs, min 32, max 13886, avg 69
minseqlength 32: 2397 sequences discarded.
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 3509 of 54810 (6.40%)

0it [00:00, ?it/s]
```

I think the reason the jobs failed has something to do with the intermediate files created by these functions having fixed names (e.g. init_untrimmed.fasta, init_trimmed.fasta, CRABS_pga_info.txt, CRABS_pga.fasta, CRABS_pga_subset.fasta). When running multiple jobs simultaneously in the same folder, would these intermediate files not overwrite each other and interfere with the other jobs? Or, when one job finishes, would it clean up and remove the intermediate files, making them unavailable to the other jobs?

I apologize if I am overlooking something; I am relatively new to using the HPC. Generally, do you have any advice on running the insilico_pcr and pga functions on many different primer pairs at once? As far as I can tell, the functions only take a single forward and reverse primer. Could it be useful to implement a change that allows the functions to be supplied with a .txt file containing details on input datasets, primer pairs, and output names?

I appreciate any insights.

Cheers, Thomas

hughcross commented 2 weeks ago

Hi @ThomasLuypaert, Thanks for continuing to explore Crabs. As for running multiple jobs at once, I think your diagnosis is correct that the intermediate files are messing with each other. For the coming major update to Crabs, we might be able to add a timestamp or random number to each intermediate file so multiple primers could be analyzed at once. We will look into this.

As for the second part of your comment, I am a bit confused. The answer depends on whether the different primer pairs amplify overlapping gene regions. If they do not overlap, you should run these functions separately. If you have the case where you are amplifying the same region with different primers, that is a little trickier. For the insilico_pcr function, in the upcoming update we may be able to add an option for looking at multiple primers at once, but for now you would have to run the insilico_pcr function separately for each primer pair. I suppose you could then merge the insilico_pcr results and run a single pga function on the combined results. I am not sure if this would get everything, but if you set --filter_method to 'relaxed' this should get all regions.
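To make that concrete, here is a minimal sketch of the merge-then-pga route for two primer sets amplifying the same region. The primer sequences and file names are placeholders, and the flag names are taken from the CRABS documentation as I remember them, so double-check them against `crabs insilico_pcr --help` and `crabs pga --help`:

```bash
# In silico PCR separately for each primer set (placeholder primers).
crabs insilico_pcr --input db_merged.fasta --output set1_insilico.fasta \
    --fwd GACGAGAAGACCCTA --rev CGCTGTTATCCCTAG
crabs insilico_pcr --input db_merged.fasta --output set2_insilico.fasta \
    --fwd AGACGAGAAGACCCT --rev GCGCTGTTATCCCTA

# Combine the two amplicon sets into a single reference for pga.
cat set1_insilico.fasta set2_insilico.fasta > combined_insilico.fasta

# One pga run on the combined results; 'relaxed' keeps hits that do not
# span the full primer-to-primer region.
crabs pga --input db_merged.fasta --output combined_pga.fasta \
    --database combined_insilico.fasta --fwd GACGAGAAGACCCTA \
    --rev CGCTGTTATCCCTAG --speed medium --percid 0.95 \
    --coverage 0.95 --filter_method relaxed
```

Note that the merged file may contain the same accession twice when both primer sets amplify it, so a dereplication step afterwards may be worthwhile.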

Ideally, the final reference sequences for overlapping primer sets should all cover the longest combined primer region. For example, one primer set starts at position 105 and ends at position 305; another set starts at position 110 and ends at 310. For the reference sequences, you would want them to start at position 105 and end at 310, so the combined region is covered completely. I can think of ways to address this programmatically, but for now you can try the steps as outlined above and see how that goes.

Let us know how you progress, and if I misunderstood your question please clarify. It is a great suggestion to add options for covering multiple overlapping primers for the same region, as this is a recommended way of avoiding primer bias from any one set. It will take us some time to implement this, but in the meantime please keep posting and we will try to help.

Any thoughts, @gjeunen ? You may have a better answer.

ThomasLuypaert commented 1 week ago

Hi @hughcross,

Thanks for getting back to me so quickly!

I have run into similar issues with intermediate file conflicts when downloading multiple databases at once from NCBI, so that future implementation with timestamped intermediate files sounds great 👍.

My comment was not referring specifically to primers targeting overlapping regions, although the functionality you propose could be useful for what I am working on. I will give your suggestion to merge the insilico_pcr outputs before running pga a try.

My original comment was referring more to options for starting many different insilico_pcr and pga jobs for a large number of different primers using a single command, for instance by calling on an external file that contains all the required information. However, this is not really a necessity; for now I can just enter the information manually.

Not related to this thread, but I also noticed that the issue we discussed on a different thread (with the MitoFish download failing due to the absence of unzip) was not resolved by the new docker image you created. I am not sure if this is an issue specific to my computer, or whether the command still cannot reach unzip. Additionally, I discovered an issue with downloading data from the EMBL database which seems to be related to this issue. I tried the proposed solution (capitalizing the database query), but the database I download is still empty. Let me know if I should open a new issue to report this behavior.

Sorry for spamming you with issues; I am really enjoying exploring the full CRABS functionality.

Cheers, Thomas

hughcross commented 1 week ago

Thanks for the clarification, @ThomasLuypaert, I understand better what you are looking for. As for the other issues, I will comment on those in the other threads. Thanks for pointing those out. I neglected to check the MitoFish unzip step when I tested the new docker image. I will need to fix that.

I have not seen many examples of starting multiple unrelated jobs at once. Mostly what I see are pipeline applications that let you run multiple processes on a single dataset, for example running insilico_pcr, pga, assign_tax, etc. with one command. There are many options for these, such as Snakemake, Nextflow, etc. If you have access to a server that uses SLURM to schedule jobs (or a similar system), I believe you can start a job that starts multiple jobs. You might also use job arrays, which split a job into tasks and run them simultaneously. Typically arrays are used to split up a single job, but for running the same process on unrelated data that might work; see the sketch below. Arrays can be tricky, though.
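As a hedged sketch of the array route, assuming a tab-separated primers.txt with one primer set per line (name, forward primer, reverse primer; the file layout, paths, resource values, and crabs flag names are all placeholders to adapt): each task runs in its own directory, which also sidesteps the fixed-name intermediate files you ran into.

```bash
#!/bin/bash
#SBATCH --job-name=crabs_array
#SBATCH --array=1-10            # one task per line in primers.txt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Pull this task's primer set from the tab-separated file:
# name<TAB>forward_primer<TAB>reverse_primer
LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" primers.txt)
NAME=$(echo "$LINE" | cut -f1)
FWD=$(echo "$LINE" | cut -f2)
REV=$(echo "$LINE" | cut -f3)

# Work in a per-primer directory so the fixed-name intermediate files
# (init_untrimmed.fasta, CRABS_pga.fasta, ...) cannot collide.
mkdir -p "$NAME" && cd "$NAME" || exit 1

crabs insilico_pcr --input ../db_merged.fasta \
    --output "${NAME}_insilico.fasta" --fwd "$FWD" --rev "$REV"
crabs pga --input ../db_merged.fasta --output "${NAME}_pga.fasta" \
    --database "${NAME}_insilico.fasta" --fwd "$FWD" --rev "$REV" \
    --speed medium --percid 0.95 --coverage 0.95 --filter_method strict
```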

If you are comfortable with Docker, I suggest docker compose for what you want to do. I have been using this a lot and it is fairly straightforward and easy to modify. You create a file in YAML format that describes each docker image and what you want it to do, and then you run the docker compose command with the name of the file, for example:

```
docker compose --file cutadapt_merge_compose_f230.yaml up
```

where cutadapt_merge_compose_f230.yaml is the name of the file.

Typically this is used for a pipeline. I have not used it for running crabs, but it would not be hard to set up; if you are interested I can put an example on this repo. For each docker command, you can specify the order and which commands need to complete before it starts. What I have noticed is that compose will start commands at the same time as long as their prerequisites have completed. As an example from my work, I have a compose file that first runs cutadapt to trim metabarcoding sequences and, when that finishes, starts to merge the paired ends using PEAR. In the same file is a command to run FastQC on the trimmed sequences, which also needs cutadapt to complete. What happens is that once the cutadapt command completes, both the merging and the FastQC start at the same time and run simultaneously (I think computer guys call this 'asynchronous').

This could work for your purposes. A single file could include commands to run insilico_pcr, pga, and assign_tax on different markers at the same time, as long as the pga command for 18S knows that it has to wait for insilico_pcr on 18S to finish, the pga for COI knows it has to wait for its insilico_pcr to finish, and so on (see the sketch below). I have not tried anything like this, just pipelines on a single dataset, but it should work. The good thing is that all the markers would be running at the same time, so it would finish much faster than a for loop running through each marker one at a time (which would be simpler, but take longer). The important thing to remember is that all these commands running at the same time can use a lot of resources, so you have to make sure you allocate enough RAM and CPUs. If you are running these on a laptop, I wouldn't try it.
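Here is a sketch of what such a compose file could look like for one marker. The crabs image name, mounted paths, and primers are all placeholder assumptions, and depends_on with service_completed_successfully needs a reasonably recent docker compose; repeat the pair of services for each additional marker:

```bash
# Write a compose file in which the pga service waits for its own
# insilico_pcr service, then start everything with one command.
mkdir -p 16S    # per-marker working folder on the host
cat > crabs_compose.yaml <<'EOF'
services:
  insilico_16s:
    image: crabs:latest              # placeholder image name
    volumes: ["./:/data"]
    working_dir: /data/16S           # separate folder per marker
    command: >
      crabs insilico_pcr --input /data/db_merged.fasta
      --output 16S_insilico.fasta --fwd PRIMER_F --rev PRIMER_R
  pga_16s:
    image: crabs:latest
    volumes: ["./:/data"]
    working_dir: /data/16S
    command: >
      crabs pga --input /data/db_merged.fasta --output 16S_pga.fasta
      --database 16S_insilico.fasta --fwd PRIMER_F --rev PRIMER_R
      --speed medium --percid 0.95 --coverage 0.95 --filter_method strict
    depends_on:
      insilico_16s:
        condition: service_completed_successfully
EOF

docker compose --file crabs_compose.yaml up
```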

Oh, and back to your temp files problem: you can run these different docker commands in different folders, so temporary files with the same name would not overwrite each other.
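For example (image name, entrypoint, and paths are again placeholders), giving each run its own mounted working folder keeps the fixed-name temp files apart:

```bash
# Each container writes its intermediate files into its own host folder.
mkdir -p 16S COI
docker run -v "$PWD/db.fasta:/db.fasta" -v "$PWD/16S:/work" -w /work \
    crabs:latest crabs insilico_pcr --input /db.fasta \
    --output 16S_insilico.fasta --fwd PRIMER_F --rev PRIMER_R &
docker run -v "$PWD/db.fasta:/db.fasta" -v "$PWD/COI:/work" -w /work \
    crabs:latest crabs insilico_pcr --input /db.fasta \
    --output COI_insilico.fasta --fwd PRIMER_F --rev PRIMER_R &
wait    # wait for both background runs to finish
```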

I hope this helps. Thanks for the questions.

Cheers,

Hugh