replikation / What_the_Phage

WtP: Phage identification via nextflow and docker or singularity
https://mult1fractal.github.io/wtp-documentation/
GNU General Public License v3.0

Evaluation of large input files #45

Closed (hoelzer closed this issue 3 years ago)

hoelzer commented 4 years ago

This issue documents the behavior of WtP for large input files. Based on this, @replikation might implement FASTA chunking to speed up the pipeline.
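To make the FASTA chunking idea concrete, here is a minimal Python sketch. It is not how WtP does it; the function names `read_fasta` and `chunk_fasta`, the output file naming, and the default of 10,000 records per chunk are all assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time (hypothetical helper)."""
    header = None
    seq_lines: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq_lines)
                header = line[1:]
                seq_lines = []
            elif line:
                seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def chunk_fasta(path: Path, out_dir: Path, records_per_chunk: int = 10000) -> List[Path]:
    """Split a FASTA file into files with at most records_per_chunk records each.

    The chunk size default is an arbitrary assumption, not a tested value.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    out = None
    for index, (header, seq) in enumerate(read_fasta(path)):
        # Start a new chunk file every records_per_chunk records.
        if index % records_per_chunk == 0:
            if out is not None:
                out.close()
            chunk_path = out_dir / f"chunk_{len(chunk_paths):05d}.fasta"
            chunk_paths.append(chunk_path)
            out = open(chunk_path, "w")
        out.write(f">{header}\n{seq}\n")
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk could then be handed to a tool as an independent job. Inside a Nextflow pipeline, the built-in `splitFasta` operator may be a more natural fit than an external script.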

case 1, aquadiva sample

bsub -n 4 -M 8.0G -R "rusage[mem=8.0G]" "nextflow run phage.nf --fasta /homes/mhoelzer/data/calc/aquadiva_kaiju/spades/H14_0_2_1/scaffolds.fasta --output /homes/mhoelzer/data/calc/aquadiva_kaiju/wtp/H14_0_2_1 -profile ebi --mp -resume"

(excluded metaphinder because of a previous issue)

started: Dec 31 12:50

Tools completed

The job was aborted by the cluster after 2.5 days for an unclear reason. No stats for deepvirfinder and marvel.

hoelzer commented 4 years ago
[Screenshot, 2020-01-08 10:06:50]

Large input files (500 MB to 1 GB) work with virsorter and pprmeta. I will test the other tools.

However, the r_plot process takes a long time and for some files does not even seem to terminate. Besides, the visualization is not really useful for large input sets, so I will deactivate it in a separate branch for my test runs.

hoelzer commented 4 years ago

Update: Virfinder finished after 18h for one of the large input FASTA files (~500 MB).

Now testing Marvel

replikation commented 4 years ago

so you dropped all 29 metagenomes with > 1 million contigs (for each sample) on it? :D okay, interesting... I'll try out a few things and report back

hoelzer commented 4 years ago

yeah... I thought the EBI cluster is huge so just go for it WtP! :D

At the moment I am just running one sample with the -resume option, adding more and more tools (currently Marvel is running).

replikation commented 4 years ago

marvel is super difficult to implement here, as it analyses "bins" by default, so i need to split each contig into a separate fasta file, and you have 1-2 million contigs per file
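For illustration, the per-contig split described above could look roughly like this in Python. This is only a sketch and not the code WtP uses; the function name `split_per_contig` and the file naming scheme are assumptions. With 1-2 million contigs it would create 1-2 million files in one directory, which shows why this step is a bottleneck.

```python
import re
from pathlib import Path
from typing import List


def split_per_contig(fasta: Path, out_dir: Path) -> List[Path]:
    """Write every record of a multi-FASTA file to its own file (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    out = None
    with open(fasta) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Use the first word of the header as contig ID and
                # replace characters that are unsafe in file names.
                contig_id = line[1:].split()[0] if line[1:].strip() else "unnamed"
                safe_id = re.sub(r"[^A-Za-z0-9_.-]", "_", contig_id)
                # A running index keeps names unique even if IDs collide.
                path = out_dir / f"{len(written):07d}_{safe_id}.fasta"
                written.append(path)
                out = open(path, "w")
            # Lines before the first header are ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```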

hoelzer commented 4 years ago

uff I see. Maybe skip Marvel if too many contigs are provided? I mean, it's just due to how Marvel is implemented and not really an issue of WtP

replikation commented 4 years ago

yep i was thinking about an "autoconfig" depending on the "assemblystats" of the input, e.g. too many contigs -> deactivate tool x and y; contig too large -> deactivate deepvirfinder, etc.

hoelzer commented 4 years ago

I think that is a good idea, and it should report back to the user what was deactivated and why.
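A rough Python sketch of such an "autoconfig" step, including the report of what was deactivated and why, might look like the following. The threshold values, the names `AssemblyStats`, `assembly_stats` and `autoconfig`, and the mapping of tools to limits are all hypothetical; real limits would have to come from benchmarking each tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AssemblyStats:
    contig_count: int
    longest_contig: int  # length in bp


def assembly_stats(fasta_path: str) -> AssemblyStats:
    """Count contigs and find the longest one in a single pass over the file."""
    count = 0
    longest = 0
    current = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
                longest = max(longest, current)
                current = 0
            else:
                current += len(line.strip())
    longest = max(longest, current)
    return AssemblyStats(contig_count=count, longest_contig=longest)


# Hypothetical limits, for illustration only.
MAX_CONTIGS = {"marvel": 100_000}
MAX_CONTIG_LENGTH = {"deepvirfinder": 2_000_000}


def autoconfig(
    stats: AssemblyStats,
    tools: List[str],
    max_contigs: Optional[Dict[str, int]] = None,
    max_length: Optional[Dict[str, int]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the tools to run and a {tool: reason} map for deactivated tools."""
    max_contigs = MAX_CONTIGS if max_contigs is None else max_contigs
    max_length = MAX_CONTIG_LENGTH if max_length is None else max_length
    active: List[str] = []
    deactivated: Dict[str, str] = {}
    for tool in tools:
        contig_limit = max_contigs.get(tool)
        length_limit = max_length.get(tool)
        if contig_limit is not None and stats.contig_count > contig_limit:
            deactivated[tool] = (
                f"{stats.contig_count} contigs exceed the limit of {contig_limit}"
            )
        elif length_limit is not None and stats.longest_contig > length_limit:
            deactivated[tool] = (
                f"longest contig ({stats.longest_contig} bp) exceeds "
                f"the limit of {length_limit} bp"
            )
        else:
            active.append(tool)
    return active, deactivated
```

The `deactivated` map can be printed at the start of a run so the user sees which tools were skipped and the reason for each.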

replikation commented 4 years ago

the information in this issue is relevant for #47