nf-core / hlatyping

Precision HLA typing from next-generation sequencing data
https://nf-co.re/hlatyping
MIT License

Make the pipeline run with different file sizes #108

Open szymonwieloch opened 3 years ago

szymonwieloch commented 3 years ago

Hi! I have a problem with running this pipeline. It seems to choose memory requirements incorrectly for the input files, which is especially problematic with very big ones. My biggest file during tests was 16 GB, but in the future we may have much bigger ones. A file of that size requires 256 GB of memory for the run_optitype process.

My issue is that, by default, the hlatyping pipeline does not let you handle such big files. The only workaround that I found was creating an additional configuration file, extra.config, and passing it to Nextflow with the -c parameter to override the default configuration (a sketch of that file is shown under point 1 below). My expectation is that the pipeline should allow you to process your data using only command-line parameters. That didn't work, for two reasons:

1. Problems with setting maxRetries

For some strange reason, increasing retries with -process.maxRetries 5 didn't work and the default value of 1 was used. When I set maxRetries = 5 in the extra.config file instead, I saw only 2 retries. All failing processes finished with exit code 137 and should have been retried 5 times with increasing memory. I am not sure whether this is a problem with this pipeline or with Nextflow, but I haven't experienced such problems with other pipelines.
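
For reference, this is roughly what the command-line attempt and my extra.config workaround looked like. The withName selector, the 256.GB value and the errorStrategy line reflect my setup and my assumptions about what is needed for retries to trigger at all; the pipeline's base config may already handle errorStrategy based on the exit code.

// command-line attempt (did not take effect, the default of 1 was used):
// nextflow run nf-core/hlatyping -process.maxRetries 5 ...

// extra.config, passed with -c (only 2 retries were observed):
process {
    errorStrategy = 'retry'
    maxRetries    = 5
    withName: run_optitype {
        memory = 256.GB
    }
}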

2. Slow memory adaptation mechanism

The current memory adaptation mechanism is extremely slow:

memory = { check_max( 7.GB * task.attempt, 'memory' ) }

Reaching the required 256 GB of RAM for my samples would take 37 retries; processing a 50 GB sample file would take around 116. There are two good approaches to fix that:

A. Switch to an exponential adaptation algorithm:

memory = { 8.GB * (2 ** (task.attempt - 1)) }

This would require only 6 retries for a 16 GB file and 8 retries for a 50 GB file, and it wouldn't cause a huge resource overhead.
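
Plugged into the pipeline's resource configuration, this could look roughly as follows; I kept check_max so that the --max_memory cap still applies, as in the existing directive above (** is Groovy's power operator):

memory = { check_max( 8.GB * (2 ** (task.attempt - 1)), 'memory' ) }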

B. Calculate the memory requirement from the input file size.

The task object should give you access to the input files. This allows you to check the sample size and calculate the amount of required memory. I suspect that there is a linear relation between the input file size and the actual memory requirement, so a simple linear equation should give you a precise amount of memory for a given sample. This approach requires more work (measuring real memory usage for several samples and checking the actual relationship), but eventually no retries would be needed.
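
A rough sketch of what I have in mind is below. I am assuming the process input is a single file bound to a variable called reads (so reads.size() returns its size in bytes), and that peak RAM scales roughly 16x with input size (256 GB for my 16 GB file); the real factor and any base offset would have to be measured, and ideally the result would still be capped via check_max / --max_memory.

process run_optitype {

    // hypothetical linear model: ~16 GB of RAM per GB of input, with a 7 GB floor for small files
    memory { 1.GB * Math.max( 7, (long) Math.ceil( 16 * reads.size() / 1e9 ) ) }

    ...
}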

christopher-mohr commented 3 years ago

Hi @szymonwieloch, thanks for reporting and providing detailed information on this. We will check this and get back to you.