ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Specify amount of memory (RAM) to be used #233

Open · Sabryr opened this issue 3 years ago

Sabryr commented 3 years ago

Is it possible to specify the amount of memory (RAM) to be used, instead of having the program automatically detect the amount of RAM?

ruanjue commented 3 years ago

For a given dataset and parameters, the memory usage is fixed. The program detects the total RAM but does not make a tradeoff between RAM and runtime.

Sabryr commented 3 years ago

Thank you for the answer. I am setting up wtdbg2 on our HPC cluster. Processing is submitted as a job, and each job must specify how many resources it needs. For example:

    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=256G

However, when the job is submitted, wtdbg2 detects all the resources on the compute node and plans accordingly:

    -- total memory 3170070156.0 kB
    -- available 2944877052.0 kB
    -- 128 cores

(I found the -t option to limit the number of cores used, but, as you say, this is not possible for memory.)

A user has circumvented this by occupying a whole node with all its resources, which makes our monitoring scripts report enormous resource wastage. I am trying to find a solution, as your program seems to be the only realistic option for their PacBio reads.
I have tested with sample data and could not find a way to inform wtdbg2 about the job's resource limits.
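
For reference, this is roughly the batch script shape I am aiming for (the module-free layout and file names are placeholders; the wtdbg2 flags are the ones from my test runs below):

    #!/bin/bash
    #SBATCH --job-name=wtdbg2
    #SBATCH --cpus-per-task=8
    #SBATCH --mem-per-cpu=32G
    # Placeholder input; substitute the real read file.
    INPUT_FILE=reads.fa.gz
    # SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is set;
    # matching -t to it keeps wtdbg2 inside the job allocation.
    wtdbg2 -t ${SLURM_CPUS_PER_TASK} -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl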

In addition, wtdbg2 writes to disk very frequently. I see that this may be to avoid exceeding RAM limits. At the same time, on some of our nodes with about 3 TB of RAM, we would prefer that users do more of the work in RAM and access the disk less.

Could you help me set this up so I can help solve these limitations? I would gladly provide any assistance and also contribute the findings back.
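
(A shell-level workaround I am considering, not a wtdbg2 feature: cap the address space before launching so a run that outgrows its share fails fast instead of swamping the node; whether this helps depends on how the site enforces limits.)

    # ulimit -v takes KiB; this illustrative cap is 1 TiB and applies to
    # this shell and its children, so wtdbg2 aborts rather than exceeding it.
    ulimit -v $((1024 * 1024 * 1024))
    wtdbg2 -t 8 -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl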

ruanjue commented 3 years ago

Please ignore the message about RAM and cores; the only option affected is -t 0, which means use all cores. Otherwise, wtdbg2 runs the same regardless of how many resources you have. To avoid wtdbg2 writing too much information to your disk, you can add the option --minimal-output. During the development of wtdbg2, I tended to use more RAM, rather than disk, to speed it up.
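
For example (a sketch reusing the parameters from this thread):

    # -t 0 means use all detected cores; pass an explicit count instead,
    # and --minimal-output reduces what is written to disk.
    wtdbg2 -t 8 --minimal-output -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl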

Sabryr commented 3 years ago

Thank you very much, I will try this.

Sabryr commented 3 years ago

This is the comparison with and without --minimal-output.

Without --minimal-output:

    wtdbg2 -t 8 -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl

    JobName  AllocCPUS  Time      MaxDiskWrite  AveDiskWrite  MaxRSS
    wtdbg2   8          00:55:45  1557.05M      1557.05M      43671364K

With --minimal-output:

    wtdbg2 -t 8 --minimal-output -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl

    JobName  AllocCPUS  Time      MaxDiskWrite  AveDiskWrite  MaxRSS
    wtdbg2   8          01:21:43  1358.22M      1358.22M      43668708K

--minimal-output makes the run about 26 minutes slower (00:55:45 vs 01:21:43) with otherwise identical parameters, and writes about 200 MB less to disk on average.
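
(The columns above are SLURM accounting fields; a query along these lines reproduces them, with the job ID left as a placeholder and Elapsed shown as Time above:)

    # Query SLURM accounting for the finished job; <jobid> is a placeholder.
    sacct -j <jobid> --format=JobName,AllocCPUS,Elapsed,MaxDiskWrite,AveDiskWrite,MaxRSS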

I forked your repo; any recommendations on how I can test changing the disk-write frequency?

ruanjue commented 3 years ago

Thanks for the information. With --minimal-output, wtdbg2 writes the core compressed results to disk only once.

Sabryr commented 3 years ago

With --minimal-output, processing becomes slower for reasons I do not understand, so it is not giving me the outcome I was expecting, which is to do more of the work in memory and write to disk at the end. So my question is: if I avoid --minimal-output, will wtdbg2 stop processing until a chunk is fully written to disk, or will processing continue while the data is being written? On my end, CPU usage drops when I/O is high. Here - https://github.com/ruanjue/wtdbg2/blob/b77c5657c8095412317e4a20fe3668f5bde6b1ac/filewriter.h - I see that you have implemented parallel writing, but do you have any idea about my observation above?
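
(How I plan to check whether computation stalls during writes; this assumes the sysstat tools are installed and a single wtdbg2 process is running:)

    # Sample per-process CPU (-u) and disk (-d) activity every 5 seconds;
    # if %CPU drops whenever kB_wr/s spikes, writes are likely blocking compute.
    pidstat -u -d -p $(pgrep -x wtdbg2) 5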

ruanjue commented 3 years ago

Please have a look at the usage in wtdbg2 --help:

 --minimal-output
   Will generate as less output files (<--prefix>.*) as it can

Sabryr commented 3 years ago

I was able to make optimal use of our HPC setup with your software on a sample of axolotl data; thank you for that help. However, now when handling the real genome with -x sq -X 80 -g 7.5g -L 5000 and an input size of 1.7 TB, it is going to take about 80 days on a single node. So I was wondering whether wtdbg2 can use multiple nodes (MPI)?

ruanjue commented 3 years ago

Try -x rs -X 50 -g 7.5g for a huge genome.
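
(Combined with the other parameters from earlier in the thread, that would be roughly as follows; the thread count and file names are placeholders:)

    wtdbg2 -t ${SLURM_CPUS_PER_TASK:-8} -x rs -X 50 -g 7.5g -L 5000 -i ${INPUT_FILE} -fo real_genome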