Sabryr opened this issue 3 years ago
Given the data and parameters, the memory usage is fixed. The program detects total RAM but does not trade RAM off against runtime.
Thank you for the answer. I am setting up wtdbg2 on our HPC cluster. The processing is submitted as a job, and each job has to specify how many resources it needs, for example:
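To illustrate what I mean, here is a minimal job header of the kind we require (a sketch assuming a SLURM scheduler; the numbers and file names are placeholders, not our actual request):

```bash
#!/bin/bash
#SBATCH --job-name=wtdbg2        # job name used in accounting
#SBATCH --cpus-per-task=8        # CPU cores requested for this job
#SBATCH --mem=64G                # RAM requested for this job
#SBATCH --time=24:00:00          # wall-clock limit

# the assembler is expected to stay within the resources requested above
wtdbg2 -t "${SLURM_CPUS_PER_TASK}" -i "${INPUT_FILE}" -fo out
```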
However, when the job is submitted, wtdbg2 detects all the resources on the compute node and plans accordingly:

-- total memory 3170070156.0 kB
-- available 2944877052.0 kB
-- 128 cores

(I found the `-t` option to limit the number of cores used, but as you say this is not possible for memory.)
A user has circumvented this by reserving the whole node with all of its resources, which results in our monitoring scripts reporting enormous resource wastage. I am trying to find a solution for this, as your program seems to be the only realistic option for his PacBio reads.
I have tested with sample data and could not find a way to inform wtdbg2 about the job's resource limits.
In addition, wtdbg2 writes to disk very frequently. I see that this may be to avoid exceeding RAM limits, but on some of our nodes with about 3 TB of RAM we would prefer the user to do more of the work in RAM and access the disk less.
Could you help me set this up so I can help solve these limitations? I would gladly provide any assistance and also contribute the findings back.
Please ignore the message about RAM and cores; the only option affected is `-t 0`, which means use all cores. Otherwise wtdbg2 runs the same way regardless of how many resources you have. To keep wtdbg2 from writing too much information to your disk, you can add the option `--minimal-output`. During the development of wtdbg2, I tended to use more RAM to speed it up instead of disk.
Thank you very much, I will try this.
Here is the comparison between running with `--minimal-output` and without it:
Without `--minimal-output`: `wtdbg2 -t 8 -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl`

| JobName | AllocCPUS | Time | MaxDiskWrite | AveDiskWrite | MaxRSS |
|---|---|---|---|---|---|
| wtdbg2 | 8 | 00:55:45 | 1557.05M | 1557.05M | 43671364K |

With `--minimal-output`: `wtdbg2 -t 8 --minimal-output -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl`

| JobName | AllocCPUS | Time | MaxDiskWrite | AveDiskWrite | MaxRSS |
|---|---|---|---|---|---|
| wtdbg2 | 8 | 01:21:43 | 1358.22M | 1358.22M | 43668708K |
`--minimal-output` makes the run about 26 minutes slower (00:55:45 vs. 01:21:43) with everything else the same, and writes about 200 MB less to disk on average.
I forked your repo; do you have any recommendations on how I can test changing the disk-write frequency?
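For reference, the figures in the tables above look like SLURM accounting fields; assuming that is where they come from, they can be reproduced for any finished job with something like this (the job ID is a placeholder):

```bash
# report cores, elapsed time, disk writes, and peak memory for a finished job
sacct -j <jobid> --format=JobName,AllocCPUS,Elapsed,MaxDiskWrite,AveDiskWrite,MaxRSS
```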
Thanks for the information. With `--minimal-output`, wtdbg2 writes the compressed core results to disk only once.
With `--minimal-output`, processing becomes slower, for reasons I do not understand, so it is not giving me the outcome I was expecting, which is to do more of the work in memory and write to disk at the end. So the question is: if I avoid `--minimal-output`, will wtdbg2 stop processing until a chunk is fully written to disk, or will processing continue while the data is being written? On my end, when the I/O is high, the CPU usage drops.
In https://github.com/ruanjue/wtdbg2/blob/b77c5657c8095412317e4a20fe3668f5bde6b1ac/filewriter.h I see that you have implemented parallel writing, but do you have any idea about my observation above?
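My observation about CPU use dropping during heavy I/O comes from watching the node; one minimal way to check it per process (a sketch, assuming the sysstat tools are installed on the compute node):

```bash
# sample CPU usage (-u) and disk I/O (-d) of the running wtdbg2 process every 5 seconds
pidstat -u -d -p "$(pgrep -x wtdbg2)" 5
```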
Please have a look at the usage of wtdbg2 (`wtdbg2 --help`):

> --minimal-output
> Will generate as less output files (<--prefix>.*) as it can
I was able to use your software optimally on our HPC setup with a sample of Axolotl data, thank you for that help. However, now that I am handling the real genome (`-x sq -X 80 -g 7.5g -L 5000`, input size 1.7 TB), it is going to take about 80 days on a single node. So I was wondering whether wtdbg2 can use multiple nodes (MPI)?
Try `-x rs -X 50 -g 7.5g` for a huge genome.
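For illustration, combined with the other options already used earlier in this thread, that suggestion would look roughly like the following (a sketch; the thread count, read-length cutoff, input variable, and output prefix are taken from the posts above or are placeholders, not part of the suggestion):

```bash
# -x rs: PacBio RSII preset; -X 50: keep roughly the best 50x depth of reads
# -g 7.5g: approximate genome size; -L 5000: drop reads shorter than 5000 bp
wtdbg2 -x rs -X 50 -g 7.5g -L 5000 -t 8 -i "${INPUT_FILE}" -fo asm
```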
Is it possible to specify the amount of memory (RAM) to be used, instead of having it automatically detect the amount of RAM?