google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Error (ccs software) - No space left on device (tmp file). #19

Closed AMMMachado closed 2 years ago

AMMMachado commented 2 years ago

Dear @pichuan,

Using the Docker system, I installed the second version of the software on our cluster and ran the tests successfully. Now, during tests with real data, we found some issues in the ccs software. Usually, we run the software on the nodes, and the output is written to the front-end. The front-end has more than 1 PB of space, while the nodes only have ~60 GB. The tmp files seem to be saved on the node, right? Is it possible to relocate these temp files to another path?

Below, you can consult the error.

20220123 11:27:44.871 | FATAL | Could not write BAM record to /tmp/13552.1.all.q/thread.7_0.ccs.bam
20220123 11:27:44.982 | FATAL | Caught existing deep IO exception, ignoring thread 13
20220123 11:27:44.985 | FATAL | Previous exception in DraftStage, aborting thread 13
(the "Previous exception in DraftStage, aborting thread N" message then repeats hundreds of times, for every worker thread 0-29)
20220123 11:27:44.987 | FATAL | Previous exception in Stage DraftPolish. Pumping buffers empty!
20220123 11:27:44.988 | FATAL | Exception thrown in CCSWF
20220123 11:27:52.068 | FATAL | ccs ERROR: [pbbam] BAM writer ERROR: could not write record: file: /tmp/13552.1.all.q/thread.7_0.ccs.bam.tmp reason: No space left on device

Best regards,

André

pichuan commented 2 years ago

Hi @AMMMachado , given that the question is about ccs, looking at ccs.how could be helpful. I'll also tag @armintoepfer to see if he has some advice.

And, for the ccs step, I wonder if it would help to try out the --chunk option.

armintoepfer commented 2 years ago

You should set your TMPDIR correctly. It's the canonical env variable in Linux and POSIX.
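A minimal sketch of that suggestion: export TMPDIR before launching ccs so its temporary BAM files land on a large filesystem instead of the node's small local /tmp. The directory path and BAM names below are placeholders, not values from this thread.

```shell
# Point ccs's temp files at a volume with enough free space before
# launching it. The path here is a placeholder; use any large filesystem
# visible to the compute node.
export TMPDIR="$HOME/ccs-tmp"
mkdir -p "$TMPDIR"

# ccs picks TMPDIR up from the environment; BAM names are placeholders.
# (Guarded so the sketch is a no-op on machines without ccs installed.)
command -v ccs >/dev/null && ccs --num-threads 32 movie.subreads.bam movie.ccs.bam || true
```

On a cluster, the export typically goes in the job script so each node's job inherits it.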

AMMMachado commented 2 years ago

Hi @pichuan and @armintoepfer,

Thank you for the help. After setting TMPDIR correctly, the ccs problem was solved. In the next step, with actc, we found another issue: the program ends with no output. I have re-checked the .pbi file with pbindex and all seems OK. Could it be a memory issue? Using the --log-level DEBUG and --log-file parameters, we obtained:

....
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/37072_49434
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/49478_61841
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/61890_74158
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/74202_86793
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/86839_99363
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/99403_111371
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/111416_123884
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/123928_135800
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/135845_147860
20220201 18:59:07.989 | DEBUG | CLR : m54336U_210430_050755/67863/147903_159727
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67863/159773_171810
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67863/171856_183907
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67863/183949_196222
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67863/196267_200986
20220201 18:59:07.990 | DEBUG | CCS : m54336U_210430_050755/67864/ccs
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67864/0_4256
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67864/4303_17307
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67864/17355_29860
20220201 18:59:07.990 | DEBUG | CLR : m54336U_210430_050755/67864/29905_37385

Additionally, we tried to use the --chunk option, but its usage is not clear to us. We have several machines with 32 CPUs / 64 GB RAM, 32 CPUs / 128 GB RAM, and 32 CPUs / 512 GB RAM. For both the ccs and actc software, how should we use the chunk parameter? Should we use a config file with the --chunk option set to 1/10, 2/10, 3/10, etc.?
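For reference, the chunked ccs workflow usually looks like the sketch below: each i/N chunk processes an independent slice of the ZMWs, so the chunks can run as separate cluster jobs in parallel. The file names, the chunk count, and the merge step are assumptions based on the ccs documentation, not commands confirmed in this thread.

```shell
# Write one ccs invocation per chunk to a jobs file; each line can be
# submitted as its own cluster job. BAM names are placeholders.
N=10
: > ccs_jobs.txt
for i in $(seq 1 "$N"); do
  echo "ccs --chunk ${i}/${N} movie.subreads.bam movie.ccs.${i}.bam" >> ccs_jobs.txt
done
cat ccs_jobs.txt

# After all chunks finish, merge and re-index the per-chunk output:
#   pbmerge -o movie.ccs.bam movie.ccs.*.bam && pbindex movie.ccs.bam
```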

Best regards,
André

pichuan commented 2 years ago

Hi @AMMMachado ,

Can you try again with the latest version of actc?

https://twitter.com/XLR/status/1488174497836511238 https://github.com/PacificBiosciences/actc

AMMMachado commented 2 years ago

Hi @pichuan,

Thank you. The new version of actc worked like a charm.

Regarding the --chunks option: for a *.subreads.bam dataset of 300 GB, ccs and actc run in 24 h on 32 CPUs / 64 GB RAM with no chunks. With the defaults, "deepconsensus run" on the same machine (24 h, 32 CPUs, 64 GB RAM) produced only 38 MB of HiFi .fq data. For both the ccs and actc software, how should we use the chunk parameter? Should we produce a config file with the --chunk option set to 1/10, 2/10, 3/10, etc.?

André

AMMMachado commented 2 years ago

Hi @pichuan,

Status update: in the last few days, we have optimized the software for our cluster. The chunks are no longer a problem, and all modules are working perfectly. We split each full SMRT Cell into 1,000 shards to make maximum use of our capacity. Right now, the only bottleneck is the DeepConsensus runtime and our limited capacity: "Processed 1000 ZMWs in 4693.153982400894 seconds". Do you have any recommendations to decrease the runtime of DeepConsensus?

André

pichuan commented 2 years ago

Thanks @AMMMachado for the update. Adding @MariaNattestad to give some advice on your latest number.

MariaNattestad commented 2 years ago

Hi @AMMMachado

The runtime you are reporting above is definitely higher than expected. See the runtime metrics page for runtimes we observed on different hardware configurations on GCP.

  1. This could be affected by factors like read length, though, so one thing you can try is to run the quick start and check its runtime on your system. What does that show? That will give us a hint as to whether the runtime difference you're seeing comes from your inputs or from your compute setup.
  2. DeepConsensus produces a runtime.csv file alongside the output fastq, which will tell you which steps in DeepConsensus are taking the longest. This depends on your compute setup too, but on CPU alone we usually see run_model taking on the order of 80% of the runtime. Focus just on the stages that start with "batch" since this gives an overview.
  3. You can try changing --batch_zmws=100 in a small experiment to see if that helps -- it won't explain the whole discrepancy but could give you a small speedup if preprocessing is the slower step.
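For step 2 above, a minimal sketch for summarizing a runtime.csv by stage: the column names "stage" and "runtime" are assumptions for illustration; adjust them to the header of the file your run actually produced.

```python
# Sum per-stage time from a DeepConsensus-style runtime.csv, so the
# slowest stage stands out. Column names are assumed, not confirmed.
import csv
import io
from collections import defaultdict

def stage_totals(lines):
    """Return {stage_name: total_seconds} from CSV text lines."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["stage"]] += float(row["runtime"])
    return dict(totals)

# Tiny made-up example showing the shape of the result:
example = io.StringIO("stage,runtime\nbatch_preprocess,12.0\nbatch_run_model,88.0\n")
print(stage_totals(example))
```

Filtering the result to keys starting with "batch" gives the per-stage overview mentioned above.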

Thanks! Maria

AMMMachado commented 2 years ago

Hi @MariaNattestad,

In the last few days, we evaluated the performance of DeepConsensus on our cluster, following your advice. We checked the runtime of the preprocessing and run_model modules; in all our tests, run_model took 85-95% of the total runtime.

We also identified two performance profiles, depending on the machine used.

Type 1 (32 CPUs / 64 GB - Intel(R) Xeon(R) CPU E5-2650L 0 @ 1.80GHz): on these machines, the jobs are faster and only use about 40-50% of the total CPUs.

Quick start data: "Processed 100 ZMWs in 238.841965 seconds"
Our datasets: "Processed 100 ZMWs in 406.733319 seconds"

Type 2 (32 CPUs / 64 GB - AMD Opteron(tm) Processor 6262 HE): on these machines, the jobs are slower and use about 90-95% of the total CPUs.

Quick start data: "Processed 100 ZMWs in 314.320736 seconds"
Our datasets: "Processed 100 ZMWs in 662.089745 seconds"

These results are consistent with the value mentioned above ("Processed 1000 ZMWs in 4693.153982400894 seconds"). In our case, the runtime issue is probably a mix of our limited capacity and the characteristics of the real dataset. Are these patterns of CPU usage per machine type expected?

Another important metric is memory usage. We found a pattern of about 10 GB of RAM used per 1,000 ZMWs on all machines. Perhaps you could include the memory requirements per 100 or 1,000 ZMWs in the quick start. Several users have local clusters/servers, and this value is important for calculating the number of shards per full SMRT Cell per machine.
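As a sketch, that observed ratio can be turned into a shard-sizing rule. The 10 GB per 1,000 ZMWs figure comes from this thread's measurements, and the 20% RAM headroom is an assumption, not an official requirement.

```python
# Size shards so each DeepConsensus job fits in node RAM, using the
# ~10 GB per 1,000 ZMWs observed in this thread (an assumption, not an
# official figure).
def max_zmws_per_shard(node_ram_gb, gb_per_1000_zmws=10.0, headroom=0.8):
    """Largest shard (in ZMWs) that fits in node RAM with 20% headroom."""
    return int(node_ram_gb * headroom / gb_per_1000_zmws * 1000)

# e.g. the 64 GB nodes in this thread:
print(max_zmws_per_shard(64))  # -> 5120
```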

We will run DeepConsensus on a full SMRT Cell, check the stats, and report here. Note: our initial analysis suggests a 20% increase in read coverage relative to the company's default process. If this holds up, the total cost per genome could decrease drastically.

If you produce a faster version of DeepConsensus, please let me know. Thank you for all the advice.

Best regards,
André

MariaNattestad commented 2 years ago

Thanks for this context and sharing the runtime numbers for various machines. I'll take your suggestion and make a note to see if we can do some memory usage profiling and provide more guidance on that.

As an update since parallelization was mentioned in the thread above, I just released a major change to the DeepConsensus quick start with detailed guidance for parallelization across multiple machines: https://github.com/google/deepconsensus/blob/r0.2/docs/quick_start.md

kishwarshafin commented 2 years ago

hi @AMMMachado ,

Please see the runtime metrics page for r0.3, which now includes memory profiling in the analysis. The --skip-windows-above parameter lets you adjust the runtime vs. accuracy trade-off you want to achieve. Please also see the change in the processing step with the --min-rq 0.88 parameter, which reduces the number of ZMWs that need to be processed.
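As a sketch, the two knobs mentioned above fit into the pipeline as command templates like the following. The flag spellings and the example --skip-windows-above threshold are assumptions taken from this thread and the r0.3 docs; verify both against each tool's --help for your release. BAM names are placeholders.

```shell
# Command templates for the two r0.3 tuning knobs discussed above
# (assumptions; verify flags with `ccs --help` / `deepconsensus run --help`).

# 1) At the ccs step, lower the rq cutoff so borderline ZMWs reach
#    DeepConsensus instead of being discarded:
ccs_cmd='ccs --min-rq 0.88 movie.subreads.bam movie.ccs.bam'

# 2) At the DeepConsensus step, skip windows already above a quality
#    threshold to trade a little accuracy for runtime (threshold assumed):
dc_cmd='deepconsensus run --skip-windows-above=45'   # plus the usual I/O flags

printf '%s\n%s\n' "$ccs_cmd" "$dc_cmd"
```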

pichuan commented 2 years ago

Hi @AMMMachado , I'll close this issue now. Please feel free to reach out if you have more questions.