morispi / HG-CoLoR

Hybrid method based on a variable-order de Bruijn Graph for the error Correction of Long Reads
GNU Affero General Public License v3.0

Disk quota exceeded when there is enough space on disk #8

Closed: haowenz closed this issue 5 years ago

haowenz commented 5 years ago

Hi,

I ran HG-CoLoR on my data set and got the following error. It is pretty weird, since I do have enough disk space.

My short reads are 150 bp long and the sequencing depth is 373x, which is why I set 'bestn' to 20 and 'solid' to 2, as suggested in other issues.

```
/project/HG-CoLoR/HG-CoLoR --longreads /ecoli_ont_2D_tmp/ecoli_ont_2D.fasta --shortreads /ecoli/miseq/ecoli.fastq --out /ecoli_ont_2D_output/hg-color_output/corrected_ecoli_ont_2D.fasta -K 135 --nproc 28 --kmcmem 64 --bestn 20 --solid 2 --tmpdir /ecoli_ont_2D_tmp/hg-color_tmp
[Wed Dec 12 20:37:10 EST 2018] Correcting the short reads
[Wed Dec 12 20:38:50 EST 2018] Removing short reads containing weak K-mers
[Wed Dec 12 20:42:15 EST 2018] Building the graph
[Wed Dec 12 20:44:15 EST 2018] Preparing the raw long reads temporary files
[Wed Dec 12 20:44:32 EST 2018] Aligning the short reads on the long reads
[Wed Dec 12 21:33:01 EST 2018] Preparing the alignments temporary files
Traceback (most recent call last):
  File "/project/HG-CoLoR/bin/filterOutShortAlignments.py", line 32, in <module>
    out = open(sys.argv[3] + curFile, "w")
OSError: [Errno 122] Disk quota exceeded: '/ecoli_ont_2D_tmp/hg-color_tmp/HGC_29225/Alignments/15951_2783'
```

morispi commented 5 years ago

Hi,

Sorry for taking so long to answer.

Pretty weird error indeed. The only time I ran into that problem was when I actually did not have enough disk space to store the alignments.

Which HG-CoLoR install are you using? Conda, or did you clone and compile the git repo?

Also, could you please provide me with the first few lines of your SR/LR alignments file? It should be located in the tmp directory you've chosen, under SR_on_LR.sam. It could help me see if there's any issue with your read ids.
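A minimal way to grab those lines (the path is an assumption based on the tmp directory in the log above; adjust it to the actual --tmpdir used for the run):

```python
# Minimal sketch: print the first few records of SR_on_LR.sam so the read
# ids can be inspected. The path below is an assumption, not a fixed location.
import itertools

SAM_PATH = "/ecoli_ont_2D_tmp/hg-color_tmp/SR_on_LR.sam"

with open(SAM_PATH) as sam:
    for line in itertools.islice(sam, 10):
        print(line.rstrip("\n"))
```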

Pierre

haowenz commented 5 years ago

I did some investigation several weeks ago. It seems that HG-CoLoR generates a very large number of files during the correction (one for each long/short read and for each alignment between them, I guess?). There is a limit on the number of files on the server I used to run HG-CoLoR, and I ran HG-CoLoR on several data sets at the same time. I guess that's why I got the quota exceeded error. It would be helpful if all these intermediate files could go into a small number of files.
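For anyone hitting the same wall, a minimal sketch to check whether the file count (rather than the byte count) is the problem, assuming the --tmpdir from the log above:

```python
# Minimal sketch: count how many files HG-CoLoR has created under its
# temporary directory, to test the per-user file-count (inode) quota
# hypothesis. The directory is an assumption taken from the log above.
import os

TMP_DIR = "/ecoli_ont_2D_tmp/hg-color_tmp"  # the --tmpdir used for the run

total = 0
for _root, _dirs, files in os.walk(TMP_DIR):
    total += len(files)
print(f"{total} files currently under {TMP_DIR}")
```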

BTW, I noticed that HG-CoLoR will split the read ids at a slash or space (I cannot remember the exact delimiter), so I renamed the reads and dropped all the comments as well.
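A minimal sketch of that renaming step (the file names are just placeholders, not part of HG-CoLoR):

```python
# Minimal sketch: give each long read a plain sequential id and drop the
# FASTA header comment, so no slash or whitespace is left for downstream
# tools to split on.
IN_FASTA = "ecoli_ont_2D.fasta"          # placeholder input
OUT_FASTA = "ecoli_ont_2D.renamed.fasta"  # placeholder output

with open(IN_FASTA) as fin, open(OUT_FASTA, "w") as fout:
    read_count = 0
    for line in fin:
        if line.startswith(">"):
            read_count += 1
            fout.write(f">read_{read_count}\n")  # new id, no comment
        else:
            fout.write(line)
```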

For installation, I installed EMBOSS and QuorUM using Conda, and then downloaded and built HG-CoLoR from the repo.

Another problem is that I randomly got segfaults on some data sets during the step "Generating the corrected long reads". One of them is as follows:

```
[Mon Dec 17 18:52:03 EST 2018] Correcting the short reads
[Mon Dec 17 18:52:42 EST 2018] Removing short reads containing weak K-mers
[Mon Dec 17 18:54:45 EST 2018] Building the graph
[Mon Dec 17 18:57:41 EST 2018] Preparing the raw long reads temporary files
[Mon Dec 17 18:58:42 EST 2018] Aligning the short reads on the long reads
[Mon Dec 17 20:14:22 EST 2018] Preparing the alignments temporary files
[Mon Dec 17 21:03:31 EST 2018] Generating the corrected long reads
/project/HG-CoLoR/HG-CoLoR: line 254: 19493 Segmentation fault (core dumped) $hgf/bin/CLRgen -t "$tmpdir" -K "$K" -d "$seedsdistance" -o "$seedsoverlap" -k "$k" -b "$branches" -s "$seedskips" -m "$mismatches" -j "$nproc" $tmpdir/"$K-mers.fa.pgsa" > "$out.fasta"
```

Sometimes it could still generate some corrected reads, but far fewer than in the original data set, which indicates that HG-CoLoR was interrupted at some point in that step. Since this seems to be a random error, I am not sure whether you could reproduce it on your machine, but I could send you some data if you want to investigate it a bit more.

Haowen

morispi commented 5 years ago

Hi,

Indeed, HG-CoLoR creates one file per LR, and one file per SR/LR alignment. I mainly did it this way to avoid extreme RAM usage, but didn't think it could cause such problems. I guess I could easily get rid of this constraint by loading all the LRs into memory using a 2-bit encoding (that wouldn't affect the RAM usage too much), and by reworking my multithreading process so that it doesn't need to "explode" the alignment file into multiple files.
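For illustration only (this is not HG-CoLoR's actual code), a minimal sketch of the 2-bit encoding idea, packing each A/C/G/T base into two bits so a read takes roughly a quarter of its text size in memory:

```python
# Minimal sketch of a 2-bit DNA encoding: each base maps to 2 bits, and the
# whole sequence is packed into one integer together with its length.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"

def pack(seq: str) -> tuple[int, int]:
    """Return (bit-packed integer, sequence length)."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | ENCODE[base]
    return bits, len(seq)

def unpack(bits: int, length: int) -> str:
    """Rebuild the sequence from its packed representation."""
    bases = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        bases.append(DECODE[(bits >> shift) & 0b11])
    return "".join(bases)

packed, n = pack("ACGTTGCA")
assert unpack(packed, n) == "ACGTTGCA"
```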

For the read ids with a slash or space, this is due to BLASR's behavior. It is the tool responsible for the splitting, and I cannot do much to avoid it, except adding a script at the beginning of the HG-CoLoR pipeline that would reformat the LRs.

About the segfault, my best guess is that you exceeded your stack size (the algorithm uses a lot of backtracking). It also happened to me on a few datasets. You could probably get rid of that segfault by increasing your stack size with e.g. ulimit -s 65536.
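For completeness, a minimal sketch of the same adjustment done programmatically with Python's standard resource module (65536 KiB, matching the ulimit value above); it can only raise the soft limit up to the hard limit allowed on the machine:

```python
# Minimal sketch: raise the soft stack limit before launching a
# deep-backtracking step, the programmatic equivalent of `ulimit -s 65536`.
# Note that setrlimit works in bytes, while `ulimit -s` works in KiB.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
wanted = 65536 * 1024  # 65536 KiB
new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
print("stack soft limit set to", new_soft, "bytes")
```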

I've also just seen and quickly gone through your preprint on bioRxiv. If you are willing to re-run the experiments on which HG-CoLoR failed, please do tell me so I can quickly fix the problems you ran into. :)

Best, Pierre

haowenz commented 5 years ago

Thanks for the reply.

We have submitted the manuscript, but if you could fix the problems, I could try to run it again and might add new results in a revised version later.

Thanks, Haowen

morispi commented 5 years ago

Hi Haowen,

Just released v1.1 of HG-CoLoR.

I took your issue into account, and no more temporary files are created. You should now be fine running HG-CoLoR if you still wish to. I also added a line in the main script to increase the maximum stack size, which should help with the segfaults you encountered.

Cheers, Pierre

haowenz commented 5 years ago

Got the following error:

HG-CoLoR: line 236: ulimit: stack size: cannot modify limit: Operation not permitted

I guess the limit cannot be changed on my machine.
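For what it's worth, "Operation not permitted" at that point usually means the requested soft stack limit exceeds the hard limit set for the account. A minimal sketch to check how much room there actually is:

```python
# Minimal sketch: show the current soft and hard stack limits, to see
# whether the script's ulimit call has any room to raise the soft limit.
import resource

def fmt(value):
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value // 1024} KiB"

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft stack limit:", fmt(soft))
print("hard stack limit:", fmt(hard))
```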