vpc-ccg / sedef

Identification of segmental duplications in the genome
MIT License
26 stars 8 forks source link

Align buckets failing with "Error: vector::_M_default_append" #17

Closed mrvollger closed 4 years ago

mrvollger commented 4 years ago

I am consistently getting 1-2 jobs failing in the align bucket stage with a Error: vector::_M_default_append error. Any advice, or any output I can share to help track it down? Below are the log files for the failed jobs:


🍁  🐚    SEDEF 1.1-25-g1142a36-dirty; arguments:  (SSE4.1) sedef align generate -k 11 /tmp/mvollger/sedef/sedef_chm13.20200611.fasta chm13.20200611_sedef_out/align/bucket_0717
Read 2701 alignments in chm13.20200611_sedef_out/align/bucket_0717
Read total 2701 alignments
Detected FASTA translation index for /tmp/mvollger/sedef/sedef_chm13.20200611.fasta
Using k-mer size 11
 Processing 2701 out of 2701 (100.0%, len 27,544,704 to 16,271,353)Error: vector::_M_default_append
TIMING: 1034.58 38026708
🍁  🐚    SEDEF 1.1-25-g1142a36-dirty; arguments:(SSE4.1) sedef align generate -k 11 /tmp/mvollger/sedef/sedef_chm13.20200611.fasta chm13.20200611_sedef_out/align/bucket_0718
Read 2701 alignments in chm13.20200611_sedef_out/align/bucket_0718
Read total 2701 alignments
Detected FASTA translation index for /tmp/mvollger/sedef/sedef_chm13.20200611.fasta
Using k-mer size 11
 Processing 2701 out of 2701 (100.0%, len 34,981,350 to 32,192,521)Error: vector::_M_default_append
TIMING: 861.34 44675272

Thanks! Mitchell

inumanag commented 4 years ago

Hi @mrvollger

Seems like a memory issue... how much available memory do you have?

mrvollger commented 4 years ago

I am running on a machine with 2.5TB of ram so I don't think I am out of space. It is possible that it cannot find a very large continuous space in mem, but seems unlikely to me.

I tried rerunning with only one job at a time to see if I could rescue those specific failures and got the same error again.

Thanks! Mitchell

inumanag commented 4 years ago

Which genome are you using? We observed similar issues with alignments >= 2MB (very rare cases). If you could share the bucket and the genome, I can take a look.

mrvollger commented 4 years ago

It is an assembly of the human genome of CHM13. It has centromeric, HSAT and other sequence that is largely missing from hg38 and hg19 which might be causing issues. But I RepeatMasked and used TRF to mask and I think that would remove those regions.

sedef output dir: https://eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_2_sedef_out.tar.gz Genome assembly: https://eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_2_masked.fasta.gz These are the two buckets that failed:

chm13.20200611_2_sedef_out/log/align/bucket_0636.log
chm13.20200611_2_sedef_out/log/align/bucket_0637.log

Thanks for taking a look!

mrvollger commented 4 years ago

Hi Inumanag,

I was just wondering if you have had a chance to look at this.

Thanks! Mitchell

inumanag commented 4 years ago

Found the issue: it seems that the first part of the pipeline considers translated chromosomes as "normal" chromosomes and does not stop at the internal gaps, thus leaving the aligner to cope with ~30MB regions (and aligner just crashes here because these regions are highly repetitive). Will take me few days to fix this.

How important are those small contigs (they seem very repetitive?). You can filter them out and try running the pipeline again as a temporary fix.

mrvollger commented 4 years ago

Got it. They are less important so I may do that as a temp fix. Can you identify any of these contigs for me that cause problems? I expected small satellite contigs but those should have been ~100% masked so I thought they would not have an impact.

inumanag commented 4 years ago

Seems that utg* contigs are causing the trouble.

mrvollger commented 4 years ago

@inumanag I have another question. I know you suggested removing these contigs but as it turns out I need them. Do you think I could address this problem in a hacked way by: splitting them up so they are not in order and therefore do not make one contiguous stretch of repeats in the translated.fa, or by inserting random contigs that will separate them a bit in translated.fa?

inumanag commented 4 years ago

Hi @mrvollger

Can you please try branch #18 ? This one has it fixed. FYI it also drops the support for -t parameter because the translation is now done automatically. Let me know if you have any issues with it; if not, I can merge it later to the master branch.

mrvollger commented 4 years ago

Getting an error at the aln step. Looks like it is only making one job. I am sharing the output here but I will also be looking into it a bit more myself.

sedef output https://eichlerlab.gs.washington.edu/help/mvollger/share/out.tar.gz

sedef input: https://eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_masked.fasta

Start: Wed Jul 15 14:34:40 PDT 2020
SEDEF: FASTA=/tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta; output=chm13.20200611_2_sedef_out; jobs=80; force=y
Output file name chm13.20200611_2_sedef_out exists! Removing it.
************************************************************************
Running SD seeding...
0% 1:991=991s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16  /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 10 1 >chm13.20200611_2_sedef_out/seeds/10_1_n.bed 2>chm
0% 2:990=990s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 10 1 >chm13.20200611_2_sedef_out/seeds/10_1_y.bed 2>c
99% 985:7=6s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 30 30 >chm13.20200611_2_sedef_out/seeds/30_30_y.bed 2>
99% 986:6=6s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 30 30 >chm13.20200611_2_sedef_out/seeds/30_30_y.bed 2>
99% 986:6=6s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 30 30 >chm13.20200611_2_sedef_out/seeds/30_30_y.bed 2>
99% 986:6=6s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 30 30 >chm13.20200611_2_sedef_out/seeds/30_30_y.bed 2>
100% 992:0=0s /usr/bin/time -f'TIMING: %e %M' sedef search -k 12 -w 16 -r /tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta -t 30 30 >chm13.20200611_2_sedef_out/seeds/30_30_y.bed 2
Seeding time: 41:57.17
SD seeding done: done running 992 jobs!
Single-core running time: 15 hours (54136.8 seconds)
Memory used: 2262 MB
************************************************************************
Running SD alignment...
************************************************************************
Running SD alignment...
100% 1:0=0s /usr/bin/time -f'TIMING: %e %M' sedef align generate -k 11 "/tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta" chm13.20200611_2_sedef_out/align/bucket_???? >chm13.20200
Aligning time: 0:00.31
SD alignment done: finished 1 jobs!
Error: launched 1 jobs but completed only 0 jobs; exiting...
mrvollger commented 4 years ago

Might be an issue with the buckets. ???? where I think there should be numbers.

$cat  cat align.comm
/usr/bin/time -f'TIMING: %e %M' sedef align generate -k 11 "/tmp/mvollger/sedef/sedef_chm13.20200611_2.fasta" chm13.20200611_2_sedef_out/align/bucket_???? >chm13.20200611_2_sedef_out/align/bucket_????.aligned.bed 2>chm13.20200611_2_sedef_out/log/align/bucket_????.log

and the bucket command failed I think

cat bucket.log
🍁  🐚    SEDEF 1.1-28-g69b6ddb; arguments:  (SSE4.1) sedef align bucket -n 1000 chm13.20200611_2_sedef_out/seeds chm13.20200611_2_sedef_out/align
Error: Cannot open file
Double-check the parameters: run sedef --help for explanation.
Bucketing time: 0:00.00
inumanag commented 4 years ago

Please try the latest commit. Forgot to update sedef.sh ...

inumanag commented 4 years ago

Oh, and make sure to remove align* inside the working directory.

mrvollger commented 4 years ago

Running, will update soon.

mrvollger commented 4 years ago

The alignment jobs now work but final.bed is empty so something down stream may have been broken in the updates.

https://eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_2_sedef_out/

inumanag commented 4 years ago

Hi @mrvollger,

Weird--- I get a nice final.bed with ~300k entries (on macOS). Your aligned.bed file looks right, so I guess the problem is in the filtering stage. Mind running the following command manually?

sedef stats generate eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_masked.fasta eichlerlab.gs.washington.edu/help/mvollger/share/chm13.20200611_2_sedef_out/aligned.bed > _test

If _test is not empty, try getting the latest commit and running sedef.sh again without -f (it should just redo the last step).

mrvollger commented 4 years ago

Hmm, it fails for me with the latest version:

sedef stats generate ../chm13.20200611_2_masked.fasta aligned.bed > _test
🍁  🐚    SEDEF 1.1-30-g82fa6d3; arguments:  (SSE4.1) sedef stats generate ../chm13.20200611_2_masked.fasta aligned.bed
Processed hit 1 out of 909,784terminate called after throwing an instance of 'fmt::FormatError'
  what():  argument index out of range
Aborted (core dumped)

Reran without -f just incase and once again got an empty bed. In parallel I am going to start a new run in a new dir in case I made a mistake.

I really appreciate you debugging with me...

inumanag commented 4 years ago

Thanks a lot--- got it now. Will push new version in a few minutes. Apologies for back and forth--- seems that fmt library does not raise an exception on macOS for some reason...

inumanag commented 4 years ago

Could you please try it now (make sure to recompile)? Just the last step is OK. No need to rerun the whole pipeline.

mrvollger commented 4 years ago

Thanks, fingers crossed. It has processed >1000 hits without error so looks good so far. Will update when it is done.

mrvollger commented 4 years ago

Finished running and generated _test without error. _test is essentially an unsorted final.bed right?

inumanag commented 4 years ago

Awesome. Yes--- and you should be able to complete sedef.sh now and get final.bed.

mrvollger commented 4 years ago

Finished!

rm chm13.20200611_2_sedef_out/report.joblog.ok ; sedef.sh -o chm13.20200611_2_sedef_out -j 80 chm13.20200611_2_masked.fasta                     Start: Sat Jul 18 13:07:13 PDT 2020
SEDEF: FASTA=chm13.20200611_2_masked.fasta; output=chm13.20200611_2_sedef_out; jobs=80; force=n
Output file name chm13.20200611_2_sedef_out exists! Please delete chm13.20200611_2_sedef_out or run with -f/--force if you want to start anew.
************************************************************************
************************************************************************
************************************************************************
************************************************************************
Running SD reporting...

Processed hit 909,784 out of 909,784... done!
Report time: 33:59.97 (3544908 MB, user 922.38)
Line counts:
    909784 chm13.20200611_2_sedef_out/aligned.bed
    297175 chm13.20200611_2_sedef_out/final.bed
   7266320 chm13.20200611_2_sedef_out/potentials.bed
         1 chm13.20200611_2_sedef_out/SDs.browser.bed
         1 chm13.20200611_2_sedef_out/SDs.lowid.browser.bed
   7850879 chm13.20200611_2_sedef_out/seeds.bed
  16324160 total
End: Sat Jul 18 13:48:44 PDT 2020
************************************************************************
************************************************************************
SEDEF done! Final SDs available in chm13.20200611_2_sedef_out/final.bed

Thanks so much.

mrvollger commented 4 years ago

Noticed one more thing that was happening with a different run and no matter what the final.bed was empty. Turns out I had a sedef-translate index the fasta dir still and the code will try to use it if it is there resulting in an empty final.bed.

src/fasta.cc:    eprn("Detected FASTA translation index for {}", filename);

You may want to remove that bit before merging with master.

inumanag commented 4 years ago

Thanks--- will do!