rlorigro / GFAse

Tool for globally phasing diploid assembly graphs with orthogonal data
Mozilla Public License 2.0
36 stars 4 forks source link

terminate called after throwing an instance of 'std::out_of_range' #23

Closed csuzhang closed 4 months ago

csuzhang commented 8 months ago

Hi: I'm running GFAse on Shasta assembly. The program is terminated when unzipping chains. Here is the log: Final unmerged score: 1.52998e+06 [0h 39m 24s] Writing phasing results to file... [0h 39m 24s] Chaining homologous sequences... [0h 39m 26s] Writing GFA... [0h 39m 33s] Unzipping chains... terminate called after throwing an instance of 'std::out_of_range' what(): at: key not present Command terminated by signal 6

jeizenga commented 8 months ago

Are you able to share the data that led to the crash?

csuzhang commented 8 months ago

The Hi-C data are Hi-C 1 and Hi-C 2. The Shasta assembly is uploaded to Google Drive Shasta assembly.

rlorigro commented 8 months ago

Hi @csuzhang, we are hoping to get to this soon. In the mean time, could you also try invoking --use_simple_chainer and see if the error persists?

rlorigro commented 8 months ago

Hi @csuzhang could you also share the command that you ran?

csuzhang commented 8 months ago

bwa index Assembly-Phased.fasta bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz | samtools sort -n -@ 47 - -o hic_to_assembly.sorted_by_read.bam phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase --skip_unzip -m 1 -t 48

jeizenga commented 8 months ago

Are these the exact commands you used? If so, I think the issue might have been the bwa mem command. For one, it looks like you used the R1 reads twice instead of both R1 and R2. It's still not good that we crash on this input, but the problem might resolve if you fix the input.

rlorigro commented 8 months ago

I think there must be a mismatch with the command that was run and the command posted above, because the stderr says [0h 39m 33s] Unzipping chains... but in the executable that would not appear if --skip_unzip was invoked:

    if (not skip_unzip) {
        cerr << t << "Unzipping chains... " << '\n';

        unzip(graph, id_map, overlaps, false, false);
        write_gfa_to_file(graph, id_map, overlaps, unzipped_gfa_path);
    }
csuzhang commented 8 months ago

I'm sorry that I made a mistake. The command I used is bwa index Assembly-Phased.fasta bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R2_001.fastq.gz | samtools sort -n -@ 47 - -o hic_to_assembly.sorted_by_read.bam phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase -m 1 -t 48

rlorigro commented 8 months ago

Ok thanks.

Were you also able to try invoking --use_simple_chainer? Since this is a Shasta graph you probably won't see a huge benefit from the hamiltonian chainer (which is the new default).

jeizenga commented 8 months ago

I have processed the data and unfortunately can't replicate this crash. However, I aligned with BWA command before you posted your command, which is slightly different. I will try to replicate it exactly to see if that accounts for the discrepancy. Alternatively, you could provide your BAM files directly. Can you also check which commit/release of GFAse you're using?

jeizenga commented 7 months ago

I was able to replicate the crash by using the exact read mapping command that you did, and it should now be resolved.

csuzhang commented 5 months ago

I'm sorry it took so long to reply to you. I have tried the newest version. The output of phase_0 and phase_1 is about 18M, and unphased is about 5.6G. The command I used: phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase -m 1 -t 48

rlorigro commented 5 months ago

Hi @csuzhang we have started looking at this again. To help speed things up could you please do the following:

  1. share the command that you ran to align the reads
  2. share the chained.gfa and the contacts.csv that was generated in your output directory
  3. also try rerunning with --use_simple_chainer to see if that resolves the problem (we will use this information to narrow down the bug)

Thank you

csuzhang commented 5 months ago
  1. The command I ran to align the reads: bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R2_001.fastq.gz | samtools sort -n -@ 48 - -o hic_to_assembly.sorted_by_read.bam
  2. chained.gfa; contacts.csv.gz
  3. When reruning with --use_simple_chainer, the output of phase_0 and phase_1 is about 2.5G and unphased is about 370M.
rlorigro commented 5 months ago

Thank you @csuzhang this is very helpful, we will work on this. In the mean time hopefully the simple chainer will get you what you need

rlorigro commented 5 months ago

@csuzhang could you also please share the phases.csv that you obtained from your original problematic run? It may help us debug faster

csuzhang commented 5 months ago

@rlorigro the files uploaded before may be wrong, here is the correct files contacts.csv.gz, phases.csv, chained.gfa, Assembly-Phased.gfa.

rlorigro commented 5 months ago

Ok thanks

jeizenga commented 5 months ago

@csuzhang For the Google Drive links, would you mind giving access to either joeizeng@gmail.com or jeizenga@ucsc.edu?

jeizenga commented 4 months ago

I believe this bug should now be fixed in the main branch. Thanks for pointing this issue out!