Closed csuzhang closed 4 months ago
Are you able to share the data that led to the crash?
The Hi-C data are Hi-C 1 and Hi-C 2. The Shasta assembly is uploaded to Google Drive Shasta assembly.
Hi @csuzhang, we are hoping to get to this soon. In the mean time, could you also try invoking --use_simple_chainer
and see if the error persists?
Hi @csuzhang could you also share the command that you ran?
bwa index Assembly-Phased.fasta
bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz | samtools sort -n -@ 47 - -o hic_to_assembly.sorted_by_read.bam
phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase --skip_unzip -m 1 -t 48
Are these the exact commands you used? If so, I think the issue might have been the bwa mem
command. For one, it looks like you used the R1
reads twice instead of both R1
and R2
. It's still not good that we crash on this input, but the problem might resolve if you fix the input.
I think there must be a mismatch with the command that was run and the command posted above, because the stderr says [0h 39m 33s] Unzipping chains...
but in the executable that would not appear if --skip_unzip
was invoked:
if (not skip_unzip) {
cerr << t << "Unzipping chains... " << '\n';
unzip(graph, id_map, overlaps, false, false);
write_gfa_to_file(graph, id_map, overlaps, unzipped_gfa_path);
}
I'm sorry that I made a mistake. The command I used is
bwa index Assembly-Phased.fasta
bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R2_001.fastq.gz | samtools sort -n -@ 47 - -o hic_to_assembly.sorted_by_read.bam
phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase -m 1 -t 48
Ok thanks.
Were you also able to try invoking --use_simple_chainer
? Since this is a Shasta graph you probably won't see a huge benefit from the hamiltonian chainer (which is the new default).
I have processed the data and unfortunately can't replicate this crash. However, I aligned with BWA command before you posted your command, which is slightly different. I will try to replicate it exactly to see if that accounts for the discrepancy. Alternatively, you could provide your BAM files directly. Can you also check which commit/release of GFAse you're using?
I was able to replicate the crash by using the exact read mapping command that you did, and it should now be resolved.
I'm sorry it took so long to reply to you. I have tried the newest version. The output of phase_0 and phase_1 is about 18M, and unphased is about 5.6G.
The command I used: phase_contacts_with_monte_carlo -i hic_to_assembly.sorted_by_read.bam -g Assembly-Phased.gfa -o gfase -m 1 -t 48
Hi @csuzhang we have started looking at this again. To help speed things up could you please do the following:
chained.gfa
and the contacts.csv
that was generated in your output directory--use_simple_chainer
to see if that resolves the problem (we will use this information to narrow down the bug)Thank you
bwa mem -t 48 -5 -SP Assembly-Phased.fasta HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R1_001.fastq.gz HG002.HiC_2_NovaSeq_rep1_run2_S1_L001_R2_001.fastq.gz | samtools sort -n -@ 48 - -o hic_to_assembly.sorted_by_read.bam
--use_simple_chainer
, the output of phase_0 and phase_1 is about 2.5G and unphased is about 370M.Thank you @csuzhang this is very helpful, we will work on this. In the mean time hopefully the simple chainer will get you what you need
@csuzhang could you also please share the phases.csv
that you obtained from your original problematic run? It may help us debug faster
@rlorigro the files uploaded before may be wrong, here is the correct files contacts.csv.gz, phases.csv, chained.gfa, Assembly-Phased.gfa.
Ok thanks
@csuzhang For the Google Drive links, would you mind giving access to either joeizeng@gmail.com or jeizenga@ucsc.edu?
I believe this bug should now be fixed in the main branch. Thanks for pointing this issue out!
Hi: I'm running GFAse on Shasta assembly. The program is terminated when unzipping chains. Here is the log:
Final unmerged score: 1.52998e+06
[0h 39m 24s] Writing phasing results to file...
[0h 39m 24s] Chaining homologous sequences...
[0h 39m 26s] Writing GFA...
[0h 39m 33s] Unzipping chains...
terminate called after throwing an instance of 'std::out_of_range'
what(): at: key not present
Command terminated by signal 6