nadegeguiglielmoni / GraphUnzip

Unzip assembly graphs with Hi-C data and/or long reads.
GNU General Public License v3.0
25 stars 1 forks source link

Graphaligner prints "Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed." #19

Closed ykkim0127 closed 2 years ago

ykkim0127 commented 2 years ago

Hi. I'm trying to run graphunzip using clr assembly with Hi-C and 10X Genomics data. The problem is clr, when running Graphaligner this prints some "failed" lines as below but running for 5 or more days with 68GB gaf file. I wonder 1) what this means, 2) is it okay to use this gaf file for further analysis and 3) when it ends. The input for clr is about 148G.

src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/2426748/53596_62084. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/3735880/0_18706. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/4000373/0_20174. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/7602798/20181_41804. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/9374255/0_26356. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/9766432/5579_25226. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/9766522/53343_176379. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/14092620/0_5016. Seed: 0+,0,0,0 src/GraphAlignerBitvectorCommon.h:1118: Assertion 'previous.node(neighbor).endSlice.scoreEnd >= scoreHere-(eq?0:1)' failed. Read: m64062_191225_080113/16255005/0_33636. Seed: 0+,0,0,0 ...

I've used this commands to use three datas.

For Hi-C

$hicstuff pipeline -t 24 --mapping=iterative -o mapping/ -g assembly-gfa.fasta -e MboI Hi-C_read_1.fq.gz Hi-C_read_2.fq.gz $graphunzip.py HiC-IM -g assembly_graph.gfa -m /mapping/abs_fragments_contacts_weighted.txt -F fragments_list.txt -i hic_interactionmatrix.txt

For CLR

$GraphAligner -g assembly_graph.gfa -f Raw/m64062_m64032_subreads.fa -a m64062_m64032_2.gaf -t 64 -x vg

For 10X Genomics

$bwa mem -t 80 assembly-gfa.fasta Raw/10X_1.fq.gz 10X_2.fq.gz -C | samtools view -bh -@ 80 -o aln-10x-asm_re.bam

RolandFaure commented 2 years ago

Hi, To answer your questions sequentially: 1) I'm not sure exactly. I see you've opened an issue on GraphAligner's GtiHub, my guess is that the GFA assembly contains contigs of length 0, or contigs that are shorter than the CIGAR strings of their overlaps 2) Yes, the gaf file is still valid to use for GraphUnzip. If my guess above is correct and the alignment fails in some specific regions, GraphUnzip won't unzip these regions, but the rest of the assembly should be unzipped fine. 3) Alignment with GraphAligner is a bottleneck for speed for now. To see how close you are to the end, you can go look directly to the end of the GAF file (using less+G), you'll see what is the last read GraphAligner processed. Since GraphAligner processes the reads of the input file sequentially, if the last read was read_500000 and you have 1000000 in the input file, you're halfway through. To speed up this step you can select a subset of the long reads, preferably the longest of them. A rule of thumb is that your .fa file should be 20 times bigger than your .gfa file.

ykkim0127 commented 2 years ago

Thanks for explaining ! The problem is gaf file is over 70GB, raw data is 148GB. I checked process of gaf file (count number of sequence) and it is not even half. It seems same query is aligned to several nodes with various identity. Then should I put some threshold options (--multimap-score-fraction , --min-alignment-score)? And would it be ok to run graphunzip ? or should I proceed with default option ?

RolandFaure commented 2 years ago

Generally speaking, you have little to fear with tweaking the parameters of GraphAligner, it should not mislead GraphUnzip (except if you're really trying to). The question of multimapping is actually very interesting and important, it is crucial that GraphAligner retains only the best alignment. You should use the parameter --multimap-score-fraction 1 (this will make GraphAligner faster too). I'm going to add this in the README. If you don't want to re-run GraphAligner, you can filter the existing file and retain only the best alignment using awk and fields 10 and 11 or 12 of gaf format. You can probably use the gaf file as it is, your raw data is probably huge compared to the size of the genome.

ykkim0127 commented 2 years ago

I got it. Thanks alot 👍