Closed ivargr closed 1 year ago
Hi @ivargr
This is because the homology step failed to find your bubbles (visible in the line Aligning 0 pairs
).
I have recently begun addressing this issue in the hash_args
branch by catching this error and generating a more informative error message, as well as adding some options to modify the homology parameters to fit your usage. Feel free to try that branch out if you are interested, or I will merge it soon.
Normally I would expect hifiasm contigs to pass the candidate filtering step. Could you please post the contents of the overlaps.csv that was generated during your above run? I took a quick look at your GFA and you have just two contigs, with fairly different sizes (1Mbp vs 1.5Mbp)
Thanks for the reply!
It makes sense that it will fail when not finding bubbles in that graph.
I reran on a more complex graph, and it now seems that it finds bubbles (aligning 33 pairs). However, it is still failing with segfault:
Creating nodes... Creating edges... [0h 0m 0s] Writing IDs to file... [0h 0m 0s] Loading alignments as contact map... [0h 0m 0s] Finding alts with sequence homology... Using 122978293824730352 of 18446744073709551615 possible bins, for 6 iterations at a rate of 0.00666667 5412442 possible unique kmers in sequence 216497 kmers after downsampling 1082485 bins allocated Beginning iteration: 0 Beginning iteration: 1 Beginning iteration: 2 Beginning iteration: 3 Beginning iteration: 4 Beginning iteration: 5 Aggregating overlap info... Filtering overlaps... Sorting... Found 159 pairs Aligning 33 pairs [0h 0m 0s] Done [0h 0m 0s] Removing self edges... [0h 0m 0s] Writing contacts to file... [0h 0m 0s] Optimizing phases... ---- 0 ---- Segmentation fault (core dumped)
The overlaps.csv contains many lines, the first are:
name,other_name,score,total_hashes,similarity utg000021l,utg000042l,23925,24691,0.968977 utg000021l,utg000033l,443,24691,0.0179418 utg000021l,utg000029l,438,24691,0.0177393 utg000021l,utg000008l,436,24691,0.0176583 utg000021l,utg000030l,411,24691,0.0166457 utg000021l,utg000001l,380,24691,0.0153902 utg000021l,utg000002l,379,24691,0.0153497 utg000021l,utg000041l,378,24691,0.0153092 utg000021l,utg000014l,372,24691,0.0150662
I ran the above after compiling the hash_args
branch, but it doesn't seem like it gives me any more output.
I've attached the graph, reads and the overlaps.csv.
Hi,
I’m guessing that this is now actually a different issue than the first one you posted. Can you share the full output directory, including any other CSVs? And ideally the BAM that you have used as well.
Sure, this should be the output directory (created before it crashes):
The bam is attached in my previous post (data2.zip).
Hi @ivargr ,
By loading the alignments.csv into bandage you can see that the homology detection is now working reasonably well:
However, looking at your contacts.csv
, there appear to be no contacts, because the file is empty. Now checking out your BAM file, I cannot find a single pair of reads that spans more than one contig:
SIM3C:1682514978:WGS:1:1:1:7 163 ptg000001l 560646 60 150M = 560969 323 GTTCGTTTGTTGAGTTAGCTGTAATGCCTTCGGAGTCAGCTGGCTCGCGATGGCGAATACACCGTGCA>
SIM3C:1682514978:WGS:1:1:1:8 83 ptg000001l 279507 60 150M = 279274 -233 CTAGGGCGGCAGGAAGGGCCATGCCAATGGAAAGGAAAAATCCTTGCGATATATATTTTAATTGCGAA>
SIM3C:1682514978:WGS:1:1:1:8 163 ptg000001l 279274 60 150M = 279507 233 GTCGTTATACGACCTGGTAGGGCCCATGATGGCTCTTTCAATAGTGTAGCCGTTATTGTTCCAAATGA>
SIM3C:1682514978:WGS:1:1:1:9 83 ptg000001l 889749 60 150M = 889486 -263 AGTGGCCAAGCTCAAAAGGAAGAACCCGATGAAAAAATCGAACAAGATATTAGCAAGAGTGACCAAAA>
SIM3C:1682514978:WGS:1:1:1:9 163 ptg000001l 889486 60 150M = 889749 263 TGTTAGCTGTCGGACTGCTTCCACCTTACTTTGAACTAGCCAAAAGGAAAGGTCGTGTTATTGGCATC>
SIM3C:1682514978:WGS:1:1:1:10 99 ptg000002l 289563 60 150M = 289845 282 TCAACAACCAAATCTTGCATTTCCCTGAAAATAGTGCTCACTTCTAATACACCCCTTGACAACTGAGT>
SIM3C:1682514978:WGS:1:1:1:10 147 ptg000002l 289845 60 150M = 289563 -282 TAATTATTTTGTAATACTCTGAACTTATTACTTTCAGTTTGTATCTTTTCTGCATAGATCTTCTGTAG>
SIM3C:1682514978:WGS:1:1:1:12 83 ptg000001l 949215 60 150M = 948953 -262 CTACATTTTTACTTGAAATTTCCGCTGTATTTTTTTGCATACGTGCGCTGGGTCTCTAACCACTGTAT>
SIM3C:1682514978:WGS:1:1:1:12 163 ptg000001l 948953 60 150M = 949215 262 TATCCCGAGTTCATGACAGACTGATTTTGTAATATCCTATCACTCCAGTCGTCCTCATCGATGTAAGA>
SIM3C:1682514978:WGS:1:1:1:14 83 ptg000001l 1243215 60 150M = 1243025 -190 AATTCAAACTCAAAACCTGGAGCACGACCACGACCAACATACTAATGATATGAGTGCTTCATCGAATG>
SIM3C:1682514978:WGS:1:1:1:14 163 ptg000001l 1243025 60 150M = 1243215 190 GGCTAGACAAGTCACGGCTTACTGCTAAATAACGTATACAATACGCTATGATGGAAGAATTCTCGTAC>
SIM3C:1682514978:WGS:1:1:1:16 83 ptg000001l 1404646 60 150M = 1404514 -132 TGAGCGGTGTTTGACGTTTGGTTTGCTGTAGAAATAGTACCTATTGGGAATTGTTGTGATGATTGAGG>
SIM3C:1682514978:WGS:1:1:1:16 163 ptg000001l 1404514 60 150M = 1404646 132 TTGGGTTGAGTGGTTGCAGTTGGAGCATATGGATTAGAGGCCTTTGGCAACGGAGCTGGAGGCACGCC>
SIM3C:1682514978:WGS:1:1:1:17 83 ptg000001l 1493569 60 42M1I107M = 1493330 -239 TTCGCATAGAAATTCGGTGCTTAACTTGTAATTGCACCTTTGTTTTTGAGAAATCCATGAATTTCGCG>
SIM3C:1682514978:WGS:1:1:1:17 163 ptg000001l 1493330 60 150M = 1493569 239 TCTATAATCTAAATTATTTTAGTGCCCATAATGCAGTATACTGTATCGTATAGGAGTTCTGACTTGCT>
SIM3C:1682514978:WGS:1:1:1:18 83 ptg000001l 1282545 60 113M1I36M = 1282278 -267 AGGAACTACGTCAATTAGTTTACAAAACATTCCCGATTACGTATGTCATATTCATCGAAGTTCGTAAG>
SIM3C:1682514978:WGS:1:1:1:18 163 ptg000001l 1282278 60 150M = 1282545 267 CGGGATGACTGTTCGCAACAGTTAATAGTGGGGTAGTTGCATCCATTGTATCTATGTTGTTTTATTTG>
SIM3C:1682514978:WGS:1:1:1:19 99 ptg000002l 930997 60 150M = 931191 194 AAAGTCTGCAGCCACTGTGGCCATTTACCAATGTTCTAAAGATGCCTATGATTTGTTCGATAAAGTCT>
SIM3C:1682514978:WGS:1:1:1:19 147 ptg000002l 931191 60 150M = 930997 -194 GTCCTACTGAAAGAATAAAGGACAAAGACATGGTCAAACATGGAATTATGATCCCACAAACGGCCTAT>
Unfortunately it seems that your simulator has no means to simulate contacts between contigs (which is understandable because it would be some work to code it). GFAse relies on the signal generated by intra-chromosomal contacts, which span at least 2 contigs. Self-contacts within contigs do not add any phasing signal.
I will add yet another exception to handle your case when no contacts are found, and print a more informative error message.
Ah, you are absolutely correct that no hic reads are spanning contigs.
After fixing my simulation, so that I get enough reads, GFAse seems to work well. Yes, it would be nice with an exception when no contacts are found.
PS: This case was generated from a simulation/benchmarking pipeline I'm working on for phasing/scaffolding, which is able to simulate genomes and hic+hifi reads. It's very premature for now, but currently adding gfase and will play around with different tools (https://github.com/ivargr/hic-assembly-benchmarking).
Thanks a lot for looking into this! :)
Ah ok cool, glad it’s working now. When you say scaffolding, does that mean you will also be evaluating the paths (chains of contigs) that are generated by the tool?
Also I see that in your repository you are installing manually, but there is also a Docker image available if that would be useful. I will update the main branch soon.
Nice with the Docker image, I'll try to get the snakemake pipeline working with that. Would also be very cool with a conda package (I know many people prefer that).
The plan is to also benchmark scaffolding in that pipeline (such as yahs). As I've understood GFAse doesn't really do scaffolding, so for now just playing around with GFAse and hifiasm HiC phasing to see how those compare.
I see, well there is a final step in which a path through the phased contigs is generated, which could be construed as scaffolding. We don't currently stitch the contigs that comprise the path for Hifiasm because Hifiasm GFAs have incorrect overlap information in their L lines, but we are looking at ways to realign them. However in any GFA which has blunt ends, we can "unzip" the haplotypes.
The docker image is here: https://hub.docker.com/r/meredith705/gfase/tags
There is also the WDL which automates the alignment steps if needed. You could reuse it in a frankenstein Snakemake/WDL pipeline or you could just use it for inspiration.
Hi!
I'm trying out gfase on a rather small dataset (some simulated HiC reads and a simple graph), and get this error:
The command I'm running is:
I've attached the HiC reads and gfa I'm using:
data.zip
Are there any known issues you know of that can cause segfault? Could it be that the small dataset I'm using is too small? Is there any way I can debug this?
Edit: I see that I submitted and used the contig graph from hifiasm and not the unitig graph. However, the same error occurs with the unitig graph, and I assume it anyway shouldn't crash with the contig graph.