rlorigro / GFAse

Tool for globally phasing diploid assembly graphs with orthogonal data
Mozilla Public License 2.0
36 stars 4 forks source link

Segfault when running GFAse #13

Closed ivargr closed 1 year ago

ivargr commented 1 year ago

Hi!

I'm trying out gfase on a rather small dataset (some simulated HiC reads and a simple graph), and get this error:

[0h 0m 0s] Loading GFA... Found index, loading from disk: "hifiasm.hic.p_ctg.gfai" ... diff index_file_stat, gfa_file_stat: 448 done Creating nodes... Creating edges... [0h 0m 0s] Writing IDs to file... [0h 0m 0s] Loading alignments as contact map... [0h 0m 0s] Finding alts with sequence homology... Using 122978293824730352 of 18446744073709551615 possible bins, for 6 iterations at a rate of 0.00666667 2602294 possible unique kmers in sequence 104091 kmers after downsampling 520455 bins allocated Beginning iteration: 0 Beginning iteration: 1 Beginning iteration: 2 Beginning iteration: 3 Beginning iteration: 4 Beginning iteration: 5 Aggregating overlap info... Filtering overlaps... Sorting... Found 1 pairs Aligning 0 pairs [0h 0m 0s] Done [0h 0m 0s] Removing self edges... [0h 0m 0s] Writing contacts to file... [0h 0m 0s] Optimizing phases... ---- 0 ---- Segmentation fault (core dumped)

The command I'm running is:

phase_contacts_with_monte_carlo -i hic.sorted_by_read_name.bam -g hifiasm.hic.p_ctg.gfa -o gfase --use_homology

I've attached the HiC reads and gfa I'm using:

data.zip

Are there any known issues you know of that can cause segfault? Could it be that the small dataset I'm using is too small? Is there any way I can debug this?

Edit: I see that I submitted and used the contig graph from hifiasm and not the unitig graph. However, the same error occurs with the unitig graph, and I assume it anyway shouldn't crash with the contig graph.

rlorigro commented 1 year ago

Hi @ivargr

This is because the homology step failed to find your bubbles (visible in the line Aligning 0 pairs).

I have recently begun addressing this issue in the hash_args branch by catching this error and generating a more informative error message, as well as adding some options to modify the homology parameters to fit your usage. Feel free to try that branch out if you are interested, or I will merge it soon.

Normally I would expect hifiasm contigs to pass the candidate filtering step. Could you please post the contents of the overlaps.csv that was generated during your above run? I took a quick look at your GFA and you have just two contigs, with fairly different sizes (1Mbp vs 1.5Mbp)

ivargr commented 1 year ago

Thanks for the reply!

It makes sense that it will fail when not finding bubbles in that graph.

I reran on a more complex graph, and it now seems that it finds bubbles (aligning 33 pairs). However, it is still failing with segfault:

Creating nodes... Creating edges... [0h 0m 0s] Writing IDs to file... [0h 0m 0s] Loading alignments as contact map... [0h 0m 0s] Finding alts with sequence homology... Using 122978293824730352 of 18446744073709551615 possible bins, for 6 iterations at a rate of 0.00666667 5412442 possible unique kmers in sequence 216497 kmers after downsampling 1082485 bins allocated Beginning iteration: 0 Beginning iteration: 1 Beginning iteration: 2 Beginning iteration: 3 Beginning iteration: 4 Beginning iteration: 5 Aggregating overlap info... Filtering overlaps... Sorting... Found 159 pairs Aligning 33 pairs [0h 0m 0s] Done [0h 0m 0s] Removing self edges... [0h 0m 0s] Writing contacts to file... [0h 0m 0s] Optimizing phases... ---- 0 ---- Segmentation fault (core dumped)

The overlaps.csv contains many lines, the first are:

name,other_name,score,total_hashes,similarity utg000021l,utg000042l,23925,24691,0.968977 utg000021l,utg000033l,443,24691,0.0179418 utg000021l,utg000029l,438,24691,0.0177393 utg000021l,utg000008l,436,24691,0.0176583 utg000021l,utg000030l,411,24691,0.0166457 utg000021l,utg000001l,380,24691,0.0153902 utg000021l,utg000002l,379,24691,0.0153497 utg000021l,utg000041l,378,24691,0.0153092 utg000021l,utg000014l,372,24691,0.0150662

I ran the above after compiling the hash_args branch, but it doesn't seem like it gives me any more output.

I've attached the graph, reads and the overlaps.csv.

data2.zip

rlorigro commented 1 year ago

Hi,

I’m guessing that this is now actually a different issue than the first one you posted. Can you share the full output directory, including any other CSVs? And ideally the BAM that you have used as well.

ivargr commented 1 year ago

Sure, this should be the output directory (created before it crashes):

gfase.zip

The bam is attached in my previous post (data2.zip).

rlorigro commented 1 year ago

Hi @ivargr ,

By loading the alignments.csv into bandage you can see that the homology detection is now working reasonably well:

graph

However, looking at your contacts.csv, there appear to be no contacts, because the file is empty. Now checking out your BAM file, I cannot find a single pair of reads that spans more than one contig:

SIM3C:1682514978:WGS:1:1:1:7     163  ptg000001l  560646   60  150M                          =           560969   323       GTTCGTTTGTTGAGTTAGCTGTAATGCCTTCGGAGTCAGCTGGCTCGCGATGGCGAATACACCGTGCA>
SIM3C:1682514978:WGS:1:1:1:8     83   ptg000001l  279507   60  150M                          =           279274   -233      CTAGGGCGGCAGGAAGGGCCATGCCAATGGAAAGGAAAAATCCTTGCGATATATATTTTAATTGCGAA>
SIM3C:1682514978:WGS:1:1:1:8     163  ptg000001l  279274   60  150M                          =           279507   233       GTCGTTATACGACCTGGTAGGGCCCATGATGGCTCTTTCAATAGTGTAGCCGTTATTGTTCCAAATGA>
SIM3C:1682514978:WGS:1:1:1:9     83   ptg000001l  889749   60  150M                          =           889486   -263      AGTGGCCAAGCTCAAAAGGAAGAACCCGATGAAAAAATCGAACAAGATATTAGCAAGAGTGACCAAAA>
SIM3C:1682514978:WGS:1:1:1:9     163  ptg000001l  889486   60  150M                          =           889749   263       TGTTAGCTGTCGGACTGCTTCCACCTTACTTTGAACTAGCCAAAAGGAAAGGTCGTGTTATTGGCATC>
SIM3C:1682514978:WGS:1:1:1:10    99   ptg000002l  289563   60  150M                          =           289845   282       TCAACAACCAAATCTTGCATTTCCCTGAAAATAGTGCTCACTTCTAATACACCCCTTGACAACTGAGT>
SIM3C:1682514978:WGS:1:1:1:10    147  ptg000002l  289845   60  150M                          =           289563   -282      TAATTATTTTGTAATACTCTGAACTTATTACTTTCAGTTTGTATCTTTTCTGCATAGATCTTCTGTAG>
SIM3C:1682514978:WGS:1:1:1:12    83   ptg000001l  949215   60  150M                          =           948953   -262      CTACATTTTTACTTGAAATTTCCGCTGTATTTTTTTGCATACGTGCGCTGGGTCTCTAACCACTGTAT>
SIM3C:1682514978:WGS:1:1:1:12    163  ptg000001l  948953   60  150M                          =           949215   262       TATCCCGAGTTCATGACAGACTGATTTTGTAATATCCTATCACTCCAGTCGTCCTCATCGATGTAAGA>
SIM3C:1682514978:WGS:1:1:1:14    83   ptg000001l  1243215  60  150M                          =           1243025  -190      AATTCAAACTCAAAACCTGGAGCACGACCACGACCAACATACTAATGATATGAGTGCTTCATCGAATG>
SIM3C:1682514978:WGS:1:1:1:14    163  ptg000001l  1243025  60  150M                          =           1243215  190       GGCTAGACAAGTCACGGCTTACTGCTAAATAACGTATACAATACGCTATGATGGAAGAATTCTCGTAC>
SIM3C:1682514978:WGS:1:1:1:16    83   ptg000001l  1404646  60  150M                          =           1404514  -132      TGAGCGGTGTTTGACGTTTGGTTTGCTGTAGAAATAGTACCTATTGGGAATTGTTGTGATGATTGAGG>
SIM3C:1682514978:WGS:1:1:1:16    163  ptg000001l  1404514  60  150M                          =           1404646  132       TTGGGTTGAGTGGTTGCAGTTGGAGCATATGGATTAGAGGCCTTTGGCAACGGAGCTGGAGGCACGCC>
SIM3C:1682514978:WGS:1:1:1:17    83   ptg000001l  1493569  60  42M1I107M                     =           1493330  -239      TTCGCATAGAAATTCGGTGCTTAACTTGTAATTGCACCTTTGTTTTTGAGAAATCCATGAATTTCGCG>
SIM3C:1682514978:WGS:1:1:1:17    163  ptg000001l  1493330  60  150M                          =           1493569  239       TCTATAATCTAAATTATTTTAGTGCCCATAATGCAGTATACTGTATCGTATAGGAGTTCTGACTTGCT>
SIM3C:1682514978:WGS:1:1:1:18    83   ptg000001l  1282545  60  113M1I36M                     =           1282278  -267      AGGAACTACGTCAATTAGTTTACAAAACATTCCCGATTACGTATGTCATATTCATCGAAGTTCGTAAG>
SIM3C:1682514978:WGS:1:1:1:18    163  ptg000001l  1282278  60  150M                          =           1282545  267       CGGGATGACTGTTCGCAACAGTTAATAGTGGGGTAGTTGCATCCATTGTATCTATGTTGTTTTATTTG>
SIM3C:1682514978:WGS:1:1:1:19    99   ptg000002l  930997   60  150M                          =           931191   194       AAAGTCTGCAGCCACTGTGGCCATTTACCAATGTTCTAAAGATGCCTATGATTTGTTCGATAAAGTCT>
SIM3C:1682514978:WGS:1:1:1:19    147  ptg000002l  931191   60  150M                          =           930997   -194      GTCCTACTGAAAGAATAAAGGACAAAGACATGGTCAAACATGGAATTATGATCCCACAAACGGCCTAT>

Unfortunately it seems that your simulator has no means to simulate contacts between contigs (which is understandable because it would be some work to code it). GFAse relies on the signal generated by intra-chromosomal contacts, which span at least 2 contigs. Self-contacts within contigs do not add any phasing signal.

I will add yet another exception to handle your case when no contacts are found, and print a more informative error message.

ivargr commented 1 year ago

Ah, you are absolutely correct that no hic reads are spanning contigs.

After fixing my simulation, so that I get enough reads, GFAse seems to work well. Yes, it would be nice with an exception when no contacts are found.

PS: This case was generated from a simulation/benchmarking pipeline I'm working on for phasing/scaffolding, which is able to simulate genomes and hic+hifi reads. It's very premature for now, but currently adding gfase and will play around with different tools (https://github.com/ivargr/hic-assembly-benchmarking).

Thanks a lot for looking into this! :)

rlorigro commented 1 year ago

Ah ok cool, glad it’s working now. When you say scaffolding, does that mean you will also be evaluating the paths (chains of contigs) that are generated by the tool?

Also I see that in your repository you are installing manually, but there is also a Docker image available if that would be useful. I will update the main branch soon.

ivargr commented 1 year ago

Nice with the Docker image, I'll try to get the snakemake pipeline working with that. Would also be very cool with a conda package (I know many people prefer that).

The plan is to also benchmark scaffolding in that pipeline (such as yahs). As I've understood GFAse doesn't really do scaffolding, so for now just playing around with GFAse and hifiasm HiC phasing to see how those compare.

rlorigro commented 1 year ago

I see, well there is a final step in which a path through the phased contigs is generated, which could be construed as scaffolding. We don't currently stitch the contigs that comprise the path for Hifiasm because Hifiasm GFAs have incorrect overlap information in their L lines, but we are looking at ways to realign them. However in any GFA which has blunt ends, we can "unzip" the haplotypes.

The docker image is here: https://hub.docker.com/r/meredith705/gfase/tags

There is also the WDL which automates the alignment steps if needed. You could reuse it in a frankenstein Snakemake/WDL pipeline or you could just use it for inspiration.