GFA with no P lines - Githubissues

sivico26 commented 2 years ago

Hello,

I ran pggb (v0.4 from conda) to build a multispecies pangenome (see #226 for details). The command I used was the following:

outdir="out" 
idt="85"
nhaps="8" ## 7 genomes, but there is one that has 2 Haplotypes
match_chunk_size="200000" ## I now know this was a step in the wrong direction

pggb -i $input.gz -n $nhaps -s $match_chunk_size -p $idt -P asm20 -M -v -t $threads -o $outdir

After taking around 27 days in the mapping step, it finished early after an error emerged calling smoothxg. These are the last lines of input.fasta.gz.ef37fbb.417fcdf.3bf9c7c.smooth.08-17-2022_11:09:02.log

[...]
[seqwish::transclosure] 23422.924 100.00% building node_iitree and path_iitree indexes
[seqwish::transclosure] 23443.567 100.00% done
[seqwish::transclosure] 23443.579 done with transitive closures
[seqwish::compact] 23443.579 compacting nodes
[seqwish::compact] 23559.842 done compacting
[seqwish::compact] 23568.476 built node index
[seqwish::links] 23568.476 finding graph links
[seqwish::links] 23723.043 links derived
[seqwish::gfa] 23723.043 writing graph
[seqwish::gfa] 25058.076 done
seqwish -t 40 -s input.fasta.gz -p out/input.fasta.gz.ef37fbb.wfmash.paf -k 19 -f 0 -g out/input.fasta.gz.ef37fbb.417fcdf.seqwish.gfa -B 10000000 --temp-dir out -P
288232.02s user 15186.81s system 1210% cpu 25065.40s total 116391292Kb max memory
[smoothxg::main] loading graph
[smoothxg::main] prepping graph for smoothing
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create
Command terminated by signal 6
smoothxg -t 40 -T 40 -g out/input.fasta.gz.ef37fbb.417fcdf.seqwish.gfa -w 5600 -b out -X 100 -I .8500 -R 0 -j 0 -e 0 -l 700 -P 1,4,6,2,26,1 -O 0.001 -Y 800 -d 0 -D 0 -S -V -o out/input.fasta.gz.ef37fbb.417fcdf.3bf9c7c.smooth.1.gfa
105.66s user 6.04s system 104% cpu 106.59s total 8790608Kb max memory

Digging into this problem, I found that it was similar to #133 and #182. Following @AndreaGuarracino 's hint in those issues, I found out that the seqwish yielded .gfa without P lines! (To be precise input.fasta.gz.ef37fbb.417fcdf.seqwish.gfa).

From the above, I have the following questions:

Have there been advances or further understanding of this problem since the time of the aforementioned issues?
Do you know why seqwish could have produced such .gfa? Is this plausible in some scenario and tell us something about the input alignments (.paf) or on the contrary should be an impossibility?
From the other issues, I understand that the high segment-length I chose (200k) has something to do with the result. I also understand that lowering this parameter (to maybe 50k?) might help. However, I also understand that doing so would increase the compute time (especially in the mapping step), which is already quite long. Is there a way to address this problem resuming the pipeline where it is now (after the mapping step)? Or do you think it would be necessary to start from scratch? If the latter, I would appreciate some input regarding the second question of #226.

Thank you for your time and for developing pggb. Sivico

AndreaGuarracino commented 1 year ago

About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.

Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).

sivico26 commented 1 year ago

Thanks for asnwering @AndreaGuarracino

About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.

Unfortunately, I am on the receiving end of a consortium that generated the data, and I am not sure how tight are the sharing policies. if you are interested in hunting down this bug, I can ask and let you know.

Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).

I see. I might give it a try but basically depends on the following: I have a question regarding the relationship between the parameters and the sensitivity/accuracy of the alignments found by wfmash. In your mind, being everything else equal (input data and the other parameters), would you expect the alignments recovered with -p 90 to be a subset of the alignments recovered with -p 85? As far as I understand, this is the main difference (besides/at the expense of performance).

I am asking because I am already having severe under-alignment issues even at -p 80 for my data. So, unless for some reason I should expect higher sensitivity at, let's say, -p 90 than what I am already getting at -p 80, I see no point in trying to tweak performance if the results are not there in the first place.

Funny enough, I also did some tests using minimap2 and it seems to yield good alignments (sensitive enough), but as @ekg pointed out in other issues (e.g. here), the current minimap2 implementation is basically unviable for this application performance-wise (time-wise might work, but excessive memory consumption). I can expand on this under-alignment problem for high-divergence cases (in my tests ~8%) in another Issue if you are interested.

Regards Sivico

AndreaGuarracino commented 1 year ago

Hi @sivico26, I don't think the alignments recovered with -p 90 would be exactly a subset of the ones recovered with -p 85, but I would expect a strong overlap between the two sets of alignments. Perhaps, have you already checked that?

sivico26 commented 1 year ago

Hi @AndreaGuarracino,

I have not run the set at -p 90. And I think I won't. Simply because with -p 85 I am seeing that only 1% of the bases between two homolog chromosomes are aligning. So, based on your expectation of strong overlap (which I shared), it seems pointless to run the set -p 90 if it would yield around the same 1%.

AndreaGuarracino commented 1 year ago

Regarding question 3, we are working on improving the performance with lower identity thresholds. The bottleneck is the mapping phase, where the aligner (wfmash) has to find the homology map between all input sequences, given the input segment length and estimated identity threshold. Preliminary results are quite hot. Stay tuned, something could pop up soon (weeks / a very few months),

pangenome / pggb

GFA with no P lines #229