Closed sivico26 closed 1 year ago
About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.
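A quick way to confirm whether a graph file is missing its paths is to count GFA records by type. The snippet below is only a minimal sketch: it assumes a GFA v1 file in which the first tab-separated field is the one-letter record type (H, S, L, P), and the function names and the in-memory example are made up for illustration; they are not part of seqwish or pggb.

```python
from collections import Counter


def count_gfa_records(lines):
    """Count GFA v1 records by their one-letter type (H, S, L, P, ...)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    return counts


def has_paths(lines):
    """Return True if at least one P (path) record is present."""
    return count_gfa_records(lines)["P"] > 0


# Hypothetical toy graph with segments and a link, but no P lines.
example_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTTGA",
    "L\t1\t+\t2\t+\t0M",
]
example_counts = count_gfa_records(example_gfa)
print(dict(example_counts))
print("has P lines:", has_paths(example_gfa))
```

For a real file, the same functions can be fed an open file handle, e.g. `with open(path) as fh: print(count_gfa_records(fh))`.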
Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).
Thanks for answering, @AndreaGuarracino.
About questions 1 and 2, seqwish must always emit P lines. If you can somehow share your input FASTA file and PAF file, we could check what is triggering this buggy behavior of seqwish.
Unfortunately, I am on the receiving end of a consortium that generated the data, and I am not sure how tight the sharing policies are. If you are interested in hunting down this bug, I can ask and let you know.
Regarding the runtime problem in question 3, the identity threshold is the problem here (-p 85). We don't have very good alternatives to improve performance in this regard yet. For now, I would suggest exploring your dataset with a bit higher identity thresholds. Of note, using -p 90, wfmash will catch also mappings with a bit lower estimated identity, so it could work in your case. As for the segment length, I would suggest using 50k (or at most 100k if there are too big performance issues).
I see. I might give it a try, but it basically depends on the following: I have a question regarding the relationship between the parameters and the sensitivity/accuracy of the alignments found by wfmash. In your mind, all else being equal (input data and the other parameters), would you expect the alignments recovered with -p 90 to be a subset of the alignments recovered with -p 85? As far as I understand, this is the main difference (besides/at the expense of performance).
I am asking because I am already having severe under-alignment issues even at -p 80 for my data. So, unless for some reason I should expect higher sensitivity at, let's say, -p 90 than what I am already getting at -p 80, I see no point in trying to tweak performance if the results are not there in the first place.
Funnily enough, I also did some tests using minimap2 and it seems to yield good alignments (sensitive enough), but as @ekg pointed out in other issues (e.g. here), the current minimap2 implementation is basically unviable for this application performance-wise (time-wise it might work, but the memory consumption is excessive). I can expand on this under-alignment problem for high-divergence cases (~8% in my tests) in another issue if you are interested.
Regards,
Sivico
Hi @sivico26, I don't think the alignments recovered with -p 90 would be exactly a subset of the ones recovered with -p 85, but I would expect a strong overlap between the two sets of alignments. Have you already checked that?
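One rough way to check that kind of overlap is to compare which bases of a sequence are covered by each set of mappings. The sketch below is an illustration under stated assumptions, not something wfmash provides: it takes two lists of half-open (start, end) intervals on the same sequence (for instance, query coordinates pulled from two PAF files) and reports how much of the stricter set is also covered by the more permissive one. The function names and the toy intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_length(a, b):
    """Number of bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical coordinates standing in for the -p 90 and -p 85 runs.
strict = [(0, 100), (200, 300)]
permissive = [(0, 150), (250, 400)]

shared = intersection_length(strict, permissive)
shared_fraction = shared / covered_length(strict)
print(shared, shared_fraction)
```

A fraction close to 1 would mean the stricter run is nearly contained in the more permissive one at the base level; this ignores where on the target each mapping lands, so it is only a first-pass check.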
Hi @AndreaGuarracino,
I have not run the set at -p 90, and I think I won't, simply because with -p 85 I am seeing that only 1% of the bases between two homologous chromosomes are aligning. So, based on your expectation of strong overlap (which I share), it seems pointless to run the set at -p 90 if it would yield around the same 1%.
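For reference, a figure like that 1% can be estimated directly from a PAF file. The following is a minimal sketch that assumes the standard 12 mandatory PAF columns (query name, query length, query start, query end, strand, target name, target length, target start, target end, matches, block length, mapping quality) and counts each query base once even when mappings overlap. The function names and the toy records are invented for illustration, and this is not necessarily how the number above was computed.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def query_aligned_fraction(paf_lines):
    """Fraction of each query sequence covered by at least one mapping."""
    lengths = {}
    spans = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        name = fields[0]
        lengths[name] = int(fields[1])
        spans[name].append((int(fields[2]), int(fields[3])))
    return {
        name: sum(e - s for s, e in merge_intervals(spans[name])) / lengths[name]
        for name in lengths
    }


# Hypothetical PAF records: three mappings of chrA (1000 bp) onto chrB.
example_paf = [
    "chrA\t1000\t0\t100\t+\tchrB\t1200\t10\t110\t95\t100\t60",
    "chrA\t1000\t50\t150\t+\tchrB\t1200\t60\t160\t90\t100\t60",
    "chrA\t1000\t500\t510\t-\tchrB\t1200\t700\t710\t9\t10\t60",
]
fractions = query_aligned_fraction(example_paf)
print(fractions)
```

In this toy case the first two mappings overlap and merge into 150 bp, so together with the 10 bp mapping the covered fraction of chrA is 160/1000.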
Regarding question 3, we are working on improving the performance with lower identity thresholds. The bottleneck is the mapping phase, where the aligner (wfmash) has to find the homology map between all input sequences, given the input segment length and estimated identity threshold. Preliminary results are quite promising. Stay tuned, something could pop up soon (within weeks or a few months).
Hello,
I ran pggb (v0.4 from conda) to build a multispecies pangenome (see #226 for details). The command I used was the following:
After taking around 27 days in the mapping step, it finished early after an error emerged calling smoothxg. These are the last lines of input.fasta.gz.ef37fbb.417fcdf.3bf9c7c.smooth.08-17-2022_11:09:02.log
Digging into this problem, I found that it was similar to #133 and #182. Following @AndreaGuarracino's hint in those issues, I found out that seqwish yielded a .gfa without P lines! (To be precise, input.fasta.gz.ef37fbb.417fcdf.seqwish.gfa.)
From the above, I have the following questions:
[...] seqwish could have produced such a .gfa? Is this plausible in some scenario and tell us something about the input alignments (.paf) or on the contrary should be an impossibility?
[...] segment-length I chose (200k) has something to do with the result. I also understand that lowering this parameter (to maybe 50k?) might help. However, I also understand that doing so would increase the compute time (especially in the mapping step), which is already quite long. Is there a way to address this problem resuming the pipeline where it is now (after the mapping step)? Or do you think it would be necessary to start from scratch? If the latter, I would appreciate some input regarding the second question of #226.
Thank you for your time and for developing pggb.
Sivico