pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
355 stars 38 forks source link

Problems with skipping the wfmash step #314

Closed Lucio-Yang closed 1 year ago

Lucio-Yang commented 1 year ago

Hi, I skipped the wfmash step using the -a parameter when running PGGB 0.4.1. Then my program has been running for a long time, and the log file has been in the same state and there is no output. The following is my code and log file, and the paf file is obtained by using minimap2. The input fasta file is only about 46 Mb. Can anyone help me with a solution to this problem? Thanks.

_minimap2 Oryza_sativa.chr1.fa Oryza_indica.chr1.fa > sative_indica.paf pggb -i Oryza_sativa.chr1.fa.gz -a sativaindica.paf -p 90 -n 8 -k 11 -P asm20 -o 0.03 -G 13033,13177 -t 48 -V sativa# -o out -m

log file: [seqwish] WARNING: input alignment file sativa_indica.paf does not have CIGAR strings. The resulting graph will only represent the input sequences. [seqwish::seqidx] 0.002 indexing sequences [seqwish::seqidx] 0.503 index built [seqwish::alignments] 0.503 processing alignments [seqwish::alignments] 0.512 indexing [seqwish::alignments] 0.512 index built [seqwish::transclosure] 0.520 computing transitive closures [seqwish::transclosure] 0.534 0.00% 0-10000000 overlap_collect

ekg commented 1 year ago

You will probably need to complete some filtering on your minimap2 output.

wfmash exists in part because minimap2 does not make it easy to get parsimonious N:N alignments with maximal lengths. You will tend to get many short alignments between transposons and other repeats. If these are not filtered, the resulting graph (in seqwish) will be very complex, and take a long time to build even with relatively small inputs.

I would suggest that you spend some time looking at the alignments you've generated using pairwise alignment visualizers. The alignments should be 1:1 as much as possible. A scattering of many short matches will make the resulting graph very hard to understand and work with.

Is there a reason you cannot use wfmash? It is specifically designed to align whole T2T chromosome assemblies together, and it is oriented towards building pangenome graphs.

Lucio-Yang commented 1 year ago

Thank you very much for your reply!

The reason for not using wfmash is that we want to compare the impact of different alignment algorithms on the construction of a graph pan-genome.

What is puzzling is that after converting the maf file obtained by Anchorwave to paf(maf-convert.py+paftools.js), it is also used as the input file of PGGB, and the state of the program is the same as that of minimap2(running for a long time, and the log file has been in the same state and there is no output.). The paf file obtained by using Anchorwave has only 5 comparison results and all of them are long matches, so what should be the reason?

ekg commented 1 year ago

In that case, my advice may be inaccurate.

Are you running seqwish in a directory that is backed by a local disk, ideally a solid state SSD? If not, then the runtime can be very slow. It is worth profiling by looking at htop and seeing if the CPU usage is high. If it is high, then things are OK, if not there may be I/O problems.

If you do not have one available, then you can mount a ramdisk and work in that directory.

ekg commented 1 year ago

A comment on your command line:

-k 11 -P asm20 -o 0.03 -G 13033,13177

I would suggest using the default values here. Simply remove these arguments.

If you want seqwish to run faster, increasing the -k filter setting can help. For larger genomes, we see things get easy at around -k 79, and -k 127, although shorter lengths like -k 47 are good too. The default -k 19 can work, but in high-copy repeats the seqwish process can slow down. -k 11 can be very hard in large genomes with complex repeats.

Do you have 8 haplotypes in the input?

Lucio-Yang commented 1 year ago

I ran the program on HPC, and there should be no shortage of computing resources. In addition, I removed all the extra parameters as you said but still have the same problem as before, It has been running for 18 hours since yesterday.

pggb -i Oryza_sativa.chr1.fa.gz -a sativa_indica.paf -n 2 -k 127 -t 48 -V sativa# -o out -m

[seqwish::seqidx] 0.004 indexing sequences [seqwish::seqidx] 0.573 index built [seqwish::alignments] 0.573 processing alignments [seqwish::alignments] 1.192 indexing [seqwish::alignments] 1.194 index built [seqwish::transclosure] 1.199 computing transitive closures [seqwish::transclosure] 1.213 0.00% 0-10000000 overlap_collect

Lucio-Yang commented 1 year ago

I haven’t quite understood what this haplotype refers to. What I understand is the number of chromosomes. My input file is the first chromosome of two species. I set it to -n 2.

Lucio-Yang commented 1 year ago

Problem solved. The reason is that I didn’t merge the two fasta files, thanks again for your reply.

subwaystation commented 1 year ago

Happy to your that!