pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

Construct PanGenome #331

Closed Tonitsk8264 closed 11 months ago

Tonitsk8264 commented 11 months ago

I want to use pggb to construct a pan-genome of the genus Macaca from 36 genomes. I merged the 36 genomes into input.fa using the cat command and used samtools faidx to build the index. However, many similar warnings appeared, as follows:

[W::fai_insert_index] Ignoring duplicate sequence "Contig1787" at byte offset 85163274466
[W::fai_insert_index] Ignoring duplicate sequence "Contig1788" at byte offset 85163280643
...
...

Can you give me some advice regarding these warnings? Can you also advise me on which parameters to use to construct the Macaca pan-genome? The commands are as follows:

cat 1.fa 2.fa ... > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -p 98 -s 50000 -n 35 --skip-viz -o macaca -t 16 -k 79

AndreaGuarracino commented 11 months ago

You have to avoid sequence name collisions. One way is to add a prefix that is specific to each sample. See this tutorial for a way to do that.
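
As a rough illustration of the prefixing idea, the sketch below renames every header to the sample#haplotype#contig form that appears later in this thread (for example Msylv#1#scaffold5) and drops anything after the first whitespace. The file names sampleA.fa and sampleB.fa, their contents, and the fixed haplotype number 1 are hypothetical demo values, not taken from the original data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data: two tiny assemblies that share a contig name.
workdir=$(mktemp -d)
printf '>Contig1 Len=4\nACGT\n' > "$workdir/sampleA.fa"
printf '>Contig1 Len=4\nTTGA\n' > "$workdir/sampleB.fa"

# Rewrite each header as sample#1#contig and drop any description after
# the first whitespace. Haplotype 1 is assumed for every sample here.
for fasta in "$workdir"/sample*.fa; do
    sample=$(basename "$fasta" .fa)
    sed -E "s/^>([^[:space:]]+).*/>${sample}#1#\1/" "$fasta"
done > "$workdir/input.fa"

# List any sequence name that still occurs more than once.
# This prints nothing when all names are unique.
grep '^>' "$workdir/input.fa" | sort | uniq -d
```

With real data, the loop would run over the 36 assembly files, and the index would be built only after the merged file has been written.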

About the parameters, just start with the default values:

pggb -i input.fasta -n 36 -o output_folder -t 16

You have 36 genomes, so -n has to be 36.

Tonitsk8264 commented 11 months ago

Appreciate your prompt response

Tonitsk8264 commented 11 months ago

Sorry to bother you again. I used the following command to build a pan-genome (pggb version 0.5.4 and wfmash version 0.10.3):

pggb - i input.fa.gz - p 95- n 36- s 50000- k 79- o macaca - D macaca. tmp - t 16-- keep temp files -- multiqc - S - r > macaca.log 2> macaca. err

The log file reported the following error:

What(): stoi
Command terminated by signal 6
Wfmash - s 50000- l 250000- p 95- n 35- k 19- H 0.001- X - t 16-- tmp base macaca.tmp input.fa.gz - i macaca.tmp/input.fa.gz.3559691.mappings.wfmash.paf -- invert filtering
1.94s user 3.07s system 313% CPU 1.60s total 139220Kb max memory

Why does the wfmash -n parameter in the log file change to 35?

When I ran wfmash -h to view its parameters, I found that -n is described as follows:

-n [N], --num-mappings-for-segment=[N] Number of mappings to retain for each segment [default: 1]

Can you help me clear up my confusion?

AndreaGuarracino commented 11 months ago

Your command line is strange. It should be

pggb -i input.fa.gz -p 95 -n 36 -s 50000 -k 79 -o macaca -D macaca.tmp -t 16 --keep-temp-files --multiqc -S -r > macaca.log 2> macaca.err

-n is the number of haplotypes. You want to align each haplotype against the other n-1, which is why we specify -n 35 in wfmash. Be aware that this behavior changed in the latest pggb, where we now always give -n 1 to wfmash, but the concept holds: each haplotype is aligned against the other n-1.
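
One possible way to double-check the value for -n is to count the distinct sample#haplotype prefixes in the merged FASTA. The sketch below assumes headers already follow the sample#haplotype#contig form; the sample names A, B and C and the tiny sequences are hypothetical demo data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data: three haplotypes, one of them with two contigs.
workdir=$(mktemp -d)
printf '>A#1#chr1\nACGT\n>A#1#chr2\nACGT\n>B#1#chr1\nACGT\n>C#1#chr1\nACGT\n' \
    > "$workdir/input.fa"

# Keep the first two '#'-separated fields (sample and haplotype) of each
# header, then count the distinct combinations.
n_haps=$(grep '^>' "$workdir/input.fa" | cut -d '#' -f 1,2 | sort -u | wc -l | tr -d ' ')

echo "haplotypes found (value for pggb -n): $n_haps"
echo "each haplotype is aligned against the other $((n_haps - 1))"
```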

Tonitsk8264 commented 11 months ago

Sorry, the strange command was the result of a formatting error when I pasted it. I wonder whether pggb failed because the headers in input.fa.gz contain not only the ID but also other information, like this: >Msylv#1#scaffold5 Len=47473665. I have now changed the header lines and will see whether it runs successfully.

ekg commented 11 months ago

Make sure you rebuild the fasta index after making changes to the headers.
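
A simple way to notice a stale index is to compare modification times, sketched below. The demo FASTA, the hand-written stand-in for the old .fai file, and its backdated timestamp are all hypothetical; only the samtools faidx command itself comes from this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data: a FASTA whose headers were just edited.
workdir=$(mktemp -d)
fasta="$workdir/input.fa"
printf '>Msylv#1#scaffold5\nACGT\n' > "$fasta"

# Stand-in for an index built before the edit, backdated so that it is
# older than the FASTA (a real index would come from samtools faidx).
printf 'Msylv#1#scaffold5\t4\t32\t4\t5\n' > "$fasta.fai"
touch -t 202001010000 "$fasta.fai"

# -nt is true when the left file was modified more recently than the right.
if [ "$fasta" -nt "$fasta.fai" ]; then
    index_state="stale"
    echo "index is older than the FASTA; rebuild it with: samtools faidx $fasta"
else
    index_state="current"
    echo "index is at least as new as the FASTA"
fi
```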

Tonitsk8264 commented 11 months ago

Thank you very much for your reply. pggb now builds the pan-genome without problems, but it seems to take a long time.