pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

smooth maf output file question #106

Closed userzxyz closed 2 years ago

userzxyz commented 3 years ago

Hello,

I am trying to understand the MAF format of the smooth.maf file. I have 18 genomes and built a graph for each chromosome. Here are some example lines from the smooth.maf output for one of the chromosomes:

a blocks=12-13 loops=false merged=true below_thresh=true
s Consensus_12-13     0 7250 +     7250 CCCTCCTACTCATCGGGGCCTGGCACTTGCCCCGACGGCCGGGTGTAGGTCGCGCGCTTAAGCGCCATCCATTTTCGGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACATTCCTTAGCGGA
s Sample2 0 7250 + 5243748              CCCTCCTACTCATCGGGGCCTGGCACTTGCCCCGACGGCCGGGTGTAGGTCGCGCGCTTAAGCGCCATCCATTTTCGGGGCTAGTTGATTCGGCAGGTGAGTTGTTACACATTCCTTAGCGGA

From the documentation for MAF files, I understand that this paragraph represents one multiple-alignment block. But I could not find any documentation for Consensus. Does this mean that, for block 12-13, only Sample2 aligns identically to the consensus? Thank you for any help!

ekg commented 3 years ago

The consensus is automatically generated in the POA step. It is the "heaviest bundle" of the MSA.

We have previously produced a consensus graph that consists of these segments only and links between them greater than a given length (-C). But this is disabled by default at the moment because the algorithm needs work to produce correct output.
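To make the block layout concrete, here is a minimal Python sketch that reads MAF text and separates the Consensus rows from the sample rows in each block. It assumes the standard MAF `s` line layout (source, start, size, strand, source size, aligned text) and that consensus rows are named with a `Consensus_` prefix, as in the example above. The function names are hypothetical, and the sequences in the example are shortened for illustration.

```python
def parse_maf(text):
    """Split MAF text into alignment blocks, each with its 's' rows."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An 'a' line opens a new block; keep its key=value annotations.
            meta = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            blocks.append({"meta": meta, "rows": []})
        elif fields[0] == "s" and blocks and len(fields) >= 7:
            # s <src> <start> <size> <strand> <srcSize> <text>
            src, start, size, strand, src_size, seq = fields[1:7]
            blocks[-1]["rows"].append({
                "src": src,
                "start": int(start),
                "size": int(size),
                "strand": strand,
                "src_size": int(src_size),
                "seq": seq,
            })
    return blocks


def split_consensus(block, prefix="Consensus_"):
    """Return (consensus_rows, sample_rows) for one block."""
    consensus = [r for r in block["rows"] if r["src"].startswith(prefix)]
    samples = [r for r in block["rows"] if not r["src"].startswith(prefix)]
    return consensus, samples


# Shortened, illustrative version of the block quoted in this thread.
EXAMPLE_MAF = """a blocks=12-13 loops=false merged=true below_thresh=true
s Consensus_12-13 0 4 + 4 CCCT
s Sample2 0 4 + 5243748 CCCT
"""

blocks = parse_maf(EXAMPLE_MAF)
consensus_rows, sample_rows = split_consensus(blocks[0])
# Samples whose aligned text matches the consensus row exactly.
identical = [r["src"] for r in sample_rows if r["seq"] == consensus_rows[0]["seq"]]
```

A block that lists only one sample next to the consensus, as here, contains just that sample's sequence; the other genomes are simply not part of that block.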

userzxyz commented 3 years ago

Thank you! I was trying to use the MAF output to confirm that the graph depicts real insertions/deletions. I want to extract the sequence of a sample in a particular genomic region where we know that sample has an insertion or deletion. For example, sample1, which is one of the samples in the graph, has a large deletion on chromosome 2 at 100456-101675. How can I relate this information to the graph to confirm that there actually is a deletion in that region? Thank you!

ekg commented 3 years ago

I suggest using vg deconstruct to get this kind of information from the graph.
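Once a VCF has been produced, checking a known deletion comes down to scanning the records that overlap the region and looking at the sample's genotype. The Python sketch below does this with plain text parsing. It assumes an ordinary VCF with a GT field and explicit REF/ALT sequences; the function name, the path name `chr2`, and the example records are hypothetical and only illustrate the idea.

```python
def deletions_in_region(vcf_text, chrom, start, end, sample):
    """Return (pos, deleted_length) for deletions carried by `sample`
    in records overlapping chrom:start-end (1-based, inclusive)."""
    hits = []
    columns = None
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            columns = fields  # header row gives the sample column positions
            continue
        if columns is None or fields[0] != chrom:
            continue
        pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
        ref_end = pos + len(ref) - 1
        if ref_end < start or pos > end:
            continue  # record does not overlap the region
        fmt = fields[8].split(":")
        sample_values = fields[columns.index(sample)].split(":")
        genotype = sample_values[fmt.index("GT")]
        for allele in genotype.replace("|", "/").split("/"):
            if allele.isdigit() and int(allele) > 0:
                alt = alts[int(allele) - 1]
                if len(alt) < len(ref):  # ALT shorter than REF: a deletion
                    hits.append((pos, len(ref) - len(alt)))
    return hits


# Hypothetical records; a real file would come from vg deconstruct.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "sample1", "sample2"]),
    "\t".join(["chr2", "100455", ".", "ACGTA", "A", "60", ".", ".", "GT", "1", "0"]),
    "\t".join(["chr2", "200000", ".", "A", "T", "60", ".", ".", "GT", "0", "1"]),
])

sample1_hits = deletions_in_region(EXAMPLE_VCF, "chr2", 100456, 101675, "sample1")
sample2_hits = deletions_in_region(EXAMPLE_VCF, "chr2", 100456, 101675, "sample2")
```

Note that coordinates in such a VCF are relative to whichever path was used as the reference, so the region has to be expressed on that path.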

userzxyz commented 3 years ago

Thank you @ekg! I noticed that the pggb graph arranges the samples in alphabetical order. I tried rearranging the sample order when making the chromosome-wise files, but the pggb graph was again in alphabetical order. Is there an option to rearrange the sample order?

ekg commented 3 years ago

In the graph, it should be in the input order in the FASTA. Are you sure that order isn't being respected?

In the MAF the order may be alphabetical though.
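One way to check this is to compare the record order in the input FASTA with the order of the path (`P`) lines in the output GFA. The Python sketch below assumes the paths are written as GFA v1 `P` lines and that path names equal the FASTA record names; the function names and the example data are hypothetical.

```python
def fasta_record_names(fasta_text):
    """Record names in file order (first token after '>')."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]


def gfa_path_names(gfa_text):
    """Path names in file order, taken from GFA v1 'P' lines."""
    return [line.split("\t")[1]
            for line in gfa_text.splitlines()
            if line.startswith("P\t")]


# Hypothetical input whose order is deliberately not alphabetical.
EXAMPLE_FASTA = ">sampleB#chr2\nACGT\n>sampleA#chr2\nACGT\n"
EXAMPLE_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "P\tsampleB#chr2\t1+\t*",
    "P\tsampleA#chr2\t1+\t*",
]) + "\n"

fasta_order = fasta_record_names(EXAMPLE_FASTA)
gfa_order = gfa_path_names(EXAMPLE_GFA)
order_respected = fasta_order == gfa_order
```

If the two lists differ only by being sorted, the reordering happened somewhere between the input file and the graph, which narrows down where to look.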

userzxyz commented 3 years ago

I am currently running another job to check this and will report back.

I also have a question about deconstruct, which I used as you suggested:
vg deconstruct sample.xg -g sample.gbwt > sample_deconstruct.vcf

I made sample.gbwt with:
vg gbwt -G graph.fixed.gfa -p -o sample.gbwt

and sample.xg with:
vg convert -x -g graph.fixed.gfa > sample.xg

I am trying to understand the output VCF format. How can I get rid of the Consensus entries among the sample names in the VCF header? I only want to keep the sample names that I used in graph construction.

subwaystation commented 3 years ago

If you want to remove the Consensus sample names, you have to remove them from the final smoothed GFA. We now do this by default when we call vg deconstruct. Please see https://github.com/pangenome/pggb/blob/c1886f8ce3c6bb229530130694ee14b323d57c53/pggb#L484.
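For an existing GFA, the same effect can be sketched in a few lines of Python by dropping the consensus path lines before building the xg and GBWT. This is not the code pggb uses (see the link above for that); it assumes the consensus sequences appear as GFA v1 `P` lines whose names start with `Consensus_`, as in the MAF example earlier in this thread. The function name and example graph are hypothetical.

```python
def drop_consensus_paths(gfa_text, prefix="Consensus_"):
    """Return the GFA text without path lines whose name starts with `prefix`."""
    kept = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # P lines are: P <path name> <segment list> <overlaps>
        if fields[0] == "P" and len(fields) > 1 and fields[1].startswith(prefix):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical three-path graph with one consensus path.
EXAMPLE_SMOOTHED_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "P\tConsensus_12-13\t1+\t*",
    "P\tSample1\t1+\t*",
    "P\tSample2\t1+\t*",
]) + "\n"

filtered_gfa = drop_consensus_paths(EXAMPLE_SMOOTHED_GFA)
```

Segments and links are left untouched, so the graph topology is unchanged; only the consensus paths, and therefore the extra sample names derived from them, disappear.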

subwaystation commented 3 years ago

@userzxyz Were you able to solve your problem?