pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

How to handle when two contigs from the same assembly slightly overlap #329

Open Isoris opened 11 months ago

Isoris commented 11 months ago

Hello,

Thank you for providing this tool.

I'm currently building a species pangenome using Illumina assemblies. However, I've noticed that some contigs appear to overlap or map to the same region without collapsing when I use odgi extract followed by odgi viz.

Is there a method to collapse overlapping alignments into a single alignment for a sample?

Thank you for your assistance.

SIKU01_000692_SIKU01_000693_706654_708999_Operon_34

SIKU01_000323_SIKU01_000329_302761_310606_Operon_321

Thank you for your answer. Quentin

subwaystation commented 11 months ago

Hi @Isoris,

odgi viz uses binning to visualize the graph, so what you see is only a rough summary of the base-pair-level picture. Did you take a look at such regions with https://github.com/chfi/waragraph? In the 1D viz, you can zoom in and verify that the assemblies indeed overlap and do not have SNPs. In the 2D viz, you can take a closer look at the nodes and the path positions. This might help you get an idea of how to manually adjust your input sequences!

PGGB has no method to merge overlapping contigs. In odgi viz you can merge paths by prefix with -M, but that is for visualization purposes only, as it may not be 100% accurate.
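To illustrate why a binned view can suggest an overlap that is not there, here is a toy model. It is not odgi's actual implementation: the function name bins_touched, the bin width, and the contig coordinates are all made up for the example. Each contig is reduced to the range of bins it touches, and two contigs that merely abut end up sharing a bin.

```shell
# Toy model of binned rendering (an assumption for illustration, not odgi's code).
# stdin:  name start end   (0-based, end-exclusive positions on the pangenome axis)
# stdout: name first_bin last_bin
bins_touched() {
  bin_width="$1"
  awk -v w="$bin_width" '{ printf "%s %d %d\n", $1, int($2 / w), int(($3 - 1) / w) }'
}

# Two hypothetical contigs that abut at position 1500 and share no base pair.
printf 'contigA 0 1500\ncontigB 1500 4000\n' | bins_touched 1000
# contigA 0 1
# contigB 1 3
```

Both contigs touch bin 1 even though they do not share a single position, which is why zooming in to base-pair resolution is worth doing before concluding that two contigs really overlap.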

Isoris commented 11 months ago

Hi @subwaystation, thank you for your assistance.

In my bacterial dataset, comprising a minimal example of three bacteria, I observed the following results:

Without merging by prefixes, the visualization is as depicted here: viz_1

However, after merging the paths by prefixes, the graph alters to the representation shown below:

viz_2

My inquiry is: Is it feasible to extract the complete paths post-merge? Specifically, I am interested in obtaining the nodes and edges present on the left-hand side of the visual. My aim is to extract these "scaffolds" or paths, enabling me to subsequently remap my short reads onto them.

Ultimately, my goal is to produce a unified genome graph, as opposed to a fragmented genome graph. As evident from the left side of the visualization, the merged paths of blue and violet do share an overlap. This suggests that we possess the requisite positional or genomic context information. Using this, I hope to reconstruct a cohesive graph, wherein all nodes within this particular interval are interconnected, deriving from all the individual contigs.

I believe it is possible because I have a set of 80 samples of the same species from short read data and de-novo contigs and also 2 reference genomes.

I would be grateful for any suggestions. Quentin.

Isoris commented 11 months ago

For instance, for another subset of bacteria of the same species, we obtain this:

combined_renamed Run_2 fasta gz f7ea872 417fcdf 483d7ba smooth final og lay draw_multiqc

Without merging by prefixes, the visualization is as depicted here:

viz_paths

However, after merging the paths by prefixes, the graph alters to the representation shown below:

viz_merged_paths

We can clearly see that it is theoretically possible to integrate the floating "subgraphs" into the main graph, or at least to merge some of them into a larger graph.

alarawms commented 11 months ago

Dear @Isoris, thanks. I have a question: I ran into the same issue. How did you merge paths by prefixes?

Isoris commented 11 months ago

odgi viz -M

That merges paths by prefix, if I remember correctly. I will send you my code later in the afternoon.

Basically, the path names have to be identical up to the first separator.

alarawms commented 11 months ago

I did it, and it worked. The input file just needs the sample identifiers without the rest of the path name: no # for the haplotype or anything else. Pass it as a text file containing the list of names, one sample per line, and then the merging occurs.

22-prefix is a text file with sample names, one sample per line.

odgi viz -M ../../22-prefix -i out.og -o out.og-m.png
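A minimal sketch of how such a prefix file could be generated rather than typed by hand. It assumes the path names use # as the separator (as in sample#haplotype#contig naming); the function name unique_prefixes, the sample names, and the output file name prefixes.txt are hypothetical.

```shell
# Read path names on stdin and print each distinct prefix once.
# The prefix is the text before the first separator (default "#", an assumption);
# names without a separator are kept whole.
unique_prefixes() {
  sep="${1:-#}"
  cut -d "$sep" -f 1 | sort -u
}

# Hypothetical path names, for illustration only.
printf 'sampleB#1#contig_12\nsampleB#1#contig_13\nsampleA#1#chr\n' | unique_prefixes
# sampleA
# sampleB

# With a real graph (assumes odgi is installed and out.og is the graph used above):
#   odgi paths -i out.og -L | unique_prefixes > prefixes.txt
#   odgi viz -M prefixes.txt -i out.og -o out.og-m.png
```

As noted earlier in the thread, the merge only affects the picture; the paths stored in the graph stay as they are.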

Isoris commented 11 months ago

Wow, I never knew about that. Thank you so much for the tips.