pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

how to handle the assembled contigs that cannot be aligned to the backbone reference genome? #176

Open biozzq opened 2 years ago

biozzq commented 2 years ago

Dear all,

When generating a graph genome according to the pipeline https://github.com/pangenome/HPRCy1, I found that some long assembled contigs (generated using hifiasm) cannot be aligned to the reference genome. Thus, if we subset the contigs by reference chromosome, these unmapped contigs will be lost from the final graph genome. What do you think about this problem? Any suggestions and comments are welcome. Thank you in advance.

Sincerely, Zheng zhuqing

ekg commented 2 years ago

If I remember correctly, there is a newer version of that workflow which uses more sensitive split remapping to attempt to collect these contigs as well. https://github.com/pangenome/HPRCyear1v2genbank

You may find that this helps to collect the rest. As I recall, the unplaced contigs are a tiny fraction of the data, and few map even with sensitive splitting.

It is possible to map the unplaced contigs against all contigs in each chromosome partition.

We are working on a preprocessing script that uses all to all mapping followed by Louvain community detection to drive the partitioning. This should become the preferred partitioning method. @AndreaGuarracino what's the status of this?

AndreaGuarracino commented 2 years ago

@biozzq, I've prepared a tutorial on sequence partitioning which will guide you to use the scripts provided in the pggb repo. It applies the Leiden algorithm to identify the communities in the all-to-all mappings computed with wfmash. The guide also shows how to generate partitioned FASTA files from the communities detected.

Let us know if it works decently for you. Any feedback would be appreciated.
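
For readers who want to see the idea behind mapping-driven partitioning, here is a minimal sketch. It is not the pggb scripts or the tutorial's method: it groups sequences by plain connected components over the all-to-all mappings, a much cruder stand-in for the Leiden community detection mentioned above. It assumes the mappings are in PAF format (query name in column 1, target name in column 6); the sequence names and the `paf_line` helper are made up for illustration.

```python
from collections import defaultdict


def paf_line(query, target):
    """Build a minimal 12-column PAF record (hypothetical helper for the demo)."""
    return "\t".join(
        [query, "1000", "0", "900", "+", target, "5000", "100", "1000", "850", "900", "60"]
    )


def parse_paf_pairs(lines):
    """Yield (query, target) name pairs, skipping short lines and self-mappings."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        query, target = fields[0], fields[5]
        if query != target:
            yield query, target


def connected_components(pairs):
    """Group sequence names that are linked by at least one mapping (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    # Largest groups first, ties broken by member names for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Made-up example: two chromosome-like groups and one self-mapping that is ignored.
EXAMPLE_PAF = [
    paf_line("sampleA#1#ctg1", "ref#1#chr1"),
    paf_line("sampleB#1#ctg7", "ref#1#chr1"),
    paf_line("sampleA#1#ctg2", "ref#1#chr2"),
    paf_line("sampleB#1#ctg9", "sampleA#1#ctg2"),
    paf_line("sampleA#1#ctg3", "sampleA#1#ctg3"),
]

if __name__ == "__main__":
    for i, group in enumerate(connected_components(parse_paf_pairs(EXAMPLE_PAF))):
        print(i, ",".join(group))
```

Connected components will merge two chromosomes as soon as a single spurious mapping links them, which is one reason a modularity-based method that weighs mappings is preferable on real data.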

biozzq commented 2 years ago

Dear @ekg @AndreaGuarracino

Thank you for your prompt reply. I have tried the tutorial provided by @AndreaGuarracino. It improved things a lot, and only a small fraction of contigs cannot be aligned in the all-to-all mapping. I wonder whether we can pool these unmapped contigs into one community, process it with pggb, and finally add the result to the graph genome.

Best regards, Zheng zhuqing
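
As a rough illustration of the pooling step asked about above, the sketch below lists the sequences that never appear in any non-self mapping, so that they could be written out as one extra partition. It assumes a samtools-style `.fai` index (sequence name in the first column) and PAF mappings; the function names and example records are hypothetical. Whether a pool of mutually unalignable contigs gives a useful graph is not settled in this thread.

```python
def read_fai_names(fai_lines):
    """Return sequence names from .fai index lines, in file order."""
    names = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if line:
            names.append(line.split("\t", 1)[0])
    return names


def unmapped_sequences(all_names, paf_lines):
    """Return names that are neither query nor target of any non-self mapping."""
    mapped = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        query, target = fields[0], fields[5]
        if query != target:
            mapped.update((query, target))
    return [name for name in all_names if name not in mapped]


# Made-up example input: three sequences, one mapping.
EXAMPLE_FAI = [
    "ref#1#chr1\t5000\t12\t60\t61\n",
    "sampleA#1#ctg1\t1000\t5100\t60\t61\n",
    "sampleA#1#ctg8\t700\t6200\t60\t61\n",
]
EXAMPLE_MAPPINGS = [
    "\t".join(
        ["sampleA#1#ctg1", "1000", "0", "900", "+",
         "ref#1#chr1", "5000", "100", "1000", "850", "900", "60"]
    ),
]

if __name__ == "__main__":
    pool = unmapped_sequences(read_fai_names(EXAMPLE_FAI), EXAMPLE_MAPPINGS)
    # One name per line, suitable as a region list for extracting a FASTA subset.
    print("\n".join(pool))
```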

biozzq commented 2 years ago

Dear all,

Thank you for your prompt reply. I have tried the tutorial provided by @AndreaGuarracino. It improved things a lot, and only a small fraction of contigs cannot be aligned in the all-to-all mapping. I wonder whether we can pool these unmapped contigs into one community, process it with pggb, and finally add the result to the graph genome.

Sorry, I would like to follow up to make sure you saw my previous comment.

Best regards,

Zheng zhuqing

ekg commented 2 years ago

Andrea has been working through an approach based on community detection on graphs (Louvain) that should be able to partition even these, so long as they can map among each other.

The longer term solution is to cluster assembly graph components, and also to take assembly graphs as input.

biozzq commented 2 years ago

Dear @ekg

I tried the community-based approach. My species of interest has 20 chromosomes (18 autosomes plus an X and a Y chromosome), and the backbone reference genome contains 20 chromosomes and ~600 scaffolds. Community detection found a total of 136 communities, and each chromosome has its own community. There are two issues that I would like to ask for your help with:

1. Some backbone scaffolds are in the same community as a chromosome. Can I remove these scaffolds from that community before running pggb?
2. How should I deal with the communities that contain only scaffolds, and with those that contain no backbone sequences at all?

Thank you in advance.

Best regards, Zheng zhuqing
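
Before deciding how to treat each of the 136 communities, it may help to tabulate which kind each one is. The sketch below is only bookkeeping, not a recommendation from the pggb authors: it labels every community as containing a backbone chromosome, containing only backbone scaffolds, or containing no backbone sequence at all. The labels, function name, and sequence names are hypothetical.

```python
def classify_communities(communities, chromosomes, scaffolds):
    """Label each community by the backbone sequences it contains.

    communities: iterable of iterables of sequence names
    chromosomes: set of backbone chromosome names
    scaffolds:   set of backbone scaffold names
    """
    report = []
    for index, members in enumerate(communities):
        members = set(members)
        chroms = sorted(members & chromosomes)
        scafs = sorted(members & scaffolds)
        if chroms:
            label = "chromosome"      # may also hold backbone scaffolds
        elif scafs:
            label = "scaffold_only"   # backbone present, but no chromosome
        else:
            label = "no_backbone"     # assembly contigs only
        report.append(
            {
                "community": index,
                "label": label,
                "chromosomes": chroms,
                "scaffolds": scafs,
                "size": len(members),
            }
        )
    return report


# Made-up example covering the three cases described in the comment above.
EXAMPLE_COMMUNITIES = [
    ["ref#1#chr1", "ref#1#scaf12", "sampleA#1#ctg1"],
    ["ref#1#scaf40", "sampleB#1#ctg5"],
    ["sampleA#1#ctg8", "sampleB#1#ctg9"],
]

if __name__ == "__main__":
    rows = classify_communities(
        EXAMPLE_COMMUNITIES,
        chromosomes={"ref#1#chr1"},
        scaffolds={"ref#1#scaf12", "ref#1#scaf40"},
    )
    for row in rows:
        print(row["community"], row["label"], row["size"], ",".join(row["scaffolds"]))
```

The `scaffolds` column of a `chromosome` community shows exactly which backbone scaffolds would be dropped if they were removed before running pggb.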