neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License

Export not generating final tree #78

Closed brigidar closed 2 months ago

brigidar commented 2 months ago

I am trying to create the pangenome from 15 genomes of S. xylosus. I can create all the files following the tutorial, but when I do the export to generate files for panX I am missing the strain_tree.nwk file. I have all the files in the geneCluster folder. Is there an additional step that needs to be done to get the tree? Do I need to add metadata to get it?

mmolari commented 2 months ago

Hi @brigidar, thank you for your feedback. I just tried re-running the steps from the tutorial with the E. coli genomes, and after the export command this is the structure of the folder I obtain:

ecoli_export
└── vis
    ├── coreGenomeTree.json
    ├── core_na_reduced.fa.gz
    ├── geneCluster
    ├── geneCluster.json
    ├── metaConfiguration.js
    ├── metainfo.tsv
    ├── strainMetainfo.json
    └── strain_tree.nwk

So at least on these data it seems to work as expected. Note that the strain_tree.nwk file is not in the geneCluster folder but in the parent vis folder.
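For anyone checking their own export, the folder comparison above can be sketched in a few lines of Python. The file names are copied from the listing above; treating that list as the complete expected output, and the helper name `missing_export_files`, are assumptions of this sketch and not part of pangraph.

```python
from pathlib import Path

# Entries copied from the folder listing above. Assumption: this is the
# full set of files a successful export is expected to produce.
EXPECTED = [
    "coreGenomeTree.json",
    "core_na_reduced.fa.gz",
    "geneCluster",
    "geneCluster.json",
    "metaConfiguration.js",
    "metainfo.tsv",
    "strainMetainfo.json",
    "strain_tree.nwk",
]


def missing_export_files(export_dir):
    """Return the expected entries that are absent from <export_dir>/vis."""
    vis = Path(export_dir) / "vis"
    return [name for name in EXPECTED if not (vis / name).exists()]
```

Calling `missing_export_files("ecoli_export")` returns an empty list when everything is in place, and a list containing `strain_tree.nwk` when only the tree is missing.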

If on your side this is still not working could you maybe share your data? This way I can test on it and see at what point it is failing.

Best, Marco

brigidar commented 2 months ago

Hi Marco, when I run the example data it works fine. I think the problem might be in the polish command. I tried to do it from scratch, and now it gets killed at that step and the initial build file is bigger than before. Is the polish step necessary to get strain_tree.nwk, or can I export directly after the build?

mmolari commented 2 months ago

Hi brigidar, yes, that step can be skipped. It's only used to further improve the quality of the multiple sequence alignments, and for 15 genomes it might not be needed. Another reason I can think of for the missing tree is that pangraph cannot find any core genome. If your isolates are diverse enough that pangraph cannot find any single-copy block present in every one of them, then it cannot generate a core-genome alignment and build the tree, but I don't think this should be the case...
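The core-genome condition described here (a block that occurs exactly once in every genome) can be checked with a short Python sketch. The function `core_block_ids` works on a plain mapping from genome name to the list of block ids along that genome. The loader `load_paths` assumes the graph JSON has a top-level `paths` list whose entries carry a `name` and a `blocks` list of objects with an `id`; that layout is an assumption and should be checked against the actual file.

```python
import json
from collections import Counter


def core_block_ids(paths):
    """Return ids of blocks that occur exactly once in every genome.

    paths: dict mapping genome name -> list of block ids along that genome.
    """
    n_genomes = len(paths)
    single_copy_in = Counter()  # block id -> number of genomes where it is single-copy
    for block_ids in paths.values():
        for block_id, copies in Counter(block_ids).items():
            if copies == 1:
                single_copy_in[block_id] += 1
    return sorted(b for b, n in single_copy_in.items() if n == n_genomes)


def load_paths(graph_json):
    """Read genome paths from a graph file (ASSUMED JSON layout, see lead-in)."""
    with open(graph_json) as fh:
        graph = json.load(fh)
    return {p["name"]: [b["id"] for b in p["blocks"]] for p in graph["paths"]}
```

An empty result from `core_block_ids(load_paths("graph.json"))` would mean no block qualifies as core. Every path counts as a genome in this sketch, so a single extra path that shares no blocks with the others is enough to empty the list.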

brigidar commented 2 months ago

So I tried again and I have the same issue. When I look at the graph in Bandage it looks very similar to the E. coli one. There are a lot of loops that are shared by all 15 genomes. I am not sure how to assess whether there is enough shared sequence to generate the core-genome alignment. Is there a flag I can look for? I am attaching the graph I am getting (image: 20240515_prepolish).

mmolari commented 2 months ago

Indeed, this looks like a typical bacterial genome graph. Another possibility that comes to mind is that fasttree is not installed or not working properly (see optional dependencies). You could try typing fasttree -help in the environment where you're running pangraph to see whether the command is missing from the path. Alternatively, if you're using publicly available data, you could give me the accession numbers of your genomes and I can try to reproduce the issue on my side.
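The same path check can be scripted, for example to run it inside a batch job on a server. This is a minimal sketch using Python's standard `shutil.which`. The helper name `find_executable` is hypothetical, and trying the capitalised `FastTree` as a second candidate is an assumption about how some installations name the binary.

```python
import shutil


def find_executable(candidates=("fasttree", "FastTree")):
    """Return the full path of the first candidate found on PATH, else None.

    The second default name is an assumption; only `fasttree` is
    mentioned in this thread.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

A return value of `None` means the command is not visible in the current environment. A non-`None` path only shows that the binary exists, not that it runs correctly.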

Thanks for taking the time to test this issue!

brigidar commented 2 months ago

I have a mix of deposited and newly sequenced data. I am attaching the accession numbers of the deposited ones. I did not try to run it with just those, so I am not sure how it looks with only those 10 genomes. fasttree shows up as running just fine. I am running it on a server that has some read and write restrictions, so I am not sure whether that could cause a problem at some point. I could not find a way to make it verbose to see where it might be failing. I can also share the JSON file through Box if that might help.

GCA_000706685.1 | ASM70668v1
GCA_000709415.1 | ASM70941v1
GCA_000953575.1 | Staphylococcus xylosus C2a
GCA_002078255.1 | ASM207825v1
GCA_007814625.1 | ASM781462v1
GCA_020229655.1 | ASM2022965v1
GCA_020229695.1 | ASM2022969v1
GCA_020229715.1 | ASM2022971v1
GCA_023716225.1 | ASM2371622v1

mmolari commented 2 months ago

I tried downloading the 9 records listed above from GenBank, building a pangenome graph with pangraph and exporting the results for PanX. Everything seems to work as expected, and this produced the strain_tree.nwk file.

One thing I had to take care of was removing plasmids from the fasta files downloaded from genbank. Some files had plasmid records together with the chromosome record, and these need to be excluded to build the graph for the chromosome.
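One way to do this filtering is sketched below in plain Python. It drops every FASTA record whose header line mentions a keyword, by default "plasmid". That plasmid records can be recognised from their header text is an assumption of this sketch, so the headers of each file should be checked by eye first. The function name `drop_plasmid_records` is hypothetical.

```python
def drop_plasmid_records(fasta_text, keyword="plasmid"):
    """Return FASTA text without records whose header mentions `keyword`.

    Assumption: plasmid records carry the keyword in their header line.
    The match is case-insensitive.
    """
    kept = []
    keep = False  # any text before the first header is dropped
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new record starts here: decide once per record.
            keep = keyword.lower() not in line.lower()
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

The function works on the file contents as a string, so it can be applied with `Path(...).read_text()` and `write_text()` to each downloaded file. Writing to a new file name keeps the original download available for comparison.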

As you suggest, this might be due to restrictions on the server or to memory limits. If you have a machine that supports pangraph, you can also try running the export command locally: 15 isolates are few enough that it should finish in a few minutes. Otherwise, if you want to share the JSON file produced by pangraph, I can run the command for you. Just note that sharing this file is equivalent to sharing the genomes, since they can be reconstructed from the graph, so if this is data you'd rather not share we can find another solution.

brigidar commented 2 months ago

I removed the plasmids from the genomes I sequenced, but I didn't check the ones from NCBI. That might be the issue. I will remove them, try again and see if that solves the problem. I can also try running just the export locally; on my machine the earlier steps were failing due to memory. Thanks a lot. I will keep you posted.

brigidar commented 2 months ago

The plasmids in the references were the issue. Thanks for helping out!