Closed brigidar closed 2 months ago
Hi @brigidar, thank you for your feedback. I just tried re-running the steps from the tutorial with the E. coli genomes, and after the export command this is the structure of the folder I obtain:
ecoli_export
└── vis
    ├── coreGenomeTree.json
    ├── core_na_reduced.fa.gz
    ├── geneCluster
    ├── geneCluster.json
    ├── metaConfiguration.js
    ├── metainfo.tsv
    ├── strainMetainfo.json
    └── strain_tree.nwk
So at least on these data it seems to work as expected. Note that the strain_tree.nwk file is not in the geneCluster folder but in the parent vis folder.
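A quick way to compare an export against the listing above is to check for each entry by name. This is only a sketch: the expected names are copied from the E. coli listing in this comment, and the function name `missing_export_entries` is made up for illustration. Other datasets or pangraph versions may produce a different set of files.

```python
from pathlib import Path

# Entries copied from the E. coli listing above. ASSUMPTION: the same set is
# expected for other datasets; adjust the list if your export differs.
EXPECTED_IN_VIS = [
    "coreGenomeTree.json",
    "core_na_reduced.fa.gz",
    "geneCluster",
    "geneCluster.json",
    "metaConfiguration.js",
    "metainfo.tsv",
    "strainMetainfo.json",
    "strain_tree.nwk",
]


def missing_export_entries(export_dir):
    """Return the expected entries that are absent from <export_dir>/vis."""
    vis = Path(export_dir) / "vis"
    return [name for name in EXPECTED_IN_VIS if not (vis / name).exists()]
```

Calling `missing_export_entries("ecoli_export")` returns an empty list when every entry is present, and a list such as `["strain_tree.nwk"]` when only the tree is missing.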
If this is still not working on your side, could you share your data? That way I can test on it and see at what point it fails.
Best, Marco
Hi Marco, when I run the example data it works fine. I think it might be related to the polish command. I tried to do it from scratch and now it gets killed at that step, and the initial build file is bigger than before. Is it necessary to run polish to get the strain_tree.nwk, or can I just export directly after the build?
Hi brigidar, yes, that step can be skipped. It's only used to further improve the quality of the multiple sequence alignments, and for 15 genomes it might not be needed. Another reason I can think of for the missing tree is that pangraph cannot find any core genome. If your isolates are so diverse that pangraph cannot find any single-copy block present in all of them, then it cannot generate a core-genome alignment and build the tree, but I don't think this should be the case...
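One way to check the "single-copy block present in everyone" condition by hand is to count block occurrences per path in the graph JSON. The sketch below is not pangraph's own logic. It assumes the JSON has a top-level `paths` list and that each path has a `blocks` list whose entries carry an `id`. Those key names are an assumption and may differ between pangraph versions. The function names are hypothetical.

```python
import json
from collections import Counter


def core_block_ids(graph):
    """Return ids of blocks that occur exactly once in every path.

    ASSUMPTION: `graph` is a dict with a top-level "paths" list, and each
    path has a "blocks" list whose entries carry an "id". Adjust the keys
    if your pangraph JSON is laid out differently.
    """
    core = None
    for path in graph["paths"]:
        counts = Counter(block["id"] for block in path["blocks"])
        single_copy = {bid for bid, n in counts.items() if n == 1}
        # Keep only blocks that are single-copy in every path seen so far.
        core = single_copy if core is None else core & single_copy
    return core if core is not None else set()


def load_graph(json_path):
    """Read a graph JSON file into a plain dict."""
    with open(json_path) as handle:
        return json.load(handle)
```

If `core_block_ids(load_graph(...))` returns an empty set for your graph, that would be consistent with the no-core-genome explanation. A non-empty set points to a different cause.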
So I tried again and I have the same issue. When I look at the graph in Bandage it looks very similar to the E. coli one. There are a lot of loops that are shared by all 15 genomes. I'm not sure how I could assess whether there is enough shared sequence to generate the core-genome alignment. Is there a flag I can look for? I am attaching the graph I am getting.
Indeed, this looks like a typical bacterial genome graph. Another possibility that comes to mind is whether fasttree is installed and working properly, see optional dependencies. You could try typing fasttree -help in the environment where you're running pangraph to see if this command is maybe missing from the PATH.
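The same PATH check can be scripted, which helps when pangraph runs inside a job or environment that differs from your interactive shell. This is a minimal sketch using Python's standard `shutil.which`. The helper name `find_executable` and the list of candidate names are assumptions, since the binary may be installed under either spelling depending on how it was packaged.

```python
import shutil


def find_executable(candidates, path=None):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        location = shutil.which(name, path=path)
        if location is not None:
            return location
    return None


# ASSUMPTION: the binary may be installed as "fasttree" or "FastTree",
# depending on how it was packaged.
FASTTREE_NAMES = ["fasttree", "FastTree"]
```

`find_executable(FASTTREE_NAMES)` returning `None` suggests the tool is not visible in that environment, even if it works elsewhere on the same machine.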
Alternatively, if you're using publicly available data, you could give me the accession numbers of your genomes and I can try to reproduce the issue on my side.
Thanks for taking the time to test this issue!
I have a mix of deposited and sequenced data. I am attaching the accession numbers of the deposited ones. I did not try to run it just with them, so not sure how it looks if it's just those 10 genomes. The fasttree shows up as running just fine. I am running it on a server that has some write and read restrictions so not sure if at some point that could cause a problem. I could not find a way to make it verbose to see where it might be failing. I can also share the json file if that might help through box.
GCA_000706685.1 | ASM70668v1
GCA_000709415.1 | ASM70941v1
GCA_000953575.1 | Staphylococcus xylosus C2a
GCA_002078255.1 | ASM207825v1
GCA_007814625.1 | ASM781462v1
GCA_020229655.1 | ASM2022965v1
GCA_020229695.1 | ASM2022969v1
GCA_020229715.1 | ASM2022971v1
GCA_023716225.1 | ASM2371622v1
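On the two open questions in this comment (write restrictions on the server, and no obvious verbose mode), a small wrapper can at least show whether the output directory is writable and what a command printed to stderr before it stopped. This is a generic sketch and uses no pangraph-specific options. The helper names `can_write` and `run_and_report` are hypothetical, and the command is whatever you already run.

```python
import subprocess
import tempfile


def can_write(directory):
    """Try to create (and automatically delete) a temporary file in `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def run_and_report(command):
    """Run `command` (a list of strings) and return (exit_code, stderr_text).

    On POSIX, a negative exit code -N means the process was ended by
    signal N, for example -9 for SIGKILL.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stderr
```

An exit code of -9 for a step that "gets killed" would indicate the process received SIGKILL. On shared servers that is often, though not always, the result of a memory limit.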
I am trying to create the pangenome from 15 genomes of S. xylosus. I can create all the files following the tutorial, but when I do the export to generate files for panX I am missing the strain_tree.nwk file. I have all the files in the geneCluster folder. Is there an additional step that needs to be done to get the tree? Do I need to add metadata to get it?