Open melop opened 1 month ago
I have seen this issue before. It can be caused by a couple of different (but rare) situations, typically when a single gene is assigned to a very large number of blocks (maybe the plants you are working with have gone through a lot of WGDs?), yet ploidy is set to 1. Is it possible that you have some genomes that are flagged as ploidy = 1 in init_genespace but have several WGDs on the branch to other genomes in the phylogeny? If this is the case, try resetting the ploidy parameter to accurately reflect the WGDs in the phylogeny. If not, then I don't have a good solution to this problem, but we can probably get what you need out of the software. Are you just hoping to get a riparian plot out of it, or do you want the pan-gene sets too?
I am hoping to use the pan-gene sets too. The situation is that WGDs definitely happened, but the plants then went through rediploidization.
I see. It might be worth reading the very last methods section in the GENESPACE paper, which details why using deeply diverged genomes with histories of nested whole genome duplications can be challenging. The take-home is that even if your genomes have apparently diploidized, they almost always contain syntenic blocks from both homeologs. So, you have to set your ploidy to reflect the WGD history. For example, if you compare Arabidopsis to common bean, which are both diploid species with haploid assemblies, you would treat common bean as 2x (one WGD) and Arabidopsis as 4x (two nested WGDs) for syntenic comparisons. Indeed, you get 2x - 4x dotplots in this comparison.
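To make the arithmetic concrete, here is a minimal Python sketch of the rule of thumb above: each WGD doubles the expected syntenic copy number. This is not part of GENESPACE; `expected_ploidy` and the `wgd_counts` table are hypothetical names for illustration, and the WGD counts are the ones from the common bean / Arabidopsis example. The resulting values are what you would then supply as the ploidy for each genome.

```python
def expected_ploidy(n_wgd: int) -> int:
    """Expected syntenic copy number for a haploid assembly.

    Assumption: n_wgd counts the WGDs on the genome's branch relative to
    the other genomes in the run; each WGD doubles the copy number.
    """
    if n_wgd < 0:
        raise ValueError("number of WGDs cannot be negative")
    return 2 ** n_wgd


# WGD counts taken from the example in this thread (illustrative only).
wgd_counts = {"common_bean": 1, "arabidopsis": 2}

ploidy = {genome: expected_ploidy(n) for genome, n in wgd_counts.items()}
print(ploidy)  # {'common_bean': 2, 'arabidopsis': 4}
```

A genome with no WGDs on its branch comes out as 1x, which is the "truly 1x" case.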
In your case, I bet there are no genomes that are truly 1x relative to all the others. This makes the underlying graph structure very complex and causes this particular error. I have yet to be able to recreate it myself, but this is likely because I haven't tried a run as complex as yours. So, long story short, I won't be able to resolve this issue. I'd suggest either running GENESPACE on smaller subsets of genomes, or including a genome that is truly 1x in the run and giving the ploidy as the phylogenetically expected copy number given the WGDs in your set of genomes.
Oh, I also didn't see this note: Aquifoliales_Ilex_paraguariensis, Escalloniales_Escallonia_herrerae have < 75% of genes on chromosomes that contain > 10 genes. Synteny is not a useful metric for these genomes. Be very careful with your pan-gene sets. Camellia_lanceoleosa
... do you have some genomes that are not scaffolded?
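If it helps to screen genomes before a run, the check described in that note can be approximated with a short Python sketch. This is not GENESPACE's own code; it only mirrors the thresholds quoted in the warning (< 75% of genes on sequences with > 10 genes). The function names and the toy input are hypothetical, and the input is assumed to be one chromosome/scaffold ID per gene (for example, the first column of a gene BED file).

```python
from collections import Counter


def frac_genes_on_large_seqs(seq_ids, min_genes=10):
    """Fraction of genes located on sequences carrying more than min_genes genes."""
    counts = Counter(seq_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    on_large = sum(n for n in counts.values() if n > min_genes)
    return on_large / total


def poorly_scaffolded(genomes, min_frac=0.75, min_genes=10):
    """Names of genomes whose fraction falls below min_frac."""
    return sorted(
        name
        for name, seq_ids in genomes.items()
        if frac_genes_on_large_seqs(seq_ids, min_genes) < min_frac
    )


# Toy data: one sequence ID per gene (made up for illustration).
genomes = {
    "scaffolded": ["chr1"] * 60 + ["chr2"] * 40,
    "fragmented": ["chr1"] * 20 + [f"ctg{i}" for i in range(80)],
}
print(poorly_scaffolded(genomes))  # ['fragmented']
```

Genomes flagged this way are candidates to drop or re-scaffold before relying on synteny.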
Thank you for the explanation! It seems that these plants underwent many rounds of WGD, and indeed most of these genomes are probably ancient tetraploids of some sort. That makes sense. I noticed these notes too. After removing the three species that are unscaffolded, the run was OK.
Good to hear!
Thank you for developing GENESPACE! I have been using it in many projects. Previously, a run on a similar dataset was fine, but after I added another species the following error happened:
`############################
These are 41 eudicot genomes that have a pretty deep divergence. Thank you for any pointers.