jtlovell / GENESPACE

Error in step 8 of run_genespace() #171

Open melop opened 1 month ago

melop commented 1 month ago

Thank you for developing GENESPACE! I have been using it in many projects. An earlier run on a similar dataset worked fine, but after I added another species, the following error occurred:

```
############################
1. Constructing syntenic pan-gene sets ...
WARNING: genomes Aquifoliales_Ilex_paraguariensis, Escalloniales_Escallonia_herrerae have < 75% of genes on chromosomes that contain > 10 genes. Synteny is not a useful metric for these genomes. Be very careful with your pan-gene sets.
Camellia_lanceoleosa : Error in vecseq(f, len, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Calls: run_genespace ... merge -> merge.data.table -> [ -> [.data.table -> vecseq
Execution halted
```

These are 41 eudicot genomes that have a pretty deep divergence. Thank you for any pointers.

LovellHAGSC commented 1 month ago

I have seen this issue before. It can be caused by a couple of different (but rare) situations, typically when a single gene is assigned to a very large number of blocks (maybe the plants you are working with have gone through a lot of WGDs?) while ploidy is set to 1. Is it possible that some of your genomes are flagged as ploidy = 1 in init_genespace but have several WGDs on the branch to other genomes in the phylogeny? If so, try resetting the ploidy parameter to accurately reflect the WGDs in the phylogeny. If not, I don't have a good solution to this problem, but we can probably still get what you need out of the software. Are you just hoping to get a riparian plot out of it, or do you want the pan-gene sets too?
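
To make the failure mode concrete: in a merge, every key contributes (rows with that key on one side) × (rows with that key on the other side), so a gene that appears in many blocks on both sides can push the result past the 2^31-row limit quoted in the error. The following is a minimal Python sketch of that arithmetic only. It is not GENESPACE or data.table code, and the function name `join_row_count` and the toy keys are hypothetical.

```python
from collections import Counter

# Row limit quoted in the data.table error message above.
MAX_ROWS = 2**31

def join_row_count(left_keys, right_keys):
    """Number of rows an inner join on a single key column would produce."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each shared key contributes (count on the left) * (count on the right).
    return sum(n * right[key] for key, n in left.items() if key in right)

# Toy case (made-up numbers): one gene assigned to many blocks on both sides.
left_keys = ["geneA"] * 50_000 + ["geneB"]
right_keys = ["geneA"] * 50_000 + ["geneB"]

rows = join_row_count(left_keys, right_keys)
print(rows, rows > MAX_ROWS)
```

Counting key multiplicities like this on the two tables being merged is a cheap way to see which keys dominate the blow-up before attempting the join itself.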

melop commented 1 month ago

I am hoping to use the pan-gene sets too. The situation is that WGDs definitely happened, but the plants then went through rediploidization.

LovellHAGSC commented 1 month ago

I see. It might be worth reading the very last methods section of the GENESPACE paper, which details why deeply diverged genomes with histories of nested whole-genome duplications can be challenging. The take-home is that even if your genomes have apparently diploidized, they almost always contain syntenic blocks from both homeologs, so you have to set ploidy to reflect the WGD history. For example, if you compare Arabidopsis to common bean, which are both diploid species with haploid assemblies, you would treat common bean as 2x (one WGD) and Arabidopsis as 4x (two nested WGDs) for syntenic comparisons. Indeed, you get 2x - 4x dotplots in this comparison.

[image attached]

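
The common bean / Arabidopsis example (one WGD gives 2x, two nested WGDs give 4x) suggests a simple doubling rule for the expected syntenic copy number. Below is a small Python sketch of that arithmetic; `expected_ploidy` is a hypothetical helper, the doubling rule is an extrapolation from the two cases given in this thread, and the resulting values would still have to be supplied by hand as the ploidy setting in init_genespace.

```python
def expected_ploidy(n_wgd):
    """Expected syntenic copy number after n_wgd nested duplications.

    Assumption: every event is a doubling. A whole-genome triplication
    would need a factor of 3 for that event instead.
    """
    if n_wgd < 0:
        raise ValueError("n_wgd must be non-negative")
    return 2 ** n_wgd

# WGD counts taken from the example in the comment above.
wgd_counts = {"common_bean": 1, "arabidopsis": 2}
ploidy = {genome: expected_ploidy(n) for genome, n in wgd_counts.items()}
print(ploidy)
```

With these inputs the dictionary holds 2 for common bean and 4 for Arabidopsis, matching the 2x - 4x comparison described above.
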
LovellHAGSC commented 1 month ago

In your case, I bet there are no genomes that are truly 1x relative to all the others. This makes the underlying graph structure very complex and causes this particular error. I have yet to be able to recreate it myself, but that is likely because I haven't tried a run as complex as yours. So, long story short, I won't be able to resolve this issue. I'd suggest either running GENESPACE on smaller subsets of genomes, or including a genome that is truly 1x in the run and giving the ploidy as the phylogenetically expected copy number given the WGDs in your set of genomes.

LovellHAGSC commented 1 month ago

Oh, I also missed this note earlier: "Aquifoliales_Ilex_paraguariensis, Escalloniales_Escallonia_herrerae have < 75% of genes on chromosomes that contain > 10 genes. Synteny is not a useful metric for these genomes. Be very careful with your pan-gene sets. Camellia_lanceoleosa ..." Do you have some genomes that are not scaffolded?
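
A check along the lines of that warning can be approximated before starting a run. The sketch below is Python and only guesses at the logic implied by the message: the fraction of genes on sequences that carry more than 10 genes, compared against 75%. The function names, the thresholds as arguments, and the toy data are assumptions, not GENESPACE internals.

```python
from collections import Counter

def fraction_on_large_seqs(gene_seqs, min_genes=10):
    """Fraction of genes on sequences that carry more than min_genes genes.

    gene_seqs: list with one chromosome/scaffold name per gene.
    """
    if not gene_seqs:
        return 0.0
    per_seq = Counter(gene_seqs)
    on_large = sum(n for n in per_seq.values() if n > min_genes)
    return on_large / len(gene_seqs)

def synteny_is_questionable(gene_seqs, min_fraction=0.75, min_genes=10):
    """True if too few genes sit on gene-rich sequences (assumed thresholds)."""
    return fraction_on_large_seqs(gene_seqs, min_genes) < min_fraction

# Toy genomes: one chromosome-scale, one fragmented into 5-gene scaffolds.
scaffolded = ["chr1"] * 900 + ["chr2"] * 95 + ["scaf_9"] * 5
fragmented = [f"scaf_{i}" for i in range(200) for _ in range(5)]

print(synteny_is_questionable(scaffolded))
print(synteny_is_questionable(fragmented))
```

Running something like this on each genome's gene-to-sequence assignments would flag unscaffolded assemblies up front, so they can be dropped or handled separately.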

melop commented 1 month ago

Thank you for the explanation! It seems these plants underwent many rounds of WGD, and most of these genomes are indeed probably ancient tetraploids of some sort, so that makes sense. I noticed those notes too. After removing three species that are unscaffolded, the run completed fine.

LovellHAGSC commented 3 weeks ago

Good to hear!