Error when ignoring multiple genomes

hanneskramml commented 1 year ago

Hi John,

Is the program designed to support more than one outgroup / genome that can be ignored?

To study polyploid CAM plants, I'd like to perform an orthofinder run on selected C3/CAM species in all angiosperms, but use Genespace only for the eudicots. Therefore, I'd like to exclude all monocots, but the following error occurs as soon as I specify more than one genome via ignoreTheseGenomes:

GENESPACE v1.1.4: synteny and orthology constrained comparative genomics

Checking Working Directory ... PASS: `pangenome`
Checking user-defined parameters ...
        Genome IDs & ploidy ... Error in if (is.na(outgroup)) { : the condition has length > 1

Calls: init_genespace
In addition: Warning message:
In !is.na(outgroup) && any(outgroup %in% genomeIDs) :
  'length(x) = 5 > 1' in coercion to 'logical(1)'

Execution halted

Many thanks in advance!

Best, Hannes

jtlovell commented 1 year ago

it should be able to take many outgroups. this is a bug that I will address.

That said, adding diverse genomes to your orthofinder run can have major consequences for what genes are orthologs. For example, keeping a monocot and a dicot will mean that all gene families that diverged as a result of the ancient eudicot WGT event will be orthologs.

I should have an updated release today/tonight.

hanneskramml commented 1 year ago

Thanks a lot. Really appreciate!

jtlovell commented 1 year ago

OK - I pushed v1.1.9, which should have this fix. It worked on the example data and one other set. Let me know how it goes. Once I have confirmation that this works for you (and the other bug fixes work), I'll make a release.

Please update your install with:

detach("package:GENESPACE", unload = TRUE)
devtools::install_github("jtlovell/GENESPACE", upgrade = F)
library(GENESPACE)

hanneskramml commented 1 year ago

Thanks! Seems to be fixed. However, in the meantime I also did a run for the eudicots only and the following error occurs (v1.1.8):

5. Building synteny-constrained orthogroups ...
        ##############
        Running Orthofinder within syntenic regions
        ...Clusia_rosea v. Clusia_rosea:                                   n syn / syn OG / inblk OG hits = 395343 / 323824 / 329659
        ...Clusia_minor v. Clusia_minor:                                   n syn / syn OG / inblk OG hits = 251832 / 215687 / 219885
        ...Clusia_rosea v. Clusia_minor:                                   n syn / syn OG / inblk OG hits = 244054 / 177129 / 186262
        ...Clusia_rosea v. Clusia_multiflora:                              n syn / syn OG / inblk OG hits = 162815 / 122694 / 129447
        ...Clusia_minor v. Clusia_multiflora:                              n syn / syn OG / inblk OG hits = 126516 / 98548 / 104762
        ...Clusia_multiflora v. Clusia_multiflora:                         n syn / syn OG / inblk OG hits = 125509 / 109090 / 110949
        ...Clusia_rosea v. Kalanchoe_laxiflora:                            Error in rbindlist(mclapply(1:nrow(ofDirs), mc.cores = nCores, function(k) { : 
  Item 1 of input is not a data.frame, data.table or list
Calls: run_genespace ... with.default -> eval -> eval -> ofInBlk_engine -> rbindlist
In addition: Warning message:
In mclapply(1:nrow(ofDirs), mc.cores = nCores, function(k) { :
  all scheduled cores encountered errors in user code
Execution halted

Hope that tells you something. I will also rerun it with the new version...

jtlovell commented 1 year ago

This won't be fixed going from v1.1.8--.9 I'm not sure what the issue is ... my guess is there isn't much synteny between Clusia and Kalanchoe. Would you mind reaching out via email (jlovell@hudsonalpha.org)? I may need your input files to replicate the error. Thanks! John

jtlovell commented 1 year ago

also, you can always set init_genespace(..., orthofinderInBlk = FALSE) ... that should run through no prob.

jtlovell commented 1 year ago

Alright - you can try to update to v1.1.10 and see if there is a more informative error. This will tell you if a pair of genome do not have enough synteny, and if so, won't try to do inBlk orthofinder.

hanneskramml commented 1 year ago

Amazingly nice. You're totally right, they have almost no synteny and get skipped in your new release. Thanks a lot for your effort!

On the other hand, my polyploid genomes of interest (Clusia) are no longer running orthofinder either. This is very likely due to their assemblies. I've only resolved one of them at chromosome level - and just the haploid representation but am using the full diploid(ized) genome as input. The other two are polyploid draft versions, largely at contig level. Any chance to deal with that, e.g. set/adjust the thresholds?

Think I'd greatly benefit from your timely method to examine functional variants within syntenic orthologs. Thus, I'm mainly interested in the pan-genome framework rather than riparian plotting on a macro-synteny scale. However, since most reference species are not closely related, perhaps I should split the process and run Genespace only with a more synteny-based species selection...

I will also send you the logs, as there was an error at the very end when building phased blocks. Thanks again, John!

7. Final block coordinate calculation and riparian plotting ...
        ##############
        Calculating syntenic blocks by reference chromosomes ... 
                n regions (aggregated by 25 gene radius): 170389
                n blocks (collinear sets of > 5 genes): 186409
        **WARNING**: genomes Kalanchoe_fedtschenkoi,
                Kalanchoe_laxiflora, Sedum_album have > 100 chromosomes
                in the synteny map. This is too complex to make
                riparian plots.
        ##############
        Building ref.-phased blks and riparian plots for haploid genomes:
                Mesembryanthemum_cristallinum: 117528 phased blocks
Error in hclust(di, method = "ward.D2") : 
  must have n >= 2 objects to cluster
Calls: run_genespace ... pull_synChrOrd -> [ -> [.data.table -> ct_clust -> hclust
Execution halted

jtlovell commented 1 year ago

OK - thanks for the heads up. I pushed a patch that may work ... although it may not. I think you are getting to the limits of what GENESPACE can handle here. If you could try to run again, installed from master, and lmk if you get the error, that would be great. Regardless, I think the approach you propose is a good one - feel free to use orthofinder across the whole set, but only use GENESPACE for distinct subsets that should have synteny.

hanneskramml commented 1 year ago

The previous error seems to be fixed but both runs produce new ones. At least the dataset for eudicots made it to the pan-genome construction. Pls. find the logs attached. In the meantime, I'll try the proposed approach, but would be very grateful if you could provide a way to rerun orthofinder within syntenic regions for the polyploid Clusia genomes, e.g. via a config to manually set the filter to less than 50%. Thanks a lot!!

Pangenome/angiosperms

7. Final block coordinate calculation and riparian plotting ...
        ##############
        Calculating syntenic blocks by reference chromosomes ... 
                n regions (aggregated by 25 gene radius): 177429
                n blocks (collinear sets of > 5 genes): 193633
        **WARNING**: genomes Kalanchoe_fedtschenkoi,
                Kalanchoe_laxiflora, Sedum_album have > 100 chromosomes
                in the synteny map. This is too complex to make
                riparian plots.
        ##############
        Building ref.-phased blks and riparian plots for haploid genomes:
                Mesembryanthemum_cristallinum: 123146 phased blocks
                Arabidopsis_thaliana         : 85280 phased blocks
                Manihot_esculenta            : 199009 phased blocks
Error in rbindlist(mclapply(synhitFiles, mc.cores = nCores, function(j) { : 
  Item 10 of input is not a data.frame, data.table or list
Calls: run_genespace -> plot_riparian -> phase_blks -> rbindlist
In addition: Warning messages:
1: In `[.data.table`(chro, , `:=`(synClus = ct_clust(ord = meanPos,  :
  Group 11 column 'synClus': 0.502146 (type 'double') at RHS position 1 truncated (precision lost) when assigning to type 'integer'
2: In mclapply(synhitFiles, mc.cores = nCores, function(j) { :
  scheduled core 10 encountered error in user code, all values of the job will be affected
Execution halted

Eudicots

8. Constructing syntenic pan-gene sets ...
        **WARNING**: genomes Sedum_album have < 75% of genes on
                chromosomes that contain > 10 genes. Synteny is not a
                useful metric for these genomes. Be very careful with
                your pan-gene sets.
        Clusia_multiflora            : Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Calls: run_genespace ... merge -> merge.data.table -> [ -> [.data.table -> vecseq
In addition: Warning messages:
1: In `[.data.table`(chro, , `:=`(synClus = ct_clust(ord = meanPos,  :
  Group 10 column 'synClus': 0.502439 (type 'double') at RHS position 1 truncated (precision lost) when assigning to type 'integer'
2: In `[.data.table`(chro, , `:=`(synClus = ct_clust(ord = meanPos,  :
  Group 10 column 'synClus': 0.505747 (type 'double') at RHS position 1 truncated (precision lost) when assigning to type 'integer'
Execution halted

jtlovell commented 1 year ago

Re: 50% synteny threshold ... GENESPACE should only be used when you have very good synteny between genomes. Otherwise, you'll be much better served just using raw OrthoFinder output. Synteny constraints will be problematic and produce erroneous results when the majority of the genome is in tiny collinear segments. Its for this reason that we institute the threshold.

Re: Eudictots error. Which genomes did you use there? That looks like a bug that I should address.

hanneskramml commented 1 year ago

Hey john, totally got your point. But I have already found a setup to successfully identify ohnologous homeologs in my rediploidized allotetraploid genome using genespace. However, for some queried pangenes I get tons of entries where there should be only one. I guess this is due to their evolutionary highly disrupted regions. But these are exactly the genes of major interest. So, I gonna debug this myself because I think rerunning orthofinder in these syntenic regions might help here. And I'm just 2% below the threshold (=> many allelic contigs besides haploid pseudo-chromosomes result in a low percentage).

Regarding eudicots, the error was resolved after a clean rerun. Thanks!

jtlovell commented 1 year ago

ok. sounds good. if you query_pangenes(..., showArrayMem = FALSE, showNSOrtho = FALSE) do you still get all the huge gene fams? If so, then, yeah, I think its a bug. Otherwise, its likely that you have some big gene families due to your deep evolutionary scale of the orthofinder run.

hanneskramml commented 1 year ago

Niice. Didn't consider that. Now it works. Thanks a lot, John!

jtlovell commented 1 year ago

Oh sweet! just be careful with those results ... it chooses the physically most central (and longest as a tiebreaker) gene in the collinear orthogroup. This may or may not be the gene you are looking for.

Thanks for your help reporting the bugs above. I'll close this, but re-open if anything comes up.

hanneskramml commented 1 year ago

Hi John,

the following conditions do not meet the multiple outgroup requirement:

run_genespace:134   ps <- all(gsParam$genomeIDs %in% spids) && all(spids %in% gsParam$genomeIDs)
run_orthofinder:25  if(!is.na(gsParam$outgroup)){...}

Thanks for fixing! Best, Hannes

jtlovell commented 1 year ago

so you get an error when running orthofinder? Makes sense. I can't push a patch today - I have several other updates in progress. It will likely be next week.

There is a way around it though ... although a bit hacky: get the orthofinder run completed without using outgroups. Just include all the genomes you want as genomeIDs in the init_genespace call. Once the orthofinder run is done and GENESPACE gets moving, cancel the run, and add the outgroups to the init_genespace call. Then, when you re-run run_genespace, it will find the full orthofinder results and ignore the genomes flagged in outgroups.

I'll tag this issue when updating. Probably mid next week.

hanneskramml commented 1 year ago

No rush, I already had a workaround but still wanted to report the bugs for future releases...

jtlovell commented 1 year ago

are you still getting this error with the newest (v1.3.0) release?

jtlovell / GENESPACE

Error when ignoring multiple genomes #82