thomasp85 / FindMyFriends

Fast alignment-free pangenome creation and exploration
27 stars 6 forks source link

cdhitGrouping fails - Fatal Error: invalid clstr #9

Closed thackl closed 8 years ago

thackl commented 8 years ago

Hi,

I am trying to build a pan-genome from 193 bacterial strains, with increased sensitivity in cdhitGrouping() to account for high diversity in the set. However, The grouping fails with a fatal error. I am using the latest development version due to #8.

> cdhitGrouping(pg, kmerSize=4, lowerLimit=.3, maxLengthDif=.1)
Preclustering |==================================================| 326413/326413
Preclustering resulted in 99985 gene groups (1 minute and 56.891 seconds elapsed)
Grouping      |==================================================| 99984/99984
Grouping      |==================================================| 94925/94925
Grouping      |==================================================| 74635/74635
Grouping      |==================================================| 62656/62656
Grouping      |==================================================| 53042/53042
Grouping      |==================================================| 46524/46524
Grouping      |==================================================| 40786/40786
Grouping      |==================================================| 35662/35662
Grouping      |==================================================| 31360/31360
Grouping      |==================================================| 27763/27763
Grouping      |==================================================| 25043/25043
Error in eval(substitute(expr), envir, enclos) : 
Fatal Error:
%s
Program halted !!

invalid clstr
In addition: Warning messages:
1: In split.default(seq_len(nGenes(object)), unlist(groups)) :
  data length is not a multiple of split variable
2: In split.default(groups, groupsGroups) :
  data length is not a multiple of split variable
3: In split.default(groups, groupsGroups) :
  data length is not a multiple of split variable

Grouping with default settings works, but also issues the warnings - not sure if thats related... Regardless, the number of clusters is much larger than I'd like it to be, i.e. many core orthologs are split in multiple subgroups group - hence my attempt to increase sensitivity.

cdhitGrouping(pg)
Preclustering |==================================================| 326413/326413
Preclustering resulted in 99986 gene groups (37.099 seconds elapsed)
Grouping      |==================================================| 99985/99985
Grouping      |==================================================| 94907/94907
Grouping      |==================================================| 74616/74616
Grouping      |==================================================| 62707/62707
Grouping      |==================================================| 54564/54564
Grouping      |==================================================| 47993/47993
Grouping      |==================================================| 42191/42191
Grouping      |==================================================| 37075/37075
Grouping      |==================================================| 32750/32750
Grouping resulted in 29087 gene groups (3 minutes and 2.29 seconds elapsed)
Total time elapsed was 3 minutes and 39.393 seconds
An object of class pgFullLoc

The pangenome consists of 326414 genes from 193 organisms
29087 gene groups defined

     Core|=
Accessory|==================:
Singleton|==============================:

Genes are translated
Warning messages:
1: In split.default(seq_len(nGenes(object)), unlist(groups)) :
  data length is not a multiple of split variable
2: In split.default(groups, groupsGroups) :
  data length is not a multiple of split variable

Any help is highly appreciated! Thomas

thomasp85 commented 8 years ago

Would it be possible to share the data? In that way it is much easier to debug...

thackl commented 8 years ago

I can totally see your point. Not sure though if sharing is possible at this point - quite a few people involved - will get back to you on that asap.

thackl commented 8 years ago

Ok, I was able to reproduce the issue with a reduced data set, comprising sequences I am more comfortable with sharing. Still not really public data though. Do you have an email address to send the download link to?

thomasp85 commented 8 years ago

Sure - just use the one in the DESCRIPTION

thackl commented 8 years ago

K, you should have gotten a gdrive sharing invite. Hope you can figure something out. Thanks for taking the time!

thomasp85 commented 8 years ago

I've gotten it - I'll have a look and see if I can reproduce it during this week

thomasp85 commented 8 years ago

So it seems the CD_Hit algorithm will not accept thresholds below 0.4, which is the cause of your error message. I was unaware of this and will probably add a check for this in the future...

That said 0.4 is quite a low similarity already...

As for your expected number of core genes - FindMyFriends is more conservative in grouping genes than other algorithms as it rigorously checks for a matching chromosomal neighbourhood as well as avoids grouping fragments together with functional genes.

If the number seems way of, please investigate wether the geneLocation matrix order matches with the order of the genes in the pg object. If there's a mismatch it will have huge consequences...

thackl commented 8 years ago

That makes a lot of sense. Should have figured that out myself, ran into a similar issue with cdhit a while back...

Yeah, the rather strict setup of FindMyFriends probably renders it a suboptimal choice for my data set. The gene clusters themselves are diverse, but what is probably worse, most of the data are single cell genomes, i.e. incomplete and split into several contigs. I'm pretty sure this makes gene neighborhood assessment less reliable and fragmented genes more likely...

But I very much liked the design of your software and at least wanted to give it a try. Thanks a lot for taking the time and looking into the error.

Cheers Thomas

thomasp85 commented 8 years ago

One possibility would be to use kmerSplit() rather than neighborhoodSplit() as it ignores gene location at the expense of less sensitivity.

Further, if you want to use the utilities of FindMyFriends but can't use the grouping algorithm on your data you can always perform the grouping in some other software and import that using manualGrouping()