thackl closed this issue 8 years ago
Would it be possible to share the data? In that way it is much easier to debug...
I can totally see your point. Not sure though if sharing is possible at this point - quite a few people involved - will get back to you on that asap.
Ok, I was able to reproduce the issue with a reduced data set, comprising sequences I am more comfortable with sharing. Still not really public data though. Do you have an email address to send the download link to?
Sure - just use the one in the DESCRIPTION
K, you should have gotten a gdrive sharing invite. Hope you can figure something out. Thanks for taking the time!
I've gotten it - I'll have a look and see if I can reproduce it during this week
So it seems the CD-HIT algorithm will not accept thresholds below 0.4, which is the cause of your error message. I was unaware of this and will probably add a check for it in the future...
That said 0.4 is quite a low similarity already...
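For anyone hitting the same error before such a check exists, here is a minimal sketch of a guard you could run on your thresholds first. checkCdhitThreshold() is a hypothetical helper, not part of FindMyFriends, and the 0.4 floor is simply the limit described above.

```r
# Hypothetical helper, not part of FindMyFriends: validate similarity
# thresholds before handing them to a CD-HIT based grouping step.
checkCdhitThreshold <- function(threshold, minAllowed = 0.4) {
  if (!is.numeric(threshold) || anyNA(threshold)) {
    stop("threshold must be numeric and non-missing")
  }
  if (any(threshold > 1)) {
    stop("threshold must not exceed 1")
  }
  tooLow <- threshold < minAllowed
  if (any(tooLow)) {
    stop(sprintf(
      "CD-HIT does not accept thresholds below %.1f (got: %s)",
      minAllowed, paste(threshold[tooLow], collapse = ", ")
    ))
  }
  # Return the input invisibly so the check can be used inline
  invisible(threshold)
}

checkCdhitThreshold(c(0.9, 0.5))   # passes silently

# A value below the floor is turned into a readable error message
msg <- tryCatch(checkCdhitThreshold(0.3),
                error = function(e) conditionMessage(e))
```

Failing early like this gives a clear message instead of the fatal error from the grouping step.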
As for your expected number of core genes - FindMyFriends is more conservative in grouping genes than other algorithms, as it rigorously checks for a matching chromosomal neighbourhood and avoids grouping fragments together with functional genes.
If the number seems way off, please investigate whether the geneLocation matrix order matches the order of the genes in the pg object. If there's a mismatch it will have huge consequences...
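A rough way to check the ordering, sketched in base R. This assumes the location table carries gene names as row names, which may not hold for your data. geneIds and geneLoc are hypothetical stand-ins for the gene names in the pg object and the location table, and both helper functions are made up for illustration.

```r
# Hypothetical stand-ins for the gene names stored in the pangenome object
# and for the table of gene locations supplied alongside it.
geneIds <- c("geneA", "geneB", "geneC")
geneLoc <- data.frame(
  contig = c("c1", "c2", "c1"),
  start  = c(10L, 5L, 400L),
  end    = c(300L, 250L, 900L),
  strand = c(1L, -1L, 1L),
  row.names = c("geneA", "geneC", "geneB")  # deliberately out of order
)

# TRUE only if row i of loc describes the gene at position i in ids
locationOrderMatches <- function(ids, loc) {
  identical(as.character(ids), rownames(loc))
}

# Reorder loc to follow ids; fails loudly if the two sets of names differ
alignLocation <- function(ids, loc) {
  idx <- match(ids, rownames(loc))
  if (anyNA(idx) || length(ids) != nrow(loc)) {
    stop("gene names and location rows do not describe the same genes")
  }
  loc[idx, , drop = FALSE]
}

before   <- locationOrderMatches(geneIds, geneLoc)   # mismatch detected
fixedLoc <- alignLocation(geneIds, geneLoc)
after    <- locationOrderMatches(geneIds, fixedLoc)  # order now agrees
```

If the names are not available on the location table, the same idea applies to whatever identifier both sides share.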
That makes a lot of sense. Should have figured that out myself, ran into a similar issue with cdhit a while back...
Yeah, the rather strict setup of FindMyFriends probably renders it a suboptimal choice for my data set. The gene clusters themselves are diverse, but what is probably worse, most of the data are single cell genomes, i.e. incomplete and split into several contigs. I'm pretty sure this makes gene neighborhood assessment less reliable and fragmented genes more likely...
But I very much liked the design of your software and at least wanted to give it a try. Thanks a lot for taking the time and looking into the error.
Cheers Thomas
One possibility would be to use kmerSplit() rather than neighborhoodSplit(), as it ignores gene location at the expense of less sensitivity.
Further, if you want to use the utilities of FindMyFriends but can't use the grouping algorithm on your data, you can always perform the grouping in some other software and import that using manualGrouping()
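To illustrate the import route, a small sketch of turning an external clustering result into one group index per gene, in the gene order of the pangenome. The clusters table, geneIds and toMembership() are all hypothetical. That manualGrouping() accepts such an integer membership vector is my assumption here, so please check ?manualGrouping before relying on it.

```r
# Hypothetical output of an external clustering tool: one row per gene
clusters <- data.frame(
  gene    = c("geneC", "geneA", "geneB", "geneD"),
  cluster = c("fam2", "fam1", "fam2", "fam3"),
  stringsAsFactors = FALSE
)

# Hypothetical gene order as stored in the pangenome object
geneIds <- c("geneA", "geneB", "geneC", "geneD")

# Build an integer membership vector aligned with ids
toMembership <- function(ids, clusters) {
  idx <- match(ids, clusters$gene)
  if (anyNA(idx)) {
    stop("some genes have no cluster assignment")
  }
  labels <- clusters$cluster[idx]
  # Number the groups by first appearance in gene order
  as.integer(factor(labels, levels = unique(labels)))
}

membership <- toMembership(geneIds, clusters)

# Assumed usage, to be verified against the package documentation:
# pg <- manualGrouping(pg, membership)
```

The important part is that the membership vector follows the gene order of the pg object, not the order in which the external tool reported its clusters.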
Hi,
I am trying to build a pan-genome from 193 bacterial strains, with increased sensitivity in cdhitGrouping() to account for high diversity in the set. However, the grouping fails with a fatal error. I am using the latest development version due to #8. Grouping with default settings works, but also issues the warnings - not sure if that's related... Regardless, the number of clusters is much larger than I'd like it to be, i.e. many core orthologs are split into multiple subgroups - hence my attempt to increase sensitivity.
Any help is highly appreciated! Thomas