monarch-initiative / owlsim-v3

Ontology Based Profile Matching

co-annotation analytics #1

Open nlwashington opened 9 years ago

nlwashington commented 9 years ago

from the old sim2 codebase, i took a crack at computing a co-annotation matrix using the term frequency-inverse document frequency algorithm (TF-IDF). that code is in these methods:

computeTFIDFMatrix
getCoannotatedClassesForAttribute
getCoAnnotatedClassesForIndividual
getCoAnnotatedClassesForAttributes
getCoAnnotatedClassesForMatches
populateFullCoannotationMatrix
getSubsetCoannotationMatrix
initCoannotationMatrix

this needs to be ported from sim2 and refactored. it worked in my tests, but the performance was terrible once i scaled up to actual full-size data. i think the refactor will need to use a sparse matrix.
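To make the sparse-matrix idea concrete, here is a minimal sketch in plain Java. It is not the sim2 code, and the scoring is only one plausible TF-IDF-style formulation, assumed for illustration: each individual is treated as a "document" and each class as a "term", so that score(a, b) = (co(a, b) / n(a)) * ln(N / n(b)). The class name `SparseCoannotation` and all of its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch only: sparse co-annotation counts with a TF-IDF-style score (assumed formulation). */
class SparseCoannotation {
    // Nested maps hold only the class pairs that actually co-occur, so memory
    // grows with the number of observed pairs rather than (number of classes)^2.
    private final Map<String, Map<String, Integer>> pairCounts = new HashMap<>();
    // n(c): number of individuals annotated with class c.
    private final Map<String, Integer> classCounts = new HashMap<>();
    // N: total number of individuals seen so far.
    private int numIndividuals = 0;

    /** Registers the set of classes annotated on one individual. */
    void addIndividual(Set<String> classes) {
        numIndividuals++;
        for (String a : classes) {
            classCounts.merge(a, 1, Integer::sum);
            Map<String, Integer> row = pairCounts.computeIfAbsent(a, k -> new HashMap<>());
            for (String b : classes) {
                if (!a.equals(b)) {
                    row.merge(b, 1, Integer::sum);
                }
            }
        }
    }

    /** co(a, b): number of individuals annotated with both a and b. */
    int coCount(String a, String b) {
        return pairCounts
                .getOrDefault(a, Collections.<String, Integer>emptyMap())
                .getOrDefault(b, 0);
    }

    /** Assumed score: fraction of a's individuals that also carry b, damped by how common b is. */
    double score(String a, String b) {
        int co = coCount(a, b);
        if (co == 0) {
            return 0.0;
        }
        double tf = (double) co / classCounts.get(a);
        double idf = Math.log((double) numIndividuals / classCounts.get(b));
        return tf * idf;
    }

    /** Classes co-annotated with a, highest score first. */
    List<String> rankedCoannotations(String a) {
        List<String> out = new ArrayList<>(
                pairCounts.getOrDefault(a, Collections.<String, Integer>emptyMap()).keySet());
        out.sort((x, y) -> Double.compare(score(a, y), score(a, x)));
        return out;
    }
}
```

Adding an individual with k classes costs O(k^2), so very large profiles may still need pruning before they are added; a library-backed sparse matrix would be a natural replacement for the nested maps.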

once ported, these methods will provide the calls that services need to retrieve commonly co-annotated classes

nlwashington commented 9 years ago

math-commons (Apache Commons Math) now has sparse matrix classes that could be useful here. http://www.cs.waikato.ac.nz/ml/weka/ has a whole ton of machine learning algorithms that we might find useful (and some visualization stuff too). and there's a very basic matrix library, which has some nice convenience functions: http://la4j.org/

tudorgroza commented 9 years ago

I found this library useful for working with matrices: https://sites.google.com/site/qianmingjie/home/toolkits/laml

and this one for graphs: http://www.i3s.unice.fr/~hogie/grph/

nathandunn commented 9 years ago

Math commons has a nice ecosystem if needs go beyond matrices. I haven't used weka for a while, but it was much more geared to statistical processing and visualization of data sets.

apseyed commented 9 years ago

Hi there, there is also Mahout, which has some nice libraries for vector operations. I've briefly used the RandomAccessSparseVector and SparseMatrix classes, and I believe there are some classes for similarity measures, e.g., cosine. There is a book and an active mailing list for that project, fwiw.
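For reference, cosine similarity over sparse vectors is small enough to sketch without any library. The example below deliberately avoids Mahout's own API (so as not to misquote its signatures) and uses a plain `Map<Integer, Double>` from index to non-zero value as a stand-in for a sparse vector; the class name `SparseCosine` is hypothetical.

```java
import java.util.Map;

/** Sketch only: cosine similarity for sparse vectors stored as index -> non-zero value. */
class SparseCosine {
    /** Dot product that only visits the entries of the smaller vector. */
    static double dot(Map<Integer, Double> u, Map<Integer, Double> v) {
        Map<Integer, Double> small = u.size() <= v.size() ? u : v;
        Map<Integer, Double> large = (small == u) ? v : u;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Returns 0.0 when either vector has zero norm, instead of dividing by zero. */
    static double cosine(Map<Integer, Double> u, Map<Integer, Double> v) {
        double normU = Math.sqrt(dot(u, u));
        double normV = Math.sqrt(dot(v, v));
        if (normU == 0.0 || normV == 0.0) {
            return 0.0;
        }
        return dot(u, v) / (normU * normV);
    }
}
```

Applied to rows of a TF-IDF matrix, this compares two classes by the whole pattern of classes they co-occur with, not by a single pairwise count.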

tudorgroza commented 9 years ago

I've added the implementation for TF-IDF - or, more concretely, a version adapted to pairs of terms. There is one issue, though: knowledgeBase.getTypesBM(individualId) returns, among other things, OWL:Thing or MP:000001 (in the example I've used), which should be discarded by default from the resulting TF-IDF ranking.

Here's the question: Is the class_index of OWL:Thing hardcoded, or is it always different? How can I find it without explicitly retrieving it from the KB via the classId?
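One way to sidestep the index question, sketched below, is to avoid looking up the root classes at all: any class that is set for every individual (as OWL:Thing and an ontology root would be) can be found by intersecting the per-individual type bitmaps, and under a standard IDF of ln(N / n) such a class would get weight ln(1) = 0 anyway. This does not answer whether the class_index is fixed in the KB. The sketch uses `java.util.BitSet` purely as a stand-in for whatever bitmap type getTypesBM returns, and `UniversalClassFilter` is a hypothetical name.

```java
import java.util.BitSet;
import java.util.List;

/** Sketch only: drop classes shared by every individual, without knowing their indices. */
class UniversalClassFilter {
    /** Intersects all bitmaps; the surviving bits are classes set for every individual. */
    static BitSet universalClasses(List<BitSet> typesPerIndividual) {
        BitSet universal = null;
        for (BitSet types : typesPerIndividual) {
            if (universal == null) {
                universal = (BitSet) types.clone();
            } else {
                universal.and(types);
            }
        }
        return universal == null ? new BitSet() : universal;
    }

    /** Returns a copy of types with the universal classes cleared; the input is not modified. */
    static BitSet withoutUniversal(BitSet types, BitSet universal) {
        BitSet out = (BitSet) types.clone();
        out.andNot(universal);
        return out;
    }
}
```

The intersection would be computed once over the whole KB and reused, so the per-ranking cost is a single `andNot`. On a small KB it could also remove a non-root class that happens to be shared by every individual, which may or may not be desirable.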

jmcmurry commented 8 years ago

@cmungall Reviving this very old thread to ask whether there's been any recent discussion? This feature is important in order to aid the deep phenotyping, for mod researchers, physicians, and patients alike.

cmungall commented 8 years ago

not yet

On 12 Apr 2016, at 9:33, Julie McMurry wrote:

@cmungall Reviving this very old thread to ask whether there's been any recent discussion? This feature is important in order to aid the deep phenotyping, for mod researchers, physicians, and patients alike.

