monarch-initiative / phenol

phenol: Phenotype ontology library
https://phenol.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

PrecomputingPairwiseResnikSimilarity and maximum array size #234

Closed pnrobinson closed 4 years ago

pnrobinson commented 5 years ago

This method is crashing because the number of input GO terms was too high. In one attempt, 49934 * 49934 = 2,493,404,356, but the maximum value of an int is 2,147,483,647. This is also the maximum size of an array, and so this line

   data = new float[termIdCount * termIdCount];

caused

termIdCount=49934
termIdCount * termIdCount=-1801562940
Exception in thread "main" java.lang.NegativeArraySizeException

We should at least put in a warning that this kind of overflow has happened.
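A minimal sketch of such a warning, assuming nothing about phenol internals: widen to long before multiplying so the overflow is detectable, and fail fast with a clear message instead of letting `new float[...]` throw a `NegativeArraySizeException`. The class and method names here are hypothetical.

```java
// Hypothetical sketch: detect the overflow instead of letting the int
// multiplication silently wrap into a negative array size.
public class OverflowCheckDemo {

  static int squaredSize(int termIdCount) {
    long size = (long) termIdCount * termIdCount; // widen BEFORE multiplying
    if (size > Integer.MAX_VALUE) {
      throw new IllegalArgumentException(
          "Cannot allocate similarity matrix for " + termIdCount
              + " terms: " + size + " elements exceeds the maximum Java array length");
    }
    return (int) size;
  }

  public static void main(String[] args) {
    System.out.println(squaredSize(1000)); // 1000000
    try {
      squaredSize(49934); // would need 2,493,404,356 slots
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```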

pnrobinson commented 5 years ago

The issue is that GO has so many terms (49934) and the Resnik module computes the pairwise similarity for all terms at once. This is probably not what we want for downstream applications, but this will require substantial refactoring (e.g., calculating the similarities only for terms actually used for annotation of some dataset). Thoughts, anybody?

kingmanzhang commented 5 years ago

I wonder whether it really helps to use an array instead of a map to store the precomputed scores.

pnrobinson commented 4 years ago

@julesjacobsen I think that the best solution is to calculate similarity only for GO terms that are actually used in some annotation file (plus the graph induced by their ancestors). I will implement this unless there is a better idea. Java cannot use longs to index arrays (https://stackoverflow.com/questions/27333879/can-you-index-an-array-with-a-long-int), so I do not see any better way.
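The "graph induced by their ancestors" idea can be sketched as a simple upward traversal. This is a stand-alone illustration, not phenol code: term IDs are plain Strings and the parent map stands in for the real ontology graph.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed filtering: keep only the terms used in
// an annotation set plus all of their ancestors (the induced graph), so the
// similarity matrix covers far fewer terms than the full ontology.
public class InducedTermSet {

  static Set<String> inducedTerms(Collection<String> usedTerms,
                                  Map<String, List<String>> parents) {
    Set<String> result = new HashSet<>();
    Deque<String> stack = new ArrayDeque<>(usedTerms);
    while (!stack.isEmpty()) {
      String term = stack.pop();
      if (result.add(term)) { // visit each term once
        for (String parent : parents.getOrDefault(term, List.of())) {
          stack.push(parent);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> parents = Map.of(
        "GO:C", List.of("GO:B"),
        "GO:B", List.of("GO:A"));
    // Only GO:C is annotated, but its ancestors GO:B and GO:A are induced.
    System.out.println(inducedTerms(List.of("GO:C"), parents));
  }
}
```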

julesjacobsen commented 4 years ago

Not having looked at the implementation (what class are we talking about here?), isn't it possible to stream/write the results out rather than having to return the entire result set?

Is this the line in question?:

https://github.com/monarch-initiative/phenol/blob/1de5b12e64fa344fec0107ce5c3135a786e958ba/phenol-core/src/main/java/org/monarchinitiative/phenol/ontology/similarity/PrecomputingPairwiseResnikSimilarity.java#L173

Does it use a 49934*49934 matrix?

julesjacobsen commented 4 years ago

How about this? It's a simple solution, but given we're storing a symmetric matrix, we need only store half of it, right?

  // Requires: java.io.Serializable, java.util.Collection, java.util.HashMap,
  // and com.google.common.collect.ImmutableSortedSet
  private static final class PrecomputedScores implements Serializable {

    private static final long serialVersionUID = -6390653194662991513L;

    /** Mapping from term ID to term index. */
    private final HashMap<TermId, Integer> termIdToIdx;

    /** Internal storage of the similarity scores as matrix of floats. */
    private final float[][] data;

    PrecomputedScores(Collection<TermId> termIds) {
      int termIdCount = termIds.size();
      data = new float[termIdCount][termIdCount];
      termIdToIdx = new HashMap<>(termIdCount);

      int i = 0;
      for (TermId termId : ImmutableSortedSet.copyOf(termIds)) {
        termIdToIdx.put(termId, i++);
      }
    }

    /** Set score. */
    public void put(TermId lhs, TermId rhs, double value) {
      put(lhs, rhs, (float) value);
    }

    /** Set score. */
    public void put(TermId lhs, TermId rhs, float value) {
      final int idxLhs = termIdToIdx.get(lhs);
      final int idxRhs = termIdToIdx.get(rhs);
      data[idxLhs][idxRhs] = value;
      data[idxRhs][idxLhs] = value;
    }

    /** Get score. */
    public float get(TermId lhs, TermId rhs) {
      final Integer idxLhs = termIdToIdx.get(lhs);
      final Integer idxRhs = termIdToIdx.get(rhs);
      if (idxLhs == null || idxRhs == null) {
        return 0.0f;
      } else {
        return data[idxLhs][idxRhs];
      }
    }
  }
pnrobinson commented 4 years ago

What about combining two things

  1. Filtering the list of terms for those terms that are actually used. The information content of the other terms is not defined (or can be set to zero), but there is no need to store it.
  2. Storing the remaining symmetric matrix as one array using this scheme: http://wwwuser.gwdg.de/~applsw/Parallelrechner/sp_documentation/essl/essl133.html

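The lower-packed scheme from the ESSL link could be sketched in Java roughly like this (a hypothetical stand-alone class, not phenol code): store only the lower triangle of an n-by-n symmetric matrix in a single array of n*(n+1)/2 floats, indexed by i*(i+1)/2 + j for i >= j.

```java
// Sketch of lower-packed symmetric storage: one float array of n*(n+1)/2
// elements instead of an n-by-n matrix. Valid for n up to 65535, where
// n*(n+1)/2 still fits in the maximum Java array length.
public class PackedSymmetricMatrix {

  private final float[] data;

  PackedSymmetricMatrix(int n) {
    data = new float[(int) ((long) n * (n + 1) / 2)];
  }

  private static int index(int i, int j) {
    if (i < j) { int tmp = i; i = j; j = tmp; } // ensure i >= j
    return (int) ((long) i * (i + 1) / 2 + j); // widen: i*(i+1) can overflow int
  }

  public void put(int i, int j, float value) {
    data[index(i, j)] = value; // one write covers both (i,j) and (j,i)
  }

  public float get(int i, int j) {
    return data[index(i, j)];
  }
}
```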
julesjacobsen commented 4 years ago

What's with the insistence on using a list to represent the matrix? Is a matrix really that much slower? It's a lot simpler to represent and won't suffer from the array length issue quite so readily. Given this is the square of the number of terms in the ontology, it could readily be reached with large merged ontologies, even when filtering for used terms. However, given the matrix is also very sparse, using a dense matrix might not be the best solution either.

Depending on the temporal separation of get() and put(), assuming all put() operations happen before all get() operations, for large graphs an adjacency list could be used and finalised as a compressed sparse row (CSR) layout. We could also use the fact that TermId implements Comparable and only store the ordered term pairs, if we're really optimising for space. The optimal solution would be to choose the implementation based on the number of input terms.

Or store the list of non-zero similarities in memory using some form of triplet format.
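The ordered-pair idea could be sketched like this (a hypothetical stand-alone class using String IDs in place of TermId): because term IDs are Comparable, a symmetric score needs to be stored only once, under the sorted (lhs, rhs) pair, with missing pairs defaulting to zero.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a sparse, ordered-pair score store: each symmetric score is
// keyed once under the lexicographically sorted pair, so only non-zero
// similarities occupy memory.
public class PairKeyedScores {

  private final Map<String, Float> scores = new HashMap<>();

  private static String key(String lhs, String rhs) {
    // Sort the pair so (a, b) and (b, a) share one key.
    return lhs.compareTo(rhs) <= 0 ? lhs + "|" + rhs : rhs + "|" + lhs;
  }

  public void put(String lhs, String rhs, float value) {
    scores.put(key(lhs, rhs), value);
  }

  public float get(String lhs, String rhs) {
    return scores.getOrDefault(key(lhs, rhs), 0.0f); // absent pair -> zero
  }
}
```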

pnrobinson commented 4 years ago

The matrix actually is completely dense -- this is representing pairwise similarity, not the terms themselves. The maximum length of an array in Java is 2,147,483,647 (the maximum size of an int, 2^31 − 1). We have

math.sqrt(2147483647)
46340.950001051984

For the array, we need n(n+1)/2 elements, and we could do this even without filtering out the unused terms; that would be 1,246,727,145 elements for the current 49934 GO terms. Another strategy to save space is to require a list of all GO terms that will be used for the analysis. Usually, this is MUCH smaller than the total count of all GO terms. This is also easier to implement.
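A quick arithmetic check of the n(n+1)/2 claim: packed storage for the current 49934 GO terms fits comfortably in one array, and the packed scheme raises the largest representable n from 46340 (full matrix) to 65535.

```java
// Verifies the packed-storage arithmetic discussed above using long math,
// so none of the intermediate products overflow.
public class PackedSizeCheck {

  static long packedSize(long n) {
    return n * (n + 1) / 2; // elements needed for a packed symmetric matrix
  }

  public static void main(String[] args) {
    System.out.println(packedSize(49934));                      // 1246727145
    System.out.println(packedSize(65535) <= Integer.MAX_VALUE); // true
    System.out.println(packedSize(65536) <= Integer.MAX_VALUE); // false
  }
}
```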

julesjacobsen commented 4 years ago

Would there be any way to enforce only getting the terms to be used for the analysis? What if you really didn't know or actually wanted the whole set? Is there a different class for that use case?

pnrobinson commented 4 years ago

For the intended use case (Gene Ontology), we start from an annotation file (see http://current.geneontology.org/products/pages/downloads.html). This file contains all of the annotations used to annotate the set of human genes, and by no means all of the GO terms are used for any one species. Therefore, we can know before doing the analysis which terms to expect. Obviously, in software engineering "never say never" is a good rule when thinking about what kind of input your program might receive, but in this case I think it is actually a safe assumption. Nonetheless, your suggested solution would solve the maximum array size issue, as would the n(n+1)/2 solution, so maybe the simplest thing for now is to implement one of these and see whether there is a need to optimize further. We can add a flag to take just the GO terms that are used in the annotation set.

julesjacobsen commented 4 years ago

I can commit the above code, with a more comprehensive unit test if you like.

pnrobinson commented 4 years ago

@julesjacobsen Do you have code to commit for this? I have fixed a number of GO-related issues and would hope to have this working for the 1.6.0 release. If not I can have a hack at it!

julesjacobsen commented 4 years ago

I already added a commit to fix this. Should be listed in the thread.

pnrobinson commented 4 years ago

Sorry, I must have missed this. So we limit now to terms that are actually used for annotations? Can we close? @julesjacobsen

julesjacobsen commented 4 years ago

I'm not near a computer right now to double check, but if you're happy with the change, feel free to close!

pnrobinson commented 4 years ago

Thanks! This will be subject to additional testing in the future when we move forward with the GO algs, but let's close for now!