Closed julesjacobsen closed 4 years ago
In SimilarityScoreSampling, I cannot figure out what the generic type T is supposed to be used for. I suspect that at some point we refactored this from accepting strings to accepting TermIds, and thus T is no longer needed.
This prevents us from genericising the ObjectIds:
private boolean selectObject(Integer objectId) {
if (options.getMinObjectId() != null && objectId < options.getMinObjectId()) {
return false;
}
return options.getMaxObjectId() == null || objectId <= options.getMaxObjectId();
}
Because it assumes we can manipulate T with "<="
Sorry I did not notice this ticket. I refactored the ObjectScoreDistribution class from Integer as key to generic, because it feels quite constraint to be forced to use Integer in other applications. I did not really understand the H2 database serialization, and still do not know how to use generic on database query:
final int objectId = rs.getInt(2);
This is supposed to be T. I am not sure whether it is possible to do so, otherwise, we may have to revert back to using Integers.
Additionally, this line also seems wrong (first and second variable seems swapped):
return new ObjectScoreDistribution(termCount, objectId, sampleSize, scoreDist);
The class is defined without tests, so the problems were not picked up after refactoring.
I agree that the TextFileScoreDistributionReader also needs to use generic. I will try this later today. If I cannot correct them easily, I will revert the commits. Sorry for this.
@kingmanzhang please write some tests! These will help define the problem and prevent confusion.
Sorry for not able to do this in time. I will give it another try, and either fix it or revert my changes. Currently still working with the generic_patch branch.
I made a last attempt to fix the warnings, but not able to address all. The ' generic_patch ' is the furthest I could mange to get to. It will take more time for someone who understands the score distribution codes better to put class variables there. These classes are only used in the new Phenomiser project. An alternative solution would be to revert two commits I made that were intended to make the scoredistribution class generic. If that's preferred, merge in the 'revert_score_dist_generic' (I recommend doing this after Peter publishes the LIRICAL paper so that you do not need to anything to phenomizer).
I think the score distribution always relates to an object that can be represented with a TermId. Currently, we have ONLY one use case, Phenomizer, and the objects are OMIM diseases, Orphanet diseases, or could be Mondo diseases. So I think we can probably simplify this and hard code TermId instead of the generic parameter. Using Integers is not great because we also need to keep a map from Integer to TermId. There may be some place in the code where we have the objects in an array or matrix where we need the ints -- let's take a look. If not, let's definitely go for TermIds here.
I think also that we can refactor Item2Pvalue
ScoreDistribution is currently this
ScoreDistribution<T extends Serializable>
But I think we can just take T to be a TermId and lose the genericity. There is no other use case in Phenol.
I figured this out and have removed the generics. They are (were) unnecessary in the current implementation and added complexity to the implementing classes, which needed to create an extra index between an Integer and the domain object, which in phenol will always be a TermId. I am testing this now in our Phenomiser clone and if the results are the same I will merge.
Todo -- SimilarityScoreSamplingTest needs to be fixed.
There are a lot of these warnings being thrown in the tests:
Looking at the test code it's not clear what the type ought to be:
What are these generic types used for? Why declare it if you're not going to use it?