monarch-initiative / genophenocorr

Genotype Phenotype Correlation
https://monarch-initiative.github.io/genophenocorr/stable
MIT License
4 stars 1 forks source link

Added testing for disease predicates #151

Closed lnrekerle closed 1 month ago

lnrekerle commented 2 months ago

closes #137

I added a disease test, and proof that it works in the RPGRIP1 notebook. A couple questions before I push this tho:

Thank you @pnrobinson @ielis

ielis commented 2 months ago

Hi @lnrekerle

thanks for the code! I added a few improvements/fixes to the code.

Regarding the questions.

Did we want to run all diseases (like we do currently with phenotypes) or have the user give a specific disease (what we're currently doing)?

I think we should let the user to provide a set of "specific diseases" - these are the diseases the user is interested to see. Let me call these query diseases. The query diseases act as a filter. If we have a cohort with 5 diseases and the user provides 2 query diseases, at most 2 query diseases will make it into results. If zero or (perhaps) just one of the query diseases is in fact represented in the cohort, the code should probably explode since this is likely a bug. Otherwise, we would either return an empty result or a result with correlation for one disease only, which IMO is not what the user wants to see.

Besides supporting the query diseases, IMO many times the users will want to see all diseases of the cohort. We should make this easy. One way to do this is to write a function like this:

class CohortAnalysis:

    # bla bla bla

   def compare_genotype_by_diseases(
        self,
        genotype_predicate: GenotypePolyPredicate,
        disease_ids: typing.Optional[typing.Sequence[hpotk.TermId]] = None,
   ) -> GenotypePhenotypeAnalysisResult:
        pass

The function takes a genotype poly predicate, which can be (as you know) anything, and an optional sequence of disease IDs formatted as hpotk.TermId. If disease_ids==None, then the user wants to see all diseases. If len(disease_ids) in {0, 1} we should probably explode because of what I wrote above. Otherwise, we use the disease_ids as a filter.


Should I rewrite compare_disease_vs_genotype so that the user gives a function they would normally see rather than having to give a class that's normally hidden? Or is the way it is right now good for now? This could be a rather large rewrite for that file (_gp_impl.py)

This question is a little hard for me to understand, I'll try my best though. I think we are approaching a point where we will have to write a function like this:

def compare_genotype_by_phenotype(
    self,
    genotype_predicate: GenotypePolyPredicate,
    phenotype_predicate: PhenotypePolyPredicate,
) -> GenotypePhenotypeAnalysisResult:
    pass

This will become a central point of the entire analysis. All convenience functions that we currently have on CohortAnalysis, such as compare_by_protein_feature_type or compare_by_variant_key will actually boil down to compare_genotype_by_phenotype. So, most of the time, the user will not have to deal with a "class that is normally hidden" (e.g. ExonPredicate or PropagatingPhenotypePredicate) but the user will call one of the convenience function. However, when the convenience function is not there, the user will have to create genotype/phenotype predicate and run the analysis.

This may seem like a huge change, but we actually +- have this in GpCohortAnalysis._apply_poly_predicate. We may need to discuss the requirements and tweak this a bit at some but it should not be too much work.

This sounds like a good subject for the next meeting. Please think about the requirements. Right now we _apply_poly_predicate takes an iterable of phenotype predicates and single genotype poly predicates. It would be good to think about the cardinalities, especially the corner cases of the phenotype part. E.g. what happens with patients diagnosed with multiple diseases.