Streamline predicate/analysis workflow

ielis commented 8 months ago

Fixes #87, #92

Depends on #94

ielis commented 8 months ago

@lnrekerle

I'm proposing a revamp to the CohortAnalyzer.

The CohortAnalyzer is an abstraction: a promise of what the analysis can do for the user. To get a CohortAnalyzer, we use a pattern similar to the one for configuring PhenopacketPatientCreator. There is a configuration function that gives you a CohortAnalyzer:

from genophenocorr.analysis import configure_cohort_analysis

analysis = configure_cohort_analysis(cohort, hpo)

You'll get an analysis with default options. If you want to tweak the options, build the CohortAnalysisConfiguration:

from genophenocorr.analysis import CohortAnalysisConfiguration

# Parentheses let the method chain span several lines.
configuration = (
    CohortAnalysisConfiguration.builder()
    .include_sv(True)
    .pval_correction('fdr_bh')
    .build()
)

analysis = configure_cohort_analysis(cohort, hpo, configuration)
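
For readers unfamiliar with the builder pattern used for the configuration, here is a minimal, self-contained sketch of how such a fluent builder can work. The `AnalysisConfig` and `AnalysisConfigBuilder` classes and their default values are hypothetical placeholders for illustration; they are not the genophenocorr classes, and the real defaults may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisConfig:
    """Hypothetical immutable options object (not the genophenocorr class)."""
    include_sv: bool = False              # placeholder default, an assumption
    pval_correction: str = "bonferroni"   # placeholder default, an assumption


class AnalysisConfigBuilder:
    """Hypothetical fluent builder: every setter returns self so calls chain."""

    def __init__(self):
        self._include_sv = False
        self._pval_correction = "bonferroni"

    def include_sv(self, value: bool) -> "AnalysisConfigBuilder":
        self._include_sv = bool(value)
        return self

    def pval_correction(self, method: str) -> "AnalysisConfigBuilder":
        self._pval_correction = method
        return self

    def build(self) -> AnalysisConfig:
        # Freeze the collected options into an immutable configuration.
        return AnalysisConfig(self._include_sv, self._pval_correction)


# The chain is wrapped in parentheses so it can span several lines.
config = (
    AnalysisConfigBuilder()
    .include_sv(True)
    .pval_correction("fdr_bh")
    .build()
)
print(config)
```

The point of the pattern is that options are collected step by step and only frozen in `build()`, so a half-configured object never reaches the analysis.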

Then we run the analysis, for example to compare MISSENSE variants against all others:

from genophenocorr.model import VariantEffect
from genophenocorr.analysis.predicate import BooleanPredicate

results = analysis.compare_by_variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id='NM_1234.5')
result_df = results.summarize(hpo, BooleanPredicate.YES)
result_df.head()

We get results, a container with a lot of data. We call summarize to prepare a data frame of phenotypes vs. genotypes, ordered by corrected p-values.
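
To make "ordered by corrected p-values" concrete, here is a small standalone sketch of a Benjamini-Hochberg correction followed by sorting. It assumes that 'fdr_bh' denotes Benjamini-Hochberg, as that name does in statsmodels' `multipletests`; whether genophenocorr delegates to statsmodels is not stated in this PR. The term labels and p-values are made up, and the function is an illustration, not the code in this PR.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up nominal p-values keyed by a phenotype label (illustration only).
nominal = {"term A": 0.01, "term B": 0.04, "term C": 0.03, "term D": 0.5}
corrected = dict(zip(nominal, benjamini_hochberg(list(nominal.values()))))

# Order the rows by corrected p-value, smallest first.
ranked = sorted(corrected.items(), key=lambda kv: kv[1])
for term, p_adj in ranked:
    print(f"{term}\tnominal={nominal[term]:.3f}\tcorrected={p_adj:.4f}")
```

Note that the correction can produce ties (here "term B" and "term C" end up with the same corrected value), so the order among tied rows depends on the sort's tie-breaking.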

Note that we pass BooleanPredicate.YES to show genotype-phenotype correlations for HPO terms that are present. To show the correlations for terms that are not present, we would pass BooleanPredicate.NO instead.

This is what the PR adds. Thanks to the changes, we have a general framework for applying genotype and phenotype predicates and showing the results.
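
As a rough illustration of the idea behind such a framework, the sketch below applies one genotype predicate and one phenotype predicate to every patient and counts the combinations. All names here (`Answer`, `ToyPatient`, `has_effect`, `has_term`, `tabulate`) and the data are hypothetical and much simpler than the classes in this PR.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    # Hypothetical stand-in for a yes/no predicate outcome.
    YES = "yes"
    NO = "no"


@dataclass(frozen=True)
class ToyPatient:
    # Hypothetical, simplified patient record (illustration only).
    variant_effects: frozenset
    present_terms: frozenset


def has_effect(effect):
    """Genotype predicate: does the patient carry a variant with this effect?"""
    return lambda p: Answer.YES if effect in p.variant_effects else Answer.NO


def has_term(term):
    """Phenotype predicate: is this term recorded as present in the patient?"""
    return lambda p: Answer.YES if term in p.present_terms else Answer.NO


def tabulate(patients, genotype_predicate, phenotype_predicate):
    """Count patients per (genotype answer, phenotype answer) pair."""
    return Counter(
        (genotype_predicate(p), phenotype_predicate(p)) for p in patients
    )


# Made-up cohort of three patients.
patients = [
    ToyPatient(frozenset({"MISSENSE"}), frozenset({"term X"})),
    ToyPatient(frozenset({"MISSENSE"}), frozenset()),
    ToyPatient(frozenset({"FRAMESHIFT"}), frozenset({"term X"})),
]

counts = tabulate(patients, has_effect("MISSENSE"), has_term("term X"))
for (genotype, phenotype), n in sorted(
    counts.items(), key=lambda kv: (kv[0][0].value, kv[0][1].value)
):
    print(f"genotype={genotype.name} phenotype={phenotype.name} n={n}")
```

Because both predicates share the same small interface (patient in, category out), any genotype predicate can be paired with any phenotype predicate, and the resulting counts can feed whichever statistical test the analysis uses.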

Please check out the code and try it out, and we can discuss it in greater detail next time.

ielis commented 8 months ago

Now that develop is merged into the PR branch, we should be OK to move forward with this PR if the code looks good.

monarch-initiative / genophenocorr

Streamline predicate/analysis workflow #88