monarch-initiative / genophenocorr

Genotype Phenotype Correlation
https://monarch-initiative.github.io/genophenocorr/stable
MIT License
4 stars 1 forks source link

fixed the count issue with summarize #145

Closed lnrekerle closed 2 months ago

lnrekerle commented 3 months ago

closes #141

Fixed the count issue so it now shows (for example) the total "yes"s in a phenotype/genotype category over the total available samples in a phenotype category.

Also updated the naming for Disease in _phenopacket.py

lnrekerle commented 3 months ago

@pnrobinson

I'm confused as to what the 1 in "1/14" would be in your example? Would this mean that one patient of the 14 total patients with a nonsense mutation shows as having Arachnodactyly?

My understanding is that it should be (total patients with variant and trait)/(total patients with trait). Which would still make what I submitted incorrect for comparing a poly predicate, since the total with one variant added to the total with another variant would not necessarily be all the patients with the trait.

ielis commented 3 months ago

Hi @pnrobinson

so, we're talking about a situation like this:

Yes missense Not missense Total
Yes Arachnodactyly 1 13 14
No Arachnodactyly 3 9 12
Total 4 22 26

In both the original and the new setting, we were reporting counts for Yes Arachnodactyly. That's what the magic constant BooleanPredicate.YES is for.

However, in the original setting, we were reporting 1/26 and 13/26. Now, the code reports 1/14 and 13/14, normalizing the counts to the Yes Arachnodactyly total.

@pnrobinson does this sound wrong? If yes, can you please suggest what should be done instead?

pnrobinson commented 3 months ago

@lnrekerle part of the confusion is that in the end we want to know four numbers -- the 2x2 table. We are presenting a sort of summary of these numbers, but it seems it is not clear how to go from the summary to check the numbers in the 2x2 table.

If we have two columns, say NONSENSE and MISSENSE, and we are looking at one HPO term HP1, then imagine we have these numbers NONSENSE HP1 present 1, HP1 absent 4 (total NONSENSE 5) MISSENSE HP1 present 8, HP1 absent 2 (total MISSENSE 10)

then in the NONSENSE column I think we should show 1/5 (20%)and in the MISSENSE column 8/10 (80%) Does this make sense? What are we showing now?

ielis commented 3 months ago

@pnrobinson we both posted a message around the same time. Can you pls check my message above to see if we are talking about the same thing?

pnrobinson commented 3 months ago

It does sound wrong. First of all, there are 26 patients but 14+14=28! Think of the column - it is for Yes missense -- there are just 4 patients in this column. Arachnodactyly should thus be 1/3. Nonsense should be 13/22.

pnrobinson commented 3 months ago

@ielis @lnrekerle I added a unit test to help to clarify the way we count -- this is pretty complicated and I am not sure of the best way to get the counts for any given term

See test_get_count in https://github.com/monarch-initiative/genophenocorr/blob/fixing_counts/tests/test_analysis.py

I am wondering if it would be better to define convenience functions that are like this, where hpo_p_gt_m is the count for HPO plus (p) and genotype minus (m).

def get_counts_tuple(df:pd.DataFrame, hpo_id:hpotk.TermId):
   # do some black magic
   return (hpo_p_gt_p, hpo_p_gt_m, hpo_m_gt_p, hpo_m_gt_m)

This would allow client code to only worry about statistics and display. As it is, the client code essentially needs to know a lot about the way the analysis code is arranging the data. Alternatively, could we write more detailed unit tests, something like that one I wrote, that show and document in detail how to understand the data structure? 😄

I am thinking it is probably better to have separate getter functions for 2x2 and 2x3, otherwise again the client code needs to know too much and it makes the code for statistics pretty difficult to write. At some point, we might also want to return something like a number and do some other test (t test, Cox), and so we will need to have different gettings at the end of the day.

ielis commented 3 months ago

Hi @pnrobinson thanks a lot for the test! Regarding your points.

I am wondering if it would be better to define convenience functions that are like this, where hpo_p_gt_m is the count for HPO plus (p) and genotype minus (m).

Yeah, we have the counts of each phenotype/genotype within the self._all_counts data frame. I think it may be somewhat hard to grok and debug. We can perhaps format the count as typing.Mapping[hpotk.TermId, pd.DataFrame]?

It would be a dict where the key is the HPO term (or OMIM/MONDO for the other phenotype meaning), and the value is the contingency table. This would be easy to achieve.

However, I am not sure we should be working with tuples, because the tuples do not easily generalize to 2x3 tables, or FWIW mxn tables. I think it is better to keep the counts in a data frame, along with the associated categories in index and column of the frame.

I am thinking it is probably better to have separate getter functions for 2x2 and 2x3, otherwise again the client code needs to know too much and it makes the code for statistics pretty difficult to write. ...

This is something we may need to talk about. By the client code, you mean the statistical tests, right? Would it be easier to write the statistics code starting from the dict I described above? It would still, however, have to deal with an m x n contingency table. I think it is OK for the statistics code to give up on 4x5 table if it only supports 2x2 tables. My aim is to ensure separation of the application of predicates (counting) and the downstream analysis. So we need to write this up in code. Hence, talk about first.

pnrobinson commented 3 months ago

For the MTC code (and where we decide what test to do and what test to leave out), we need to do things one term at a time. I think that we can do with a number of different options, but the easiest thing by far would be to have the dictionary return a data structure that is the same as we ned for the statistics functions -- I think this is a Python array of two subarrays for Fisher and something like that for the 2x3. This is an implementation detail and not something that users should need to care about, so as long as we have this well documented and tested, it would be fine and it would make the client code be a lot simpler.

ielis commented 3 months ago

@pnrobinson I reworked almost the entire core code to support all sorts of "phenotype" concepts, not just HPO terms. We'll work with @lnrekerle on implementing the predicate for using disease IDs in place of as "phenotype".

There almost no breaking changes in the API and I fixed the one breaking change everywhere I could. In brief, we do not use BooleanPredicate.YES anymore, but we do something like:

from genophenocorr.analysis.predicate import PatientCategories
result.summarize(hpo, PatientCategories.YES)

to visualize the results.

I updated the notebooks, the tutorial in the docs should also be current, as well as the test you wrote for this PR.

@pnrobinson please try to run a notebook or see the updated test to check how this works. Thank you!

pnrobinson commented 2 months ago

@ielis @lnrekerle checked this again and it looks good, I will merge!