hetio / medline

Computing term cooccurrence in MEDLINE
https://doi.org/10.15363/thinklab.d67
16 stars 4 forks source link

Expected denominator is wrong causing underestimates of enrichment #1

Closed LarsAC closed 3 years ago

LarsAC commented 3 years ago

Hello,

I am unsure about the denominator in the formula referenced below. Should expected not be equal to the product of the two frequencies of occurrence ? In that case, wouldn't the denominator be equal to the square of total_pmids ? It may be splitting hairs and not make a large difference overall, but I'd like to understand the maths behind the code a little more.

https://github.com/hetio/medline/blob/0c9e2905ccf8aae00af5217255826fe46cba3d30/cooccurrence.py#L58

Thanks, Lars

dhimmel commented 3 years ago

I think you're right and you found a bug!

It may be splitting hairs and not make a large difference overall

Seems to me like it will have a major impact on the expected and enrichment scores (although not their ranking). The measures deriving from the contingency table and fisher's exact test don't appear to be affected.

Would you be interested in submitting a pull request to fix that code?

I also noticed a typo in the docstring:

https://github.com/hetio/medline/blob/0c9e2905ccf8aae00af5217255826fe46cba3d30/cooccurrence.py#L22-L23

The second line should be term1_to_pmids.

LarsAC commented 3 years ago

Hi Daniel,

thanks for quick confirmation. I will submit a PR, am just trying to reproduce the old vs. new data to understand what impact it might make. Once done I'll submit the change.

Lars

LarsAC commented 3 years ago

PR is created and ready for your review.

I ran the code before / after the fix. As expected, odds ratio and fisher value do not change. Values for expectedare now obviously much smaller (e.g. 0.000234 vs 1246.7 for Alzheimer's/Parkinsons's disease-disease cooccurrence) and enrichmentvalues become much large (in the range of 10^3...10^7).

dhimmel commented 3 years ago

Thanks @LarsAC, I merged https://github.com/dhimmel/medline/issues/1. If you end up regenerating any of the datasets with corrected numbers, feel free to continue making PRs. Really appreciate your initiative reporting the issue and helping fix it!

dhimmel commented 3 years ago

I thought about this some more and I actually think the original formation was correct. Here are the two options:

# original
expected = len(pmids0) * len(pmids1) / total_pmids

# revised by https://github.com/hetio/medline/pull/2
expected = len(pmids0) * len(pmids1) / total_pmids**2

In the revised formula, expected ends up being the probability that two-terms will cooccur for a single pubmed_id, since it's the probability of the pubmed_id being assigned term 0 multiplied by the probability of pubmed_id being assigned term 1.

But expected is an expected count rather than a probability. To convert the probability to an expected count, we multiply by the total number of pubmed_ids. Does that make sense?

I'll go ahead and revert #2.