Assessing term co-occurrence across sentences

dhimmel commented 6 years ago

@danich1 has extracted a bunch of sentences that include both a gene and a disease. He's computing summary statistics for gene-disease pairs. One basic measure is the number of sentences that mention both the gene and the disease. Now we want to compute the expected number of such sentences, given the marginal frequencies of the gene and the disease.

Let's take an approach similar to computing MEDLINE term co-occurrence. For each gene-disease pair, you'll need to compute the values of a contingency table where:

a is the number of sentences containing both the gene and the disease (co-occurrence)
b is the number of sentences containing the gene but not the disease
c is the number of sentences containing the disease but not the gene
d is the number of sentences without either the gene or disease
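
As a rough sketch of how the four cells could be tallied (not the code used in this repo): the input format here is an assumption, with each sentence given as a pair of sets, one of gene identifiers and one of disease identifiers. The function name `contingency_tables` is made up for illustration.

```python
from collections import Counter
from itertools import product


def contingency_tables(sentences):
    """Return {(gene, disease): (a, b, c, d)} for every co-occurring pair.

    `sentences` is an iterable of (genes, diseases) pairs, where each
    element is a set of identifiers (an assumed input format).
    """
    gene_counts = Counter()
    disease_counts = Counter()
    pair_counts = Counter()
    total = 0
    for genes, diseases in sentences:
        # Only count sentences that mention at least one gene and one disease.
        if not genes or not diseases:
            continue
        total += 1
        gene_counts.update(set(genes))
        disease_counts.update(set(diseases))
        pair_counts.update(product(set(genes), set(diseases)))

    tables = {}
    for (gene, disease), a in pair_counts.items():
        b = gene_counts[gene] - a        # gene without the disease
        c = disease_counts[disease] - a  # disease without the gene
        d = total - a - b - c            # neither
        tables[(gene, disease)] = (a, b, c, d)
    return tables
```

Counting the marginals once and deriving b, c, and d by subtraction avoids a second pass over the sentences for every pair.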

We should limit ourselves to sentences that contain both a gene and a disease. You'll be able to compute the expected count and the p-value from a Fisher's exact test using code like here.
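
For reference, here is a minimal standard-library sketch of a one-sided Fisher's exact test on the a, b, c, d cells. It is not the linked code, and the choice of the "greater" alternative (testing for more co-occurrence than chance) is an assumption. `fisher_exact_greater` is a made-up name.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: P(co-occurrences >= a).

    With both margins fixed, the co-occurrence count follows a
    hypergeometric distribution.
    """
    n = a + b + c + d   # all sentences considered
    row = a + b         # sentences with the gene
    col = a + c         # sentences with the disease
    # Sum the hypergeometric probabilities of tables at least as extreme as a.
    numerator = sum(
        comb(row, k) * comb(n - row, col - k)
        for k in range(a, min(row, col) + 1)
    )
    return numerator / comb(n, col)
```

If SciPy is available, `scipy.stats.fisher_exact([[a, b], [c, d]], alternative="greater")` should give the same one-sided p-value along with an odds ratio.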

The expected number of sentences is actually quite easy to compute. You just multiply the number of sentences with the gene by the number of sentences with the disease, then divide by the total number of sentences (only considering sentences with both a gene and a disease). In terms of the table above, that's (a + b) * (a + c) / (a + b + c + d).
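
A small sketch of that calculation, taking the four contingency-table cells as inputs (the function name `expected_cooccurrence` is hypothetical):

```python
def expected_cooccurrence(a, b, c, d):
    """Expected number of sentences with both terms if they were independent."""
    total = a + b + c + d        # sentences with both a gene and a disease
    with_gene = a + b            # marginal count for the gene
    with_disease = a + c         # marginal count for the disease
    return with_gene * with_disease / total


# Example: 4 gene sentences and 4 disease sentences out of 8 total.
print(expected_cooccurrence(3, 1, 1, 3))
```

Comparing the observed count a against this value shows whether the pair co-occurs more or less often than the marginals alone would suggest.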

dhimmel commented 6 years ago

@danich1 the convention is to only close the issue once the related code has been merged to master. Oftentimes, in your first pull request comment, you'll put something like Closes #26, so the issue gets automatically closed once the PR is accepted.

danich1 commented 6 years ago

Ah I jumped the gun. Will keep that in mind.

greenelab / snorkeling
