greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Reconstruct hetionet #100

Closed danich1 closed 4 years ago

danich1 commented 4 years ago

This section involves reconstructing hetionet by translating sentence scores into an edge representation. An edge can be supported by multiple sentences, so the sentence scores need to be combined into a single edge score. For this project I took a simple approach: compute the mean, max, and median of each sentence group. It turns out that taking the max of each edge group is the best in terms of recalling already established edges.
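The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the entity names, the scores, and the `aggregate_edge_scores` helper are all hypothetical, and it assumes each sentence has already been assigned a score for one candidate (source, target) pair.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sentence-level scores: (source, target, score).
# The real data and its column layout live in the notebooks.
sentence_scores = [
    ("DrugA", "DiseaseX", 0.9),
    ("DrugA", "DiseaseX", 0.2),
    ("DrugA", "DiseaseX", 0.4),
    ("DrugB", "DiseaseY", 0.3),
]


def aggregate_edge_scores(rows):
    """Group sentence scores by edge and summarize each group."""
    grouped = defaultdict(list)
    for source, target, score in rows:
        grouped[(source, target)].append(score)
    # One summary per candidate edge: max, mean, and median of its sentences.
    return {
        edge: {"max": max(scores), "mean": mean(scores), "median": median(scores)}
        for edge, scores in grouped.items()
    }


edge_scores = aggregate_edge_scores(sentence_scores)
```

Each key of `edge_scores` is a candidate edge, and any one of the three summaries can then serve as that edge's score.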

As with the last PR, just take a look at the notebooks.

danich1 commented 4 years ago

Then if the max, mean, or median score is above some threshold you say there exists an edge between X and Y?

Almost. You could argue that an edge only exists given a threshold, but right now each edge has a likelihood score of existing, given by the max value of its sentence scores.

What is your gold standard in this case?

The gold standard here is the edges that are already existing in hetionet.

How did you decide the threshold?

The threshold I chose for that figure was a bit arbitrary (0.5, since the scores are probabilities). Note that 0.5 is not an optimal cutoff for every application. Some applications prefer a higher threshold because they want less noise, or vice versa. Ideally, when I load this into hetionet I'll incorporate all scored edges and just give a "your threshold may vary" disclaimer. That's the beauty of confidence scores.
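To make the threshold trade-off concrete, here is a small sketch of measuring recall against a gold standard at different cutoffs. The edge names, the scores, and the `recall_at_threshold` helper are hypothetical; the sketch only assumes that each candidate edge has a single score (for example its max sentence score) and that the gold standard is a set of known edges.

```python
# Hypothetical max-based edge scores and gold-standard edges.
edge_max_scores = {
    ("DrugA", "DiseaseX"): 0.9,
    ("DrugB", "DiseaseY"): 0.3,
    ("DrugC", "DiseaseZ"): 0.6,
}
gold_edges = {("DrugA", "DiseaseX"), ("DrugB", "DiseaseY")}


def recall_at_threshold(scores, gold, threshold=0.5):
    """Fraction of gold edges whose score is above the threshold."""
    if not gold:
        return 0.0
    predicted = {edge for edge, score in scores.items() if score > threshold}
    return len(predicted & gold) / len(gold)


recall_default = recall_at_threshold(edge_max_scores, gold_edges)       # cutoff 0.5
recall_lenient = recall_at_threshold(edge_max_scores, gold_edges, 0.2)  # lower cutoff
```

Lowering the cutoff recovers more of the gold edges but also admits more unverified edges, which is why keeping every scored edge and letting users pick their own cutoff is attractive.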

I guess the only downside is if there is conflicting evidence.

Very good point. It would be interesting to see the edges that have conflicting evidence. I agree that using the max loses out on interesting information, as noted above. A future todo is to get more creative about how to translate sentences into edges.
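One possible way to surface such edges, offered purely as an illustration and not something implemented in this PR, is to flag edges whose sentence scores disagree widely. The data, the `find_conflicting_edges` helper, and the 0.5 spread cutoff are all assumptions.

```python
# Hypothetical sentence scores grouped by candidate edge.
scores_by_edge = {
    ("DrugA", "DiseaseX"): [0.9, 0.2, 0.4],
    ("DrugB", "DiseaseY"): [0.3],
    ("DrugC", "DiseaseZ"): [0.6, 0.7],
}


def find_conflicting_edges(grouped, min_spread=0.5):
    """Return edges whose highest and lowest sentence scores differ by at least min_spread."""
    return {
        edge
        for edge, scores in grouped.items()
        if max(scores) - min(scores) >= min_spread
    }


conflicting = find_conflicting_edges(scores_by_edge)
```

Here only the first edge is flagged, since one sentence scores it highly while another scores it low. Under max aggregation that disagreement would be hidden behind a single high score.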