greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Is there a snorkel_labels_train.xlsx file anywhere? #108

Open jambo6 opened 2 years ago

jambo6 commented 2 years ago

I'd like to utilise these labels for another project. It seems the folder

snorkeling/disease_gene/disease_associates_gene/data/sentences

should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

danich1 commented 2 years ago

This folder only contains sentences that were hand labeled for this project. The train version isn't available because it is supposed to consist of all the remaining documents within PubTator. The resulting file would be too big for GitHub to host on LFS (max file size is 2GB).

Currently, the main way to get those sentences is to download a snapshot of PubTator Central and extract the sentences into a database. Alternatively, I have a snapshot of the database used for this project that you could import (118GB); however, we would need to figure out how to transfer a file that large. My overall recommendation is the first option, since it gives you the most current version for whichever project you are working on.
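
To make the first option a bit more concrete, here is a minimal sketch of the "extract sentences into a database" step. It is not the pipeline used in this repository. It assumes the snapshot is in the PubTator-style text layout (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines, with a blank line between documents). The inline `SAMPLE` data, the `sentence` table, and the function names are all hypothetical, and the regex sentence splitter is a naive stand-in for a proper NLP tokenizer.

```python
import re
import sqlite3

# Hypothetical two-document sample in the assumed PubTator-style layout:
#   PMID|t|title, PMID|a|abstract, then tab-separated annotation lines.
SAMPLE = (
    "1001|t|BRCA1 mutations in breast cancer.\n"
    "1001|a|BRCA1 is linked to breast cancer. Other genes were not tested.\n"
    "1001\t0\t5\tBRCA1\tGene\t672\n"
    "1001\t19\t32\tbreast cancer\tDisease\tMESH:D001943\n"
    "\n"
    "1002|t|A note on aspirin.\n"
    "1002|a|Aspirin reduces fever.\n"
    "1002\t10\t17\taspirin\tChemical\tMESH:D001241\n"
)


def parse_documents(text):
    """Yield (pmid, passage_text) per document; annotation lines are skipped."""
    for block in text.strip().split("\n\n"):
        pmid, parts = None, []
        for line in block.splitlines():
            match = re.match(r"^(\d+)\|([ta])\|(.*)$", line)
            if match:
                pmid = match.group(1)
                parts.append(match.group(3))
        if pmid is not None:
            yield pmid, " ".join(parts)


def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]


def build_database(text, path=":memory:"):
    """Store one row per sentence in a (hypothetical) `sentence` table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE sentence (pmid TEXT, position INTEGER, text TEXT)")
    for pmid, passage in parse_documents(text):
        rows = [(pmid, i, s) for i, s in enumerate(split_sentences(passage))]
        conn.executemany("INSERT INTO sentence VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_database(SAMPLE)
    query = "SELECT pmid, position, text FROM sentence ORDER BY pmid, position"
    for row in conn.execute(query):
        print(row)
```

For a real snapshot you would stream the downloaded file document by document instead of holding it in memory, and pass a file path to `build_database` rather than `:memory:`.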

jambo6 commented 2 years ago

I was after the hand labelled train/dev/test sentences to bolster my dataset for a similar RE project, not the entire pubtator db. Would it be okay for me to use these and if so, is there a straightforward method to download just these sentences with hand labellings?

danich1 commented 2 years ago

Sure. I can't guarantee that train.xlsx exists or has many sentences annotated, but here are quick links to the data currently available:

Compound Treats Disease Train, Compound Treats Disease Dev, Compound Treats Disease Test

Disease Associates Gene Dev, Disease Associates Gene Test

Gene interacts Gene Train, Gene interacts Gene Dev, Gene interacts Gene Test

Compound binds Gene would take a bit for me to get to you, so let me know if you need it.

jambo6 commented 2 years ago

So are there no hand-crafted labels for Disease Associates Gene Train?

danich1 commented 2 years ago

I forgot to upload it to this repository, but here is the file you requested: Disease Associates Gene Train