sb895 / Hallmarks-of-Cancer

Expert annotated Hallmarks of Cancer Corpus
GNU General Public License v3.0

data format for labels #1

Open gabriben opened 3 years ago

gabriben commented 3 years ago

Hi,

I'd like to use your dataset to reproduce some results in the ML-NET paper, but I am having trouble understanding how the label text files should be read.

Thank you

sb895 commented 3 years ago

Hi Gabriben,

Apologies, only just saw your message.

There are two folders, labels and text.

The "text" folder contains files with PubMed abstracts, split into one sentence per line (already tokenized). The file names are the PubMed IDs.

The "labels" folder contains the corresponding label file for each text file (both are named with the same PubMed ID). Each label file can contain multiple labels per sentence.

The sentence labels are separated by "<", and the multiple labels for a single sentence are separated by "AND".
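
A minimal sketch of how a label file could be read in Python, based only on the description above. The function names `parse_label_text` and `pair_sentences` are hypothetical, not part of the repository. It also assumes that a leading "<" produces an empty first segment that does not correspond to a sentence, that "AND" is surrounded by whitespace, and that there is one label group per sentence line. Please check these assumptions against the actual files.

```python
import re


def parse_label_text(raw: str) -> list[list[str]]:
    """Split the raw contents of one label file into per-sentence label lists."""
    # Sentence-level label groups are separated by "<".
    segments = [seg.strip() for seg in raw.split("<")]
    # Assumption: if the file starts with "<", the empty piece before it
    # is not a sentence and can be dropped.
    if segments and segments[0] == "":
        segments = segments[1:]
    # Multiple labels for one sentence are separated by "AND".
    # Assumption: "AND" is a standalone uppercase token surrounded by whitespace.
    return [
        [lab for lab in re.split(r"\s+AND\s+", seg) if lab]
        for seg in segments
    ]


def pair_sentences(text_raw: str, labels_raw: str) -> list[tuple[str, list[str]]]:
    """Pair each sentence line of a text file with its label list."""
    sentences = [line for line in text_raw.splitlines() if line.strip()]
    labels = parse_label_text(labels_raw)
    # Assumption: exactly one label group per sentence line.
    if len(sentences) != len(labels):
        raise ValueError(
            f"{len(sentences)} sentences but {len(labels)} label groups"
        )
    return list(zip(sentences, labels))
```

To use it, read the text file and the label file that share a PubMed ID into two strings and pass them to `pair_sentences`. The length check is there so that a wrong guess about the format fails loudly instead of silently misaligning sentences and labels.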

Hope that helps. Let me know if not.