thunlp / JointNRE

Joint Neural Relation Extraction with Text and KGs
MIT License
187 stars 36 forks

Datasets for JointNRE #3

Closed AdityaAS closed 6 years ago

AdityaAS commented 6 years ago

Hi,

In the paper, it is mentioned that the datasets used for KGC are FB15K and NYT-FB15K and the datasets used for RE are FB60K and NYT-FB60K, but the dataset link provided contains only one dataset, i.e., FB60K and the corresponding sentences.

Would it be possible for you to share the NYT-FB15K and NYT-FB15K-237 datasets as well? I have downloaded the FB15K and FB15K-237 datasets from https://everest.hds.utc.fr/doku.php?id=en:transe and https://www.microsoft.com/en-us/download/details.aspx?id=52312 respectively.

What I need is the KG-Text aligned dataset for FB15K and FB15K-237 as described in the "Initialization and Implementation Details" section of your paper in order to reproduce the results for the link prediction task.

THUCSTHanxu13 commented 6 years ago

The released code and datasets mainly focus on the relation extraction task. I will release the corpus aligned to FB15K after my mid-term examination = =!

AdityaAS commented 6 years ago

Oh. All the best for your mid-term :+1:

In the meantime, I will try to generate the text-KG aligned dataset for FB15K and FB15K-237 using the NYT corpus. But I have a few questions in order to do that.

  1. Which year's data in the NYT corpus have you used?
  2. Did you just do anchor-text-to-entity-name matching, or was there some pre-processing / were there heuristics involved? If so, it would be great if you could mention them.
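
For context, the naive surface-form matching I have in mind is roughly the following (a minimal sketch; the entity-name mapping and file formats here are my own assumptions, not anything from JointNRE):

```python
# Naive sketch of KG-text alignment by surface-form matching:
# keep a sentence for a triple if the names of both its head and
# tail entities appear in the sentence text.
# Assumes: entity2name maps Freebase MIDs to canonical names,
# triples is a list of (head, relation, tail) pairs from FB15K,
# and sentences is an iterable of raw NYT sentences.

def align(triples, entity2name, sentences):
    """Return (head, relation, tail, sentence) tuples where the
    sentence mentions both the head and tail entity names."""
    aligned = []
    for sent in sentences:
        lower = sent.lower()
        for h, r, t in triples:
            h_name = entity2name.get(h, "").lower()
            t_name = entity2name.get(t, "").lower()
            if h_name and t_name and h_name in lower and t_name in lower:
                aligned.append((h, r, t, sent))
    return aligned

# Toy example (made-up MIDs and sentences)
triples = [("/m/0f8l9c", "capital", "/m/05qtj")]
entity2name = {"/m/0f8l9c": "France", "/m/05qtj": "Paris"}
sentences = ["Paris is the capital of France.", "Berlin is in Germany."]
print(align(triples, entity2name, sentences))
```

Obviously this ignores entity disambiguation and aliases, which is why I am asking whether extra heuristics were involved.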
AdityaAS commented 6 years ago

Also, I noticed that the vec.txt file located in the origin_data/text/ folder contains word embeddings of dimensionality 50. It would be great if you could also tell me which corpus word2vec was trained on so that I can generate 100-dimensional word embeddings for the same vocabulary.

THUCSTHanxu13 commented 6 years ago

We follow the settings in https://github.com/thunlp/NRE to prepare the corpus used for text-entity alignment and for training word embeddings. The pre-trained word embeddings are learned from the New York Times Annotated Corpus (LDC2008T19), which must be obtained from LDC (https://catalog.ldc.upenn.edu/LDC2008T19). Our alignments are also based on this data. Because of the license, I cannot directly release it; you may be able to download it yourself.

THUCSTHanxu13 commented 6 years ago

FB60K is mainly based on the NYT10 dataset, originally released with the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text." (http://iesl.cs.umass.edu/riedel/ecml/) That dataset can be downloaded directly. As mentioned above, I will release NYT-FB15K. However, if you want the whole of LDC2008T19, you may need to contact the data authors yourself.

AdityaAS commented 6 years ago

Thanks for the info.

gloryVine commented 4 years ago

> FB60K is mainly based on the dataset NYT10, which is originally released by the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text." (http://iesl.cs.umass.edu/riedel/ecml/) This data set can be downloaded directly. Above all, I will release NYT-FB15K. However, if you want to get the whole LDC2008T19, you may need to contact with the data authors by yourself.

Where did you release NYT-FB15K?

gloryVine commented 3 years ago

I am still interested in the dataset, @THUCSTHanxu13.