leekum2018 opened this issue 2 months ago
Hi @leekum2018,
Thank you for your interest in our work. These files can now be downloaded from this link: https://osf.io/58a3t/ .
Feel free to contact me if you have any further questions!
Thank you for your helpful reply! I have another question. If I want to apply your method to HotpotQA, is it proper to first use the spaCy NER tool and the TAGME entity-linking tool to identify the entities within the passages, then extract the KG triples between these entities from Wikidata, and finally use the DocuNet checkpoint you provide to extract intra-relations from the passages? Concretely, I have something like the sketch below in mind.
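Here is a minimal sketch of the entity identification and linking steps, assuming the `tagme` PyPI client and spaCy's small English model; the gcube token and the rho threshold are placeholders, and the DocuNet inference step is omitted:

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm
import tagme  # pip install tagme

# Placeholder: TAGME requires a (free) D4Science gcube token.
tagme.GCUBE_TOKEN = "<your-gcube-token>"

nlp = spacy.load("en_core_web_sm")

def extract_entities(passage, rho_threshold=0.2):
    """Collect entity mentions with spaCy NER and link them with TAGME."""
    spacy_ents = [(ent.text, ent.label_) for ent in nlp(passage).ents]
    # TAGME links mentions to Wikipedia titles; rho scores link confidence.
    tagme_anns = tagme.annotate(passage)
    linked = [(ann.begin, ann.end, ann.entity_title, ann.score)
              for ann in tagme_anns.get_annotations(rho_threshold)]
    return spacy_ents, linked

passage = ("Kibar Feyzo is a 1978 Turkish comedy film "
           "directed by Atıf Yılmaz.")
print(extract_entities(passage))
```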
I have another question, about the intra-relations. I manually checked the intra-relations for the first three examples of the 2WikiHop dev set, and I am curious why some relations embedded in the passages are not captured by DocuNet. For example, these questions all ask about the nationality of a person or the country of a movie, and there are explicit mentions in the passages; however, they are not captured. Here is the third example of the 2WikiHop dev set.
Question: Are the movies Kibar Feyzo and Forever Friends (Film), from the same country?
Text_1: <e> Kibar Feyzo </e> is a <e> 1978 </e> <e> Turkish </e> <e> comedy film </e> directed by <e> Atıf Yılmaz </e>
We can see that the text explicitly states the question entity's country of origin, i.e. <Kibar Feyzo, country, Turkey>; however, this triple is not contained in the list of intra-relations. So I am curious about the low recall of DocuNet and whether it will negatively affect the final performance (although the Wikidata relations can complement this information).
Hi @leekum2018,
Sorry for the late reply. For your first question, the answer is yes: if you want to apply our framework to the HotpotQA dataset, you need to go through the steps you mentioned to train the model.
Moreover, the reason why some relations cannot be captured by DocuNet is that the DocuNet model is not directly trained on the experimental datasets we used. Instead, it is trained on the REBEL dataset and then transferred to our experimental datasets, so it may miss some relations; we use triples from Wikidata to compensate for this imperfection. One minimal way to retrieve such Wikidata triples between two linked entities is sketched below.
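This sketch queries the public Wikidata SPARQL endpoint via `requests`; the QID for Kibar Feyzo is a hypothetical placeholder you would need to look up (Q43 is Turkey):

```python
import requests  # pip install requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def relations_between(qid_head, qid_tail):
    """Return the direct Wikidata properties linking head -> tail."""
    query = f"""
    SELECT ?prop ?propLabel WHERE {{
      wd:{qid_head} ?p wd:{qid_tail} .
      ?prop wikibase:directClaim ?p .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}"""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "triple-extraction-demo/0.1"},
    )
    resp.raise_for_status()
    return [(b["prop"]["value"], b["propLabel"]["value"])
            for b in resp.json()["results"]["bindings"]]

# Placeholder QID: look up the real identifier for Kibar Feyzo first.
# If Wikidata has the statement, this surfaces "country of origin" (P495).
KIBAR_FEYZO_QID = "Q000000"
print(relations_between(KIBAR_FEYZO_QID, "Q43"))  # Q43 = Turkey
```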
Additionally, due to the lack of labeled data, we do not directly evaluate the recall of DocuNet for extracting intra-relations. However, our qualitative analysis also showed that the quality of the extracted triples is the bottleneck of our model. In our follow-up work, https://github.com/jyfang6/trace, we use an LLM to extract knowledge triples, and we found that this approach extracts higher-quality triples and therefore leads to better performance. Feel free to explore this approach if you're interested; a rough illustration of the idea is sketched below.
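This is only an illustration of LLM-based triple extraction, not the TRACE implementation; the prompt, model name, and output format are all assumptions:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt: ask for one (head | relation | tail) triple per line.
PROMPT = (
    "Extract all knowledge triples stated in the passage, one "
    "(head | relation | tail) triple per line.\n\nPassage: {passage}"
)

def llm_extract_triples(passage, model="gpt-4o-mini"):
    """Ask an LLM to list the triples stated in the passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0,
    )
    lines = response.choices[0].message.content.splitlines()
    return [tuple(part.strip() for part in line.strip("() ").split("|"))
            for line in lines if "|" in line]

passage = "Kibar Feyzo is a 1978 Turkish comedy film directed by Atıf Yılmaz."
print(llm_extract_triples(passage))
# Output is model-dependent, e.g. ('Kibar Feyzo', 'country', 'Turkey').
```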
As I would like to reproduce the pipeline, would you mind providing `*_with_triples.pkl` and `*_with_pred_triples.pkl`? I can only find `*_with_relevant_triples_wounkrel.pkl` on the web drive. Thank you very much!