Reason for edge masking pid items in PredPath implementation

himahuja commented 6 years ago

From my reading of the code, there are some discrepancies between the provided implementation of PredPath and the original method as described in the publication.

Firstly, all the relations which match pid are masked, whereas, in the paper they use the presence and absence of pid as class labels.
Secondly, class label information (to be used as ground truth y) is used to mark positive and negative features in the implementation.

I might be wrong in my understanding of the code. If that's the case, please provide some clarification on how the implementation matches up to the paper. Thank you for this repo!

shiralkarprashant commented 6 years ago

Sorry for the delay in my response. Please find my answers below.

Firstly, all the relations which match pid are masked, whereas, in the paper they use the presence and absence of pid as class labels.

Let me answer this in two steps.

A) Yes, in the paper, they use the presence and absence of pid as class labels, and the pairs are determined on the fly from the graph based on the presence/absence of edges of the relation. In our test datasets, we have taken a slightly different approach for ease of evaluation: we pre-assembled positive and negative node pairs for each relation (test dataset) and used the ground truth (1/0) as supervision. See https://github.com/shiralkarprashant/knowledgestream/blob/master/algorithms/predpath/predpath_mining.py#L66.

B) Now, regarding your question about masking: the paper removes all edges of the given relation (test dataset) before training the model in order to test its ability for fact checking. It thus uses a modified graph; see the first paragraph under "IV B. Experiment Setting" in the original paper https://arxiv.org/pdf/1510.05911.pdf. In our implementation, the masking is done to identify and remove such edges (https://github.com/shiralkarprashant/knowledgestream/blob/master/algorithms/predpath/predpath_mining.py#L69) and thereby create this modified graph. Thus, although there may be minor differences in the implementation, the approach and the ultimate evaluation are the same.
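
As a rough illustration of the masking idea, here is a toy sketch. It is not the code in predpath_mining.py; the triple format, the helper name mask_relation, and the example data are all assumptions made up for this example. It shows the modified graph (all edges of the test relation removed) alongside pre-assembled node pairs that carry their own 1/0 labels.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object). This representation is an
# assumption for the sketch, not the repo's actual graph format.
Triple = Tuple[str, str, str]


def mask_relation(triples: List[Triple], pid: str) -> List[Triple]:
    """Return a copy of the graph with every edge of predicate `pid` removed."""
    return [t for t in triples if t[1] != pid]


# Toy knowledge graph (hypothetical data).
triples = [
    ("Rome", "capitalOf", "Italy"),
    ("Paris", "capitalOf", "France"),
    ("Rome", "locatedIn", "Italy"),
    ("Paris", "locatedIn", "France"),
    ("Lyon", "locatedIn", "France"),
]

# Pre-assembled test pairs for the relation, each with a ground-truth
# label (1 = true fact, 0 = false fact).
labeled_pairs = [
    ("Rome", "Italy", 1),
    ("Paris", "France", 1),
    ("Lyon", "France", 0),
]

pid = "capitalOf"

# Modified graph: no edge of the test relation survives, so a model
# cannot simply look the answer up.
masked = mask_relation(triples, pid)

# Supervision comes from the pre-assembled labels, not from the graph.
y = [label for _, _, label in labeled_pairs]
```

In this sketch, path features for each pair would be mined from `masked` only, while `y` plays the role of the class labels.
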
Class label information (to be used as ground truth y) is used to mark positive and negative features in the implementation.

I am not sure I completely understand this comment. In general, the class label information (either determined by the presence/absence of edges of a given relation, or by manually assembling the positive and negative node pairs of a relation) is used as supervision while training, and as ground truth for validation during testing. Please let me know which part of the implementation caused the confusion so that I can help better.
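
To make the two labeling routes concrete, here is a small hypothetical sketch (the function names, the triple format, and the data are assumptions for illustration, not taken from the repo). Labels can be read off the original graph by checking whether the edge exists, or supplied with manually assembled pairs; either way they serve as y during training and as ground truth when scoring predictions.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), assumed format
Pair = Tuple[str, str]         # (subject, object)


def labels_from_graph(triples: Set[Triple], pid: str, pairs: List[Pair]) -> List[int]:
    """Label a pair 1 if the edge (s, pid, o) is present in the graph, else 0."""
    return [1 if (s, pid, o) in triples else 0 for s, o in pairs]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy original (unmasked) graph; hypothetical data.
graph = {
    ("Rome", "capitalOf", "Italy"),
    ("Paris", "capitalOf", "France"),
    ("Lyon", "locatedIn", "France"),
}
pairs = [("Rome", "Italy"), ("Paris", "France"), ("Lyon", "France")]

# Route 1: labels determined on the fly from edge presence/absence.
y_graph = labels_from_graph(graph, "capitalOf", pairs)

# Route 2: labels that come with manually assembled pairs.
y_manual = [1, 1, 0]

# Made-up model output, scored against the ground truth.
y_pred = [1, 0, 0]
score = accuracy(y_manual, y_pred)
```

For a consistently assembled dataset, both routes should give the same labels, which is why the choice between them does not change the evaluation.
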

Hope this helps. Thanks for using the repo! Hope you find it useful. And of course, if you have suggestions to make it better, please drop me a line or open a new issue. Thanks!

himahuja commented 6 years ago

Your answer to the first issue answers both of my queries! Thank you for your comprehensive response and your time. This repo serves as an amazing benchmark model for computational fact-checking. Again, thank you for this repository!
