haitian-sun / GraftNet

BSD 2-Clause "Simplified" License

"passages" and "pagerank_score" fields in JSON files! #12

Closed (saedr closed this issue 4 years ago)

saedr commented 4 years ago

Hi,

I tried to use your preprocessing scripts to generate the files required for training. However, the final output of the preprocessing steps differs from what is available through the download link, and data_loader.py cannot handle that output. One clear difference is that the downloaded preprocessed files contain a field called passages, while the output of the preprocessing steps lacks that field and instead includes another field called pagerank_score.
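A mismatch like this can be checked mechanically. The following is a minimal sketch, assuming both files store one JSON object per line (an assumption, since the thread does not state the file layout). The helper names `field_names`, `fields_in_lines`, and `compare_fields` are hypothetical, and the toy records only stand in for the real data; where passages and pagerank_score actually sit in the GraftNet records is not specified here, so the sketch collects field names at every nesting depth.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Set


def field_names(value: Any, found: Optional[Set[str]] = None) -> Set[str]:
    """Collect every dictionary key in a parsed JSON value, at any depth."""
    if found is None:
        found = set()
    if isinstance(value, dict):
        for key, child in value.items():
            found.add(key)
            field_names(child, found)
    elif isinstance(value, list):
        for child in value:
            field_names(child, found)
    return found


def fields_in_lines(lines: Iterable[str]) -> Set[str]:
    """Union of field names over JSON-lines records, skipping blank lines."""
    found: Set[str] = set()
    for line in lines:
        line = line.strip()
        if line:
            field_names(json.loads(line), found)
    return found


def compare_fields(downloaded: Iterable[str],
                   generated: Iterable[str]) -> Dict[str, List[str]]:
    """Report the field names that appear in only one of the two inputs."""
    a, b = fields_in_lines(downloaded), fields_in_lines(generated)
    return {
        "only_in_downloaded": sorted(a - b),
        "only_in_generated": sorted(b - a),
    }


if __name__ == "__main__":
    # Toy records (hypothetical, not the real GraftNet schema).
    downloaded = ['{"id": 1, "question": "q", "passages": []}']
    generated = ['{"id": 1, "entities": [{"text": "e", "pagerank_score": 0.5}]}']
    print(compare_fields(downloaded, generated))
```

To run it on real files, pass open file handles instead of the toy lists, for example `compare_fields(open(path_a), open(path_b))`, where both paths are placeholders for the downloaded and the locally generated file.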

Could you please elaborate on these fields?

Thank you very much!

bdhingra commented 4 years ago

Hi, the preprocessing code we originally added only included the steps for constructing the sub-graph from the KB. To add text to the sub-graph, a Lucene pipeline needs to be run.

We have now added the code for preprocessing the WikiMovies dataset in the wikimovie_preprocessing directory, which includes the Lucene pipeline. You can follow a similar procedure to run it on the WebQuestionsSP dataset as well.