mswellhao / PacSum

Unsupervised Extractive Summarization based on Position-Augmented Centrality

Test set #7

Open DishaJindal opened 4 years ago

DishaJindal commented 4 years ago

Hi, thanks for sharing the repo and the dataset. Would it be possible to share the document ids of the documents in the test split ("nyt.test.h5df") of the NYT dataset?

kgarg8 commented 3 years ago

Not sure what you mean by document ids!!

Here's a sample script to read the h5df file:

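import json

import h5py

filename = "../data/NYT/nyt.test.h5df"

# Open the file read-only and load everything under the first top-level key
with h5py.File(filename, "r") as f:
    a_group_key = list(f.keys())[0]
    data = list(f[a_group_key])

# Each entry is a JSON-encoded record; decode the first one
res = json.loads(data[0])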

Run res.keys() to see the available keys. You can then extract data in the terminal like this:

res['article'][0]
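
If the goal is to check whether the records carry any identifier field at all, one option is to decode every record and collect the top-level keys. This is only a minimal sketch: decode_records and collect_keys are hypothetical helper names, and the sample record is made up for illustration (the "abstract" key in it is an assumption, not something confirmed in this thread). In practice you would pass in the data list read from the file.

```python
import json


def decode_records(raw_records):
    """Decode an iterable of JSON-encoded records (bytes or str) into dicts."""
    decoded = []
    for raw in raw_records:
        # h5py may return bytes objects; normalise to text before parsing
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        decoded.append(json.loads(raw))
    return decoded


def collect_keys(records):
    """Return the sorted union of top-level keys across all decoded records."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return sorted(keys)


# Stand-in data for illustration only (hypothetical record, not from the dataset).
# With the real file, use the `data` list read via h5py instead of `sample`.
sample = [b'{"article": ["First sentence.", "Second sentence."], "abstract": ["Summary."]}']
records = decode_records(sample)
print(collect_keys(records))
```

If no id-like key shows up in the result, the position of a record in the list is the only identifier this approach can offer.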