spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

[docs] Data is not JSONL #9

Closed juharris closed 6 years ago

juharris commented 6 years ago

The docs say that the 2.1 data is in JSONL format but the training data I downloaded from http://www.msmarco.org/dataset.aspx is not:

$ wc -l train_v2.1.json
0 train_v2.1.json

Similarly:

i = 0
with open('train_v2.1.json') as f:
    for l in f:
        i += 1
print(i) # 1
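
The two counts differ (0 versus 1) because `wc -l` counts newline characters, while iterating over a Python file object also yields a final line that has no trailing newline. A minimal sketch of that behaviour, using a small hypothetical stand-in file rather than the real `train_v2.1.json`:

```python
import os
import tempfile

# Hypothetical stand-in for train_v2.1.json: a single JSON document
# written without a trailing newline.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"query": {"0": "example"}}')
    path = tmp.name

# Count newline characters, which is what `wc -l` reports.
with open(path, "rb") as f:
    newline_count = f.read().count(b"\n")

# Count lines the way the loop above does.
with open(path) as f:
    line_count = sum(1 for _ in f)

os.remove(path)
print(newline_count, line_count)
```

Either way, the whole file is a single JSON document rather than one record per line.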

spacemanidol commented 6 years ago

Whoops. My oversight. You are correct. The files are in JSON format, but the tojsonl utility will convert them to JSONL.

I have updated the docs to reflect this better. Thanks.
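
For readers who want to see what such a conversion involves, here is a rough sketch. It is not the repository's tojsonl code. It assumes the JSON file holds one object keyed by field name, where each field maps a row key to a value (a column-oriented layout); that layout is an assumption and is not confirmed in this thread. The function names `columns_to_jsonl` and `convert_file` are hypothetical.

```python
import json


def columns_to_jsonl(columns, out_path):
    """Write one JSON object per line from a column-oriented mapping.

    Assumes `columns` looks like {field: {row_key: value, ...}, ...}
    and that every field shares the same row keys.
    """
    if not columns:
        # Nothing to write; create an empty output file.
        open(out_path, "w", encoding="utf-8").close()
        return 0
    # Take the row keys from the first field.
    row_keys = list(next(iter(columns.values())).keys())
    with open(out_path, "w", encoding="utf-8") as out:
        for key in row_keys:
            row = {field: values[key] for field, values in columns.items()}
            out.write(json.dumps(row) + "\n")
    return len(row_keys)


def convert_file(in_path, out_path):
    """Load a whole JSON file and rewrite it as JSONL."""
    with open(in_path, encoding="utf-8") as f:
        columns = json.load(f)
    return columns_to_jsonl(columns, out_path)
```

A call such as `convert_file('train_v2.1.json', 'train_v2.1.jsonl')` would then produce a file that can be read line by line, provided the layout assumption holds. Note that `json.load` reads the whole input into memory at once.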