mingzhu0527 / MASHQA

Apache License 2.0

incorrect sentence_starts in data? #8

Open TalitaGroetzinger opened 1 year ago

TalitaGroetzinger commented 1 year ago

Hi!

I am confused about the ['sent_starts'] in the data files:

data["data"][3]["paragraphs"][0]["sent_starts"]
[[0, 46], [47, 76], [124, 110], [235, 484], [720, 679], [1400, 85], [1486, 24], [1511, 53], [1565, 226], [1792, 30], [1823, 123], [1947, 20], [1968, 192], [2161, 77], [2239, 152], [2392, 147], [2540, 46], [2587, 51], [2639, 63], [2703, 121], [2825, 265], [3091, 218], [3310, 105], [3416, 279]]

I assumed that the first element of each sublist is the start offset of a sentence in 'context' and the second element is its end offset, so the second element should always be larger than the first. However, that already fails for [124, 110] and [720, 679], and from [1400, 85] onward it fails for every pair. Why is that?
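One reading that fits these numbers is that each pair is [start, length] rather than [start, end]: in this example, every start plus its second element plus 1 equals the next start. This is only an inference from the values above, not something confirmed by the dataset authors. Below is a minimal sketch of that check; the helper names and the one-character gap between sentences are my own assumptions.

```python
# sent_starts values copied from the example above.
sent_starts = [
    [0, 46], [47, 76], [124, 110], [235, 484], [720, 679], [1400, 85],
    [1486, 24], [1511, 53], [1565, 226], [1792, 30], [1823, 123], [1947, 20],
    [1968, 192], [2161, 77], [2239, 152], [2392, 147], [2540, 46], [2587, 51],
    [2639, 63], [2703, 121], [2825, 265], [3091, 218], [3310, 105], [3416, 279],
]


def consistent_with_start_length(pairs, gap=1):
    """Return True if start + second element + gap equals the next start
    for every adjacent pair (gap=1 is an assumed single separator character)."""
    return all(
        start + length + gap == next_start
        for (start, length), (next_start, _) in zip(pairs, pairs[1:])
    )


def to_spans(pairs):
    """Convert assumed [start, length] pairs to (start, end) tuples, end exclusive."""
    return [(start, start + length) for start, length in pairs]


print(consistent_with_start_length(sent_starts))  # True
print(to_spans(sent_starts)[:3])  # [(0, 46), (47, 123), (124, 234)]
```

If this reading is right, context[start:start + length] should return exactly one sentence. I could not verify that here because the 'context' string is not included above.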

Thanks a lot!