mandarjoshi90 / coref

BERT for Coreference Resolution
Apache License 2.0

question about subtoken_map #30

Closed Naman-ntc closed 4 years ago

Naman-ntc commented 4 years ago

Hi, I wanted to ask about subtoken_map in the jsonlines files. In trial.jsonlines:

{"doc_key": "bn/voa/02/voa_0210_0", 
"sentences": [["[CLS]", "Meanwhile", "Prime", "Minister", "E", "##hu", "##d", "Bar", "##ak", "told", "Israeli", "television", "he", "doubts", "a", "peace", "deal", "can", "be", "reached", "before", "Israel", "'", "s", "February", "6th", "election", ".", "He", "said", "he", "will", "now", "focus", "on", "suppress", "##ing", "Palestinian", "violence", ".", "[SEP]"]], 
"speakers": [["[SPL]", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "[SPL]"]], 
"clusters": [[]], 
"sentence_map": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2], 
"subtoken_map": [0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 31, 32, 33, 33]
}

The subtoken_map value is the same (33) for the final "." and "[SEP]" tokens, whereas in the README example the full stop and the [SEP] token get different values. Which of the two conventions is expected?

mandarjoshi90 commented 4 years ago

Please use the one in trial.jsonlines. I've fixed the README. Thanks for catching this.
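
For anyone building their own jsonlines input, here is a minimal sketch of the convention seen in the trial.jsonlines example above: every word piece carries the index of the original word it came from, [CLS] reuses the index of the first word, and [SEP] reuses the index of the last word. The function name `build_subtoken_map` and its input format (one list of word pieces per original word) are assumptions made for illustration. This is inferred from the example data for a single segment, not taken from the repository's preprocessing code.

```python
def build_subtoken_map(words_as_pieces):
    """Hypothetical helper: words_as_pieces holds one list of word pieces
    per original word, e.g. [["E", "##hu", "##d"], ["Bar", "##ak"]]."""
    tokens = ["[CLS]"]
    subtoken_map = [0]  # [CLS] shares the index of the first word
    for word_idx, pieces in enumerate(words_as_pieces):
        for piece in pieces:
            tokens.append(piece)
            subtoken_map.append(word_idx)  # all pieces map to their word
    tokens.append("[SEP]")
    subtoken_map.append(subtoken_map[-1])  # [SEP] shares the last word's index
    return tokens, subtoken_map


# First six words of the trial.jsonlines example, already split into pieces.
words = [["Meanwhile"], ["Prime"], ["Minister"],
         ["E", "##hu", "##d"], ["Bar", "##ak"], ["told"]]
tokens, subtoken_map = build_subtoken_map(words)
print(list(zip(tokens, subtoken_map)))
```

On this truncated input the map is `[0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 5]`. Its first ten entries match the start of the subtoken_map in the issue, and the trailing repeated 5 shows [SEP] sharing the index of the last word, as 33 does in the full example.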