mandarjoshi90 / coref

BERT for Coreference Resolution

How do you deal with WordPiece tokenization? #25

Closed · RichardHWD closed this 4 years ago

RichardHWD commented 4 years ago

BERT's WordPiece tokenizer divides a word into multiple subwords. Do you merge them somehow into a single representation, or do you keep the span longer and predict more tokens? And where is this handled in the code in this repo? It has bothered me for a long time...

lujiaying commented 4 years ago

I have the same confusion.

Could you also provide an example of dealing with long documents? For instance, how should the sentences be split, and where should [SEP] be added (at the end of each sentence within an article)?

lujiaying commented 4 years ago

I tried a multiple-sentence case in the following way, and it seems to work.

{"clusters": [], 
  "doc_key": "bn", "sentences": [["[CLS]", "Meanwhile", "Prime", "Minister", "E", "##hu", "##d", "Bar", "##ak", "told", "Israeli", "television", "he", "doubts", "a", "peace", "deal", "can", "be", "reached", "before", "Israel", "'", "s", "February", "6th", "election", ".", "[SEP]"], ["[CLS]", "He", "said", "he", "will", "now", "focus", "on", "suppress", "##ing", "Palestinian", "violence", ".", "[SEP]"]], 
  "speakers": [["[SPL]", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "[SPL]"], ["[SPL]", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "[SPL]"]], 
  "sentence_map": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
  "subtoken_map": [0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 23, 0, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 10]
}
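For reference, the fields above can be produced roughly like this. This is a minimal sketch assuming the Hugging Face `transformers` tokenizer; it is not the repo's own preprocessing:

```python
# Minimal sketch of building the JSON fields above with a WordPiece
# tokenizer. Assumes the Hugging Face `transformers` package; this is
# NOT the repo's preprocessing code.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def build_example(sentences, doc_key="bn"):
    """sentences: a list of sentences, each a list of words."""
    example = {"clusters": [], "doc_key": doc_key, "sentences": [],
               "speakers": [], "sentence_map": [], "subtoken_map": []}
    for sent_idx, words in enumerate(sentences):
        # [CLS] opens each segment; here it maps to the first word (index 0).
        subtokens, speakers, subtoken_map = ["[CLS]"], ["[SPL]"], [0]
        for word_idx, word in enumerate(words):
            pieces = tokenizer.tokenize(word)  # e.g. "Ehud" -> ["E", "##hu", "##d"]
            subtokens.extend(pieces)
            speakers.extend(["-"] * len(pieces))
            # every subtoken of a word points back to that word's index
            subtoken_map.extend([word_idx] * len(pieces))
        # [SEP] closes the segment; it maps to the last word, as in the example.
        subtokens.append("[SEP]")
        speakers.append("[SPL]")
        subtoken_map.append(len(words) - 1)
        example["sentences"].append(subtokens)
        example["speakers"].append(speakers)
        example["sentence_map"].extend([sent_idx] * len(subtokens))
        example["subtoken_map"].extend(subtoken_map)
    return example
```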
RichardHWD commented 4 years ago

For example, "E", "##hu", "##d" are subwords of "Ehud", and they might be linked to "he" somewhere. I wonder whether computing a score between an antecedent and a span containing these three subwords ("E", "##hu", "##d") is more complex than for a span with only a single word ("Ehud"), since the former has to handle a longer representation. So you use the first solution, but how do you get subtoken_map? Do you have data pre-processing code? OntoNotes 5.0 doesn't contain such a tag.

mandarjoshi90 commented 4 years ago

Please look at minimize.py, which does the bookkeeping.

https://github.com/mandarjoshi90/coref/blob/master/minimize.py
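
The gist of the bookkeeping is that mention spans live over subtokens, and subtoken_map translates a predicted subtoken span back to word indices. A minimal sketch of that idea (a paraphrase, not the code in minimize.py):

```python
# Sketch: mapping a predicted subtoken span back to word indices
# via subtoken_map; not the exact code in minimize.py.
def subtoken_span_to_words(start, end, subtoken_map):
    """start/end are inclusive subtoken indices; returns word indices."""
    return subtoken_map[start], subtoken_map[end]

# In the first sentence of the example above, the span ["E", "##hu", "##d"]
# sits at subtoken positions 4..6 and maps back to the single word "Ehud"
# at word index 3:
subtoken_map = [0, 0, 1, 2, 3, 3, 3, 4, 4, 5]  # truncated from the example
print(subtoken_span_to_words(4, 6, subtoken_map))  # (3, 3)
```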

RichardHWD commented 4 years ago

@mandarjoshi90 Thank you so much!