nlpyang / PreSumm

Code for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders"
MIT License

Why tokenizing 2 times? #66

Open astariul opened 4 years ago

astariul commented 4 years ago

Data is tokenized 2 times:

  1. With Stanford CoreNLP : https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L110

  2. With HuggingFace's BertTokenizer : https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L246

Isn't this double tokenization harmful?

I feel like the main reason behind this is legacy: the previous codebase used Stanford CoreNLP, so we keep using it in order to have comparable results. Am I right?
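For context, here is a minimal sketch of the two-stage preprocessing being discussed. The repo itself shells out to the Java Stanford CoreNLP tool in data_builder.py; stanza is used below purely as an illustrative stand-in, and the sample text is made up:

```python
# Rough sketch of the double tokenization discussed in this issue (not the repo's exact code).
# Stage 1 stands in for Stanford CoreNLP (word tokenization + sentence splitting) via stanza;
# stage 2 is HuggingFace's BertTokenizer producing WordPiece subwords.
import stanza
from transformers import BertTokenizer

stanza.download("en")  # one-time model download
corenlp_like = stanza.Pipeline("en", processors="tokenize")
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

text = "He paid $10,000 (in cash). The deal closed quickly."

# Stage 1: sentence split + word tokenize, roughly what CoreNLP's tokenize,ssplit annotators do.
doc = corenlp_like(text)
sentences = [[tok.text for tok in sent.tokens] for sent in doc.sentences]

# Stage 2: re-join each pre-tokenized sentence and run WordPiece on it,
# mirroring how the pre-tokenized output is later fed to BertTokenizer.
for words in sentences:
    print(words, "->", bert_tok.tokenize(" ".join(words)))
```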

ShirleyHan6 commented 4 years ago

It seems that Stanford CoreNLP is only used for splitting text into sentences, while BertTokenizer splits it into the subwords of the BERT vocabulary. https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L123

astariul commented 4 years ago

While I agree that BertTokenizer is required for subword tokens, Stanford CoreNLP does not only split sentences. For example, it converts ( into -lrb-.

Another example is $10,000, which CoreNLP tokenizes into $, 10, ,, 000.
BertTokenizer might tokenize it differently. That's why I think the double tokenization could be harmful.
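To make the concern concrete, a small illustrative snippet (the sentences are invented, and the exact word-piece splits depend on the bert-base-uncased vocabulary; the point is only that the two pipelines hand BERT different surface strings):

```python
# BertTokenizer sees different surface strings depending on whether CoreNLP-style
# PTB normalization (e.g. "(" -> "-lrb-") has already been applied.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

raw = "He paid $10,000 (cash)."
ptb_style = "He paid $ 10 , 000 -lrb- cash -rrb- ."  # roughly what CoreNLP/PTB tokenization emits

print(tok.tokenize(raw))        # "(" and ")" exist as single tokens in the BERT vocab
print(tok.tokenize(ptb_style))  # "-lrb-" / "-rrb-" typically get split into several word pieces
```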

fseasy commented 4 years ago

Because the standard summarization evaluation setup (probably inherited from Pointer-Generator) expects Stanford CoreNLP tokenization. Look at BART: it does not tokenize twice for training and inference, but it does run the StanfordNLP tokenizer before evaluation. So in my view, tokenizing twice here just makes evaluation easier.
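A sketch of that argument: for ROUGE scores to be comparable across systems, candidate and reference should pass through the same PTB/CoreNLP-style tokenization before scoring. The official setups typically use the original Perl ROUGE script; rouge_score and NLTK below are only stand-ins, and the sentences are made up:

```python
# Why shared tokenization matters for evaluation: tokenize candidate and reference
# the same PTB-like way, then score. rouge_score / NLTK stand in for the usual
# Perl ROUGE script and Stanford CoreNLP here.
import nltk
from nltk.tokenize import word_tokenize
from rouge_score import rouge_scorer

nltk.download("punkt")  # tokenizer data, one-time

def ptb_like(text: str) -> str:
    """Penn-Treebank-style word tokenization, re-joined with spaces."""
    return " ".join(word_tokenize(text))

reference = "The company was sold for $10,000 (a bargain)."
candidate = "The company sold for $10,000."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(ptb_like(reference), ptb_like(candidate)))
```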

wailoktam commented 4 years ago

Can anyone enlighten me on these tokenization questions:

  1. When we use a pretrained model, shouldn't we stick to the tokenization used during pretraining for the sake of consistency? I don't think the BERT documentation says anything about using Stanford CoreNLP. What justifies the Stanford CoreNLP tokenization step before applying the tokenizer that comes with "bert-base-uncased"?

  2. > Because the standard summarization evaluation setup (probably inherited from Pointer-Generator) expects Stanford CoreNLP tokenization. Look at BART: it does not tokenize twice for training and inference, but it does run the StanfordNLP tokenizer before evaluation. So in my view, tokenizing twice here just makes evaluation easier.

  Will this code work, or even work better, without the Stanford CoreNLP tokenization step? I don't quite understand the role of CoreNLP in evaluation, or whether evaluation here includes the validation run during training. If it does, and CoreNLP is necessary for it to work, then the code won't work if we remove the CoreNLP tokenization step.

  3. > It seems that Stanford CoreNLP is only used for splitting text into sentences, while BertTokenizer splits it into the subwords of the BERT vocabulary. https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L123

  If CoreNLP is only meant to be a sentence splitter, can we replace it with something that does sentence splitting but not word segmentation? As other replies point out, because CoreNLP does both, a conflict with the tokenizer that comes with BERT is suspected. spaCy has a sentence-splitting-only option, I think (see the sketch after this list). Does that mean a sentence-splitting step is needed before applying the BERT tokenizer?
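As an illustration of that substitution, a minimal sketch using spaCy's rule-based sentencizer for sentence splitting only, leaving all word-level work to BertTokenizer (untested against the paper's numbers; the sample text is made up):

```python
# Sentence splitting only (no word segmentation), then BERT's own tokenizer.
import spacy
from transformers import BertTokenizer

nlp = spacy.blank("en")        # tokenizer-only pipeline, no tagger/parser
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

text = "He paid $10,000. The deal closed (quickly)."
for sent in nlp(text).sents:
    print(bert_tok.tokenize(sent.text))
```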

I am basically interested in adapting this code to another language. If the role of each tokenization step were more clear-cut, I could probably do the conversion better. Some Japanese tokenizers come with sentence splitting, some don't. If the CoreNLP step is only meant for sentence splitting, then everything makes sense and I know what to do.

thamnt2611 commented 2 years ago

I found the author's answer in this issue: https://github.com/nlpyang/BertSum/issues/13