nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License

Regarding Separating Sentences. #48

Closed: tahmedge closed this issue 4 years ago

tahmedge commented 5 years ago

Hi, how did you separate the different sentences in an article? I mean, which token was used to separate them?

astariul commented 5 years ago

If the document is:

Sentence1. Sentence2. Sentence3.

Then the sentences will be delimited with [CLS] and [SEP] tokens, like this:

[CLS] Sentence1. [SEP] [CLS] Sentence2. [SEP] [CLS] Sentence3. [SEP]
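
A minimal sketch of that format, assuming the sentences are already available as a list of strings. The helper name `mark_sentences` is hypothetical and this is not the repository's actual preprocessing code:

```python
def mark_sentences(sentences):
    """Wrap each sentence in [CLS] ... [SEP] and join everything into one string.

    Hypothetical helper for illustration only.
    """
    return " ".join(f"[CLS] {sentence} [SEP]" for sentence in sentences)


marked = mark_sentences(["Sentence1.", "Sentence2.", "Sentence3."])
print(marked)
# [CLS] Sentence1. [SEP] [CLS] Sentence2. [SEP] [CLS] Sentence3. [SEP]
```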

tahmedge commented 5 years ago

Thanks for your reply. So does that mean the sentences are split on the ' . ' token?

astariul commented 5 years ago

No. Sorry, my answer was not clear.

In the dataset, there is one sentence per line, so the Stanford tokenizer uses the newline as the sentence separator:

https://github.com/nlpyang/PreSumm/blob/29a6b1ace2290808f39c76ae2ef0e92d515fc049/src/prepro/data_builder.py#L123-L125

Then, when the dataset is processed, the author puts [CLS] and [SEP] tokens around each sentence as separators.
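
As a rough illustration of the splitting step only, here is a sketch that treats every non-empty line as one sentence. It assumes the article is a plain string with one sentence per line; `split_on_newlines` is a hypothetical name, and the real pipeline delegates this to the Stanford tokenizer rather than to Python code like this:

```python
def split_on_newlines(article_text):
    """Return one sentence per non-empty line; periods are NOT used as separators.

    Hypothetical helper for illustration only.
    """
    return [line.strip() for line in article_text.splitlines() if line.strip()]


article = "Mr. Smith arrived at 5 p.m. today.\nHe left early.\n\nNobody noticed.\n"
sentences = split_on_newlines(article)
print(sentences)
# ['Mr. Smith arrived at 5 p.m. today.', 'He left early.', 'Nobody noticed.']
```

Note that the periods inside "Mr." and "p.m." do not cause a split, which is the point of the answer above: the line break, not the '.' token, marks the sentence boundary.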

tahmedge commented 5 years ago

Thanks. By the way, I am also confused about the decoding phase. My understanding is that at the end of the encoding phase we will have [CLS] [S1,T1] [S1,T2] ... [SEP] [CLS] [S2,T1] [S2,T2] ... [SEP] [CLS] [S3,T1] [S3,T2] ... [SEP] ... Here, S1, S2, S3 are the sentences, whereas T1, T2, ... are the tokens of each sentence. So during the decoding phase, doesn't the decoder focus on all the tokens, including [CLS], [Si,Tj], [SEP], etc.? Or does it focus only on the [CLS] token of each sentence? I did not find a detailed explanation of the decoding phase in the paper.

astariul commented 5 years ago

I'm not sure what you mean by encoding and decoding phase...

But anyway, only the tokens of the sentences are kept for the model. For extractive summarization, only the [CLS] token representations are used, while for abstractive summarization, all token representations are used.
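
To make that distinction concrete, here is a small sketch in plain Python. It assumes the encoder returns one vector per input token; the `encoder_output` list below is a stand-in with dummy values, and `cls_positions` is a hypothetical helper, not a function from the repository:

```python
def cls_positions(tokens):
    """Return the index of every [CLS] token in the sequence."""
    return [i for i, token in enumerate(tokens) if token == "[CLS]"]


tokens = "[CLS] a b [SEP] [CLS] c [SEP]".split()

# Stand-in for the encoder output: one dummy vector per token.
encoder_output = [[float(i)] for i in range(len(tokens))]

# Extractive: keep only the vectors at the [CLS] positions, one per sentence.
sentence_vectors = [encoder_output[i] for i in cls_positions(tokens)]

# Abstractive: the representations of all tokens are kept.
all_vectors = encoder_output

print(cls_positions(tokens))   # [0, 4]
print(len(sentence_vectors))   # 2
print(len(all_vectors))        # 7
```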

tahmedge commented 4 years ago

Are the newlines in the summary also replaced by a separator token?