google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Can we use BERT for Punctuation Prediction? #346

Open dalonlobo opened 5 years ago

dalonlobo commented 5 years ago

Can we use the pre-trained BERT model for Punctuation Prediction on conversational speech? Let's say, for punctuating ASR output?

cvenour commented 5 years ago

I'm also interested in doing this.

cvenour commented 4 years ago

I don't think BERT can be used to predict next words (i.e. to be used as a Language Model). So I ended up having to use a fastai Language Model (pre-trained on wikitext-103) to predict whether the next token could be a punctuation mark or not. See https://youtu.be/qqt3aMPB81c?t=1544
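
The decision rule itself is simple once you have any model that exposes a next-token distribution. A minimal sketch of the idea (not my exact code; `next_token_probs` here is just a placeholder dict standing in for whatever your language model returns, not a real API):

```python
# Placeholder sketch: decide whether the next token is likely to be punctuation,
# given a next-token probability distribution from some language model.
PUNCTUATION = {".", ",", "?", "!", ";", ":"}

def likely_punctuation(next_token_probs, threshold=0.3):
    """next_token_probs: dict mapping candidate next tokens to probabilities."""
    punct_mass = sum(p for tok, p in next_token_probs.items() if tok in PUNCTUATION)
    return punct_mass >= threshold

# e.g. after the fragment "can we use bert for punctuation prediction"
print(likely_punctuation({"?": 0.41, "in": 0.12, "of": 0.09}))  # True
```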

thombrem commented 4 years ago

Hello corkindrill!

  1. Why do you think BERT cannot be used to predict next words?
  2. Is that a fact or an opinion?
  3. If you think it is a fact, is there any evidence to back your hypothesis?
  4. What is the best and worst case F1 score of the fastai language model?
  5. How much training data does the fastai model need to become an effective English language model?

I am simply curious to know the answers to these questions. I am neither refuting nor supporting the claim about BERT. Please do not treat these questions as either defensive or combative in nature.

Thx Milind

al-yakubovich commented 4 years ago

Actually, we can: https://github.com/nkrnrnk/BertPunc
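
For anyone who just wants the general shape of this kind of approach: one common framing is token classification on top of a BERT encoder, predicting which punctuation mark (if any) follows each word. The sketch below is not the exact BertPunc recipe; it uses the HuggingFace transformers port of BERT and a made-up label set, and the classification head is untrained, so it needs fine-tuning on punctuated text before the outputs mean anything.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Made-up label set: which punctuation mark (if any) follows each word.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # head is randomly initialised; fine-tune on punctuated text first

words = "can we use bert for punctuation prediction".split()
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
preds = logits.argmax(-1)[0]

# Map sub-token predictions back to words (take the first sub-token of each word).
seen = set()
for i, wid in enumerate(enc.word_ids()):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    print(words[wid], LABELS[preds[i].item()])
```

At inference time on ASR output you would strip punctuation and casing, run the fine-tuned model over a sliding window, and re-insert the predicted marks.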

al-yakubovich commented 4 years ago

@cvenour Predicting whether the next token could be a punctuation mark or not seems like an interesting idea as well! Could you share your code with us?

cvenour commented 4 years ago

My code is copied almost verbatim from this fastai notebook: https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb

The creator of that notebook loads a Language Model that has been pre-trained on a corpus called wikitext-103, then fine-tunes it on the corpus he actually cares about. Before turning that Language Model into a classifier, which is the main goal of his notebook (but not my main goal), he experiments a bit with the fine-tuned Language Model, making predictions about what the next token might be, given an input sentence fragment.

So to see how to implement a Language Model, take a look at all the code in that notebook that precedes the section entitled "Classifier".
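
For reference, the rough shape of that notebook's language-model code is something like the sketch below (fastai v1 API; the path, CSV name and column name are placeholders, not the notebook's exact values):

```python
# Rough sketch of the fastai v1 pattern from lesson3-imdb.ipynb.
from fastai.text import *

path = Path('data')  # placeholder: folder containing texts.csv with a 'text' column

# Build a language-model data bunch from the (already punctuated) corpus you care about.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv', text_cols='text')

# Load the AWD-LSTM pre-trained on wikitext-103 and fine-tune it on that corpus.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

# Ask the fine-tuned LM to continue a fragment; punctuation comes back as ordinary tokens.
print(learn.predict("can we use bert for punctuation prediction", n_words=10, temperature=0.75))
```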

cvenour commented 4 years ago

Hi thombrem,

I asked Jacob Devlin, one of the creators of BERT, whether BERT could be used as a Language Model, and he said no, because it wasn't trained to do that sort of thing. But it looks like al-yakubovich has a solution for you that has somehow adapted BERT to make next-token predictions.
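
For what it's worth, even though BERT isn't a left-to-right language model, its masked-LM pre-training head can still score candidate tokens at a masked position, which is roughly how these adaptations work. A minimal illustration (using the HuggingFace transformers port rather than the TF code in this repo; scores from an off-the-shelf bert-base-uncased are only illustrative):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Ask the masked-LM head how plausible each candidate token is at the masked slot.
text = "can we use bert for punctuation prediction [MASK] let us find out"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_pos].softmax(dim=-1)

for candidate in [".", ",", "?", "the"]:
    token_id = tokenizer.convert_tokens_to_ids(candidate)
    print(candidate, round(probs[0, token_id].item(), 4))
```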


thombrem commented 4 years ago

Dear corkindrill,

In case you have a record (speech/text/video, or a combination of the above) of the conversation with Devlin, could you please share the original communication? For my training and quality purposes? Of course, you must respect the privacy of Devlin and the other BERT authors, so please check with them before making any such communication public.

On the other hand, please do post his acceptance or refusal in your own words, unless that is deemed a violation of privacy as well. :)

~Milind

delltower commented 4 years ago

unsubscribe
