google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Using bert for Document Classification #650

Open sandeeppilania opened 5 years ago

sandeeppilania commented 5 years ago

How can I use BERT to fine-tune for document classification? Has anyone implemented it? Any example or lead would be really helpful. I want to use it for documents that are much longer than the current max length (512 tokens).

hsm207 commented 5 years ago

Have you considered splitting the document into chunks of 512 tokens and then using the most common classification as the final classification?
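
A rough sketch of that chunk-and-vote idea (here `classify_chunk` is a hypothetical stand-in for a fine-tuned BERT classifier applied to one chunk, not something in this repo):

```python
from collections import Counter

def classify_document(token_ids, classify_chunk, max_len=512):
    """Split a long token sequence into fixed-size chunks, classify each
    chunk independently, and return the most common predicted label.

    `classify_chunk` is a hypothetical function that runs the fine-tuned
    BERT classifier on a single chunk and returns a label.
    """
    # Reserve two positions per chunk for the [CLS] and [SEP] tokens.
    chunk_size = max_len - 2
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    labels = [classify_chunk(chunk) for chunk in chunks]
    # Majority vote over the per-chunk predictions.
    return Counter(labels).most_common(1)[0][0]
```

Variants of this weight each chunk's vote by the classifier's confidence, or average the per-chunk logits instead of hard-voting.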

EncoreOliver commented 5 years ago

> Have you considered splitting the document into chunks of 512 tokens and then using the most common classification as the final classification?

Do you have a reference paper or blog link that explains why? Thanks!

hsm207 commented 5 years ago

@EncoreOliver This is a common workaround in text classification. Besides chunking the document, people have also suggested summarizing it down to a length the model can process. I can't think of a paper/blog off the top of my head right now.

stevewyl commented 5 years ago

@EncoreOliver https://arxiv.org/abs/1905.05583 The authors recommend using the head and tail tokens of the document. I think the best way to chunk documents depends on the dataset.
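
For reference, the head+tail strategy reported in that paper keeps the first 128 and last 382 tokens (510 total, leaving room for [CLS] and [SEP]); a minimal sketch:

```python
def head_tail_truncate(token_ids, head_len=128, tail_len=382):
    """Keep the first `head_len` and last `tail_len` tokens of a long
    document, following the head+tail strategy in arXiv:1905.05583.

    128 + 382 = 510 tokens, which leaves room for [CLS] and [SEP]
    within BERT's 512-token limit. These split sizes are the paper's
    reported choice and may need tuning per dataset.
    """
    if len(token_ids) <= head_len + tail_len:
        return token_ids
    return token_ids[:head_len] + token_ids[-tail_len:]
```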

chrisjmccormick commented 4 years ago

@hsm207 @stevewyl Thank you guys for your insights on this topic! I put together a tutorial and a Colab notebook on applying BERT to document classification here: https://youtu.be/_eSGWNqKeeY, and credit this thread for some of the ideas (at around [13:20]) :)

smdp2000 commented 3 years ago

> @hsm207 @stevewyl Thank you guys for your insights on this topic! I put together a tutorial and a Colab notebook on applying BERT to document classification here: https://youtu.be/_eSGWNqKeeY, and credit this thread for some of the ideas (at around [13:20]) :)

Can I get the Colab notebook link?