sandeeppilania opened this issue 5 years ago
Have you considered splitting the document into chunks of 512 tokens and then using the most common classification as the final classification?
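A minimal sketch of that idea with a Hugging Face tokenizer and classifier (the model name and `num_labels=2` are placeholders for your own fine-tuned checkpoint and label set):

```python
# Sketch only: chunk a long document into <=512-token windows, classify each
# chunk, and let the most common prediction decide the document label.
from collections import Counter

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def classify_long_document(text, chunk_size=510):  # 510 tokens + [CLS] + [SEP] = 512
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    votes = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        with torch.no_grad():
            logits = model(torch.tensor([chunk])).logits
        votes.append(int(logits.argmax(dim=-1)))
    # Majority vote over the per-chunk predictions.
    return Counter(votes).most_common(1)[0][0]
```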
Do you have a reference paper or blog link that explains why? Thanks!
@EncoreOliver This is a common workaround in text classification. Besides chunking the document, people have also suggested summarizing the document down to a length that the model can process. I can't think of a paper/blog off the top of my head right now.
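For the summarization route, a rough sketch with generic Hugging Face pipelines (the default pipeline models are just placeholders, and the summarizer has its own input limit, so extremely long documents may still need to be summarized chunk by chunk):

```python
# Rough sketch of the summarize-then-classify workaround: compress the
# document first, then run the classifier on the summary.
from transformers import pipeline

summarizer = pipeline("summarization")        # placeholder; pick any summarizer
classifier = pipeline("text-classification")  # placeholder; use your fine-tuned BERT

def classify_via_summary(text):
    summary = summarizer(text, max_length=128, truncation=True)[0]["summary_text"]
    return classifier(summary)[0]  # e.g. {"label": ..., "score": ...}
```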
@EncoreOliver https://arxiv.org/abs/1905.05583 The authors recommend using the head and tail tokens of the document. I think the best way to chunk documents depends on the dataset.
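A quick sketch of that head+tail truncation, assuming a Hugging Face tokenizer; the 128/382 defaults follow the head+tail setting reported in the paper, but treat them as a starting point to tune on your own data:

```python
# Sketch: keep the first `head` and last `tail` tokens of a long document
# so it fits into BERT's 512-token window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def head_tail_encode(text, head=128, tail=382):
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) > head + tail:
        ids = ids[:head] + ids[-tail:]
    # Re-add [CLS]/[SEP]: 128 + 382 + 2 special tokens = 512.
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
```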
@hsm207 @stevewyl Thank you guys for your insights on this topic! I put together a tutorial and a Colab notebook on applying BERT to document classification here: https://youtu.be/_eSGWNqKeeY, and credited this thread for some of the ideas (at around 13:20) :)
Can I get the Colab notebook link?
How can I use BERT to fine-tune for document classification? Has anyone implemented it? Any example or lead would be really helpful. I want to use it for documents that are way bigger than the current max length (512 tokens).