Ayyub has text of news articles from NewsBank. These should be processed and cleaned so that their formatting matches, as closely as possible, that of the news stories that go into the news classifier.
Notes:
We should be able to extract the text without having to do OCR: the command-line app pdftotext (which is part of the poppler install) does this and outputs lines of text. I've also used the pdftools package in R, which is an R package built on the same poppler library.
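As a rough sketch of the extraction step, the snippet below shells out to pdftotext from Python and returns the text as a list of lines. The function names (build_pdftotext_cmd, pdf_to_lines) and the keep_layout switch are made up for illustration, and it assumes pdftotext is on the PATH and that the PDFs carry a text layer.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=False):
    """Build the argument list for poppler's pdftotext, writing to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout keeps the physical page layout, which can help with columns.
        cmd.append("-layout")
    # A trailing "-" as the output file makes pdftotext write to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_lines(pdf_path, keep_layout=False):
    """Return the text layer of a PDF as a list of lines (no OCR involved)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.splitlines()
```

Calling pdf_to_lines("article.pdf") (a placeholder file name) would give the raw lines, boilerplate included, ready for the cleaning step.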
From there, we need to remove header/footer material (the part that says "Newsbank", etc.) and split the headline from the rest of the text.
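One possible shape for the cleaning step is sketched below. The boilerplate patterns and the rule that the first remaining non-blank line is the headline are assumptions, not something confirmed against the actual NewsBank files, so they would need adjusting after looking at real output. The names BOILERPLATE and split_article are hypothetical.

```python
import re

# Hypothetical boilerplate patterns; adjust after inspecting real files.
BOILERPLATE = re.compile(
    r"newsbank|^page \d+( of \d+)?$|^copyright\b|^©",
    re.IGNORECASE,
)


def split_article(lines):
    """Drop boilerplate lines, then return (headline, body)."""
    kept = [ln.strip() for ln in lines if not BOILERPLATE.search(ln.strip())]
    # Discard leading blank lines so the first kept line is the headline.
    while kept and not kept[0]:
        kept.pop(0)
    if not kept:
        return "", ""
    headline, rest = kept[0], kept[1:]
    # Join the remaining lines and collapse runs of blank lines.
    body = "\n".join(rest).strip()
    body = re.sub(r"\n{3,}", "\n\n", body)
    return headline, body
```

Note that the "newsbank" pattern drops any line containing that word, so a story that mentions NewsBank in its body would lose that line; anchoring the pattern to the exact header/footer wording would be safer once that wording is known.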