Open bdewilde opened 7 years ago
All those will depend on what your final goals are for the model. The type of training data you collect and use will determine the final biases of your model. I'd say if you want those things to be considered content then add them to the gold standard, otherwise exclude. In our case, we made a decision to use three classes of output for each block (chrome, content and comments) based on product requirements at the time. If I were to collect data again I'd be tempted to add more classes like some of the things you mentioned (byline, pubdate, etc.) to capture more information.
Sure, makes sense. I'm really only interested in "content", but its definition is a bit subjective, and could certainly be broken up into subclasses. I don't think dragnet's data processing and training code is set up to handle any additional classes such as "content metadata" (byline, pubdate, and such), right?
Not directly, but it wouldn't be too hard to add. You could probably just replace the block_model with a multiclass classifier and it would mostly work.
Okay, one more (last?) question: How important is it to maintain whitespace when manually copy/pasting the content data from the raw html rendered in a browser, or is that all handled in dragnet's data_processing.py? Things get tricky when there are images and such between blocks of content...
It isn't important: dragnet tokenizes the gold standard and the web page text, then aligns the blocks to the gold standard based on overlapping tokens. So as long as you leave some space between each content block, it'll handle the rest.
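To illustrate why exact whitespace shouldn't matter, here is a rough sketch of the idea of labeling blocks by token overlap with the gold standard. This is not dragnet's actual code: the function names `tokenize` and `label_blocks`, the simple bag-of-tokens overlap, and the `threshold` value are all assumptions made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters, so whitespace differences vanish."""
    return re.findall(r"\w+", text.lower())


def label_blocks(blocks, gold_text, threshold=0.5):
    """Label a block 1 (content) if enough of its tokens occur in the gold standard.

    Hypothetical helper: a bag-of-tokens approximation, not dragnet's alignment.
    """
    gold_counts = Counter(tokenize(gold_text))
    labels = []
    for block in blocks:
        tokens = tokenize(block)
        if not tokens:
            labels.append(0)
            continue
        # Multiset intersection: each gold token can be matched at most
        # as many times as it appears in the gold standard.
        overlap = sum((Counter(tokens) & gold_counts).values())
        labels.append(1 if overlap / len(tokens) >= threshold else 0)
    return labels


blocks = [
    "Home | About | Contact",
    "The quick brown fox jumps over the lazy dog.",
    "It then   ran\n\naway into the forest.",
    "Copyright 2017 Example Corp",
]
# Gold standard pasted from the browser, with different spacing than the blocks.
gold = "The quick brown fox jumps over the lazy dog.\n\nIt then ran away into the forest."

labels = label_blocks(blocks, gold)
print(labels)
```

Because both sides are reduced to tokens before comparison, the third block still matches even though its spacing and line breaks differ from the pasted gold standard.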
Hi, I'm currently compiling additional, more modern html documents with gold standard content + comments for use in training dragnet models, and I have a few questions:
Thanks for your help!