seomoz / dragnet_data

Training/test data for Dragnet
GNU Affero General Public License v3.0

questions re: creating my own data #2

Open bdewilde opened 7 years ago

bdewilde commented 7 years ago

Hi, I'm currently compiling additional, more modern html documents with gold standard content + comments for use in training dragnet models, and I have a few questions:

Thanks for your help!

matt-peters commented 7 years ago

All those will depend on what your final goals are for the model. The type of training data you collect and use will determine the final biases of your model. I'd say if you want those things to be considered content then add them to the gold standard, otherwise exclude. In our case, we made a decision to use three classes of output for each block (chrome, content and comments) based on product requirements at the time. If I were to collect data again I'd be tempted to add more classes like some of the things you mentioned (byline, pubdate, etc.) to capture more information.

bdewilde commented 7 years ago

Sure, makes sense. I'm really only interested in "content", but its definition is a bit subjective, and could certainly be broken up into subclasses. I don't think dragnet's data processing and training code is set up to handle any additional classes such as "content metadata" (byline, pubdate, and such), right?

matt-peters commented 7 years ago

Not directly, but it wouldn't be too hard to add. You could probably just replace the block_model with a multiclass classifier and it would mostly work.
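To make the multiclass idea concrete, here is a minimal sketch, assuming scikit-learn is installed and that the block-level model can be any classifier with `fit`/`predict_proba`. It is not dragnet's training code: the class list, the two-column feature vectors, and their values are all invented for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical label set: the three original classes plus two metadata classes.
CLASSES = ["chrome", "content", "comments", "byline", "pubdate"]

# Invented per-block features, e.g. [text density, link density].
# Real features would come from the block-level feature extraction step.
X_train = [
    [0.10, 0.90], [0.15, 0.85],  # chrome
    [0.90, 0.05], [0.85, 0.10],  # content
    [0.60, 0.20], [0.55, 0.25],  # comments
    [0.30, 0.40], [0.35, 0.45],  # byline
    [0.20, 0.10], [0.25, 0.15],  # pubdate
]
y_train = [
    "chrome", "chrome",
    "content", "content",
    "comments", "comments",
    "byline", "byline",
    "pubdate", "pubdate",
]

# Tree ensembles handle more than two classes without any extra wrapping.
clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# One probability per class for each block, in the order given by clf.classes_.
probs = clf.predict_proba([[0.88, 0.07]])
```

Any code downstream that assumes a single "content" probability per block would then need to pick the column for the class it cares about, which is probably where the "mostly" in "mostly work" comes in.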

bdewilde commented 7 years ago

Okay, one more (last?) question: How important is it to maintain whitespace when manually copy/pasting the content data from the raw html rendered in a browser, or is that all handled in dragnet's data_processing.py? Things get tricky when there are images and such between blocks of content...

matt-peters commented 7 years ago

It isn't important: dragnet tokenizes the gold standard and the web page text, then aligns the blocks to the gold standard based on overlapping tokens. So as long as you leave some space between each content block, it'll handle the rest.
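A rough sketch of why whitespace drops out of the picture, assuming a simple regex tokenizer and a fraction-of-overlap rule. This is a simplification for illustration, not the logic in dragnet's data_processing.py; `tokenize`, `label_blocks`, and the 0.5 threshold are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters, so whitespace layout is ignored."""
    return re.findall(r"\w+", text.lower())


def label_blocks(blocks, gold_standard, threshold=0.5):
    """Label a block 1 if enough of its tokens also appear in the gold standard, else 0."""
    gold_tokens = set(tokenize(gold_standard))
    labels = []
    for block in blocks:
        tokens = tokenize(block)
        if not tokens:
            # Blocks with no text (e.g. image-only) cannot match anything.
            labels.append(0)
            continue
        overlap = sum(1 for t in tokens if t in gold_tokens) / len(tokens)
        labels.append(1 if overlap >= threshold else 0)
    return labels


# Blocks as they might be extracted from the page, with irregular spacing.
blocks = [
    "Home | About | Contact",
    "Dragnet extracts  the main article text.",
    "Copyright 2017",
]
# Gold standard pasted by hand, with different line breaks than the page.
gold = "Dragnet extracts the main\n\narticle text."

labels = label_blocks(blocks, gold)
```

Because both sides are reduced to tokens before comparison, extra spaces or blank lines in the pasted gold standard do not change the labels. What does matter is that adjacent blocks are not pasted together without any separator, since that would fuse the last word of one block with the first word of the next.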