mozilla / fathom

A framework for extracting meaning from web pages
http://mozilla.github.io/fathom/
Mozilla Public License 2.0

Create Grand Unified Corpus or not #130

Open erikrose opened 5 years ago

erikrose commented 5 years ago

From time to time, we've kicked around the idea of making a big, unified corpus, possibly even labeled, reusable for new projects. We talked about it at a team meeting, and Daniel and I spent another hour exploring the design space. Here are our notes. Feel free to edit.

The intent of this ticket is to store these notes until we're ready to act on them. The ticket will be complete when we either build something like this or decide not to.

A note about my bulleting conventions: + is a pro, - is a con, and the absence of a sigil represents a neutral bullet.

Things we need to do with samples

Who is "we"?

Storage alternatives

biancadanforth commented 5 years ago

Sharing corpora is by far the strongest reason for a centralized repository, since virtually all of our applications want to run on every webpage. I also think it would significantly lower the barrier for folks who want to use Fathom in their projects.

I added a bullet to your initial post under "Version with the rulesets and rubrics?"

  • We want to understand how the page may change over time; i.e. when is our ruleset stale?

I really like this option: "A super-repo (etc.), and decouple labels into separate files". If we were the centralized keepers of these "clean" samples, we could periodically crawl the web, revisit and re-freeze these pages, and analyze how frequently and in what ways they change over time. This would help us provide some guidance to folks who want to maintain a Fathom ruleset in a production environment.
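
To make the re-freeze idea a bit more concrete, here is a minimal sketch of one way to score how much a page's structure has drifted between two frozen snapshots. It is not part of Fathom or its tooling; the function names, the choice of comparing opening-tag sequences, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(old_html, new_html):
    """Return a ratio in [0, 1]; 1.0 means the tag sequences are identical.

    Text content and attributes are ignored, so this only captures
    changes in markup structure (an assumption of this sketch).
    """
    matcher = SequenceMatcher(
        None, tag_sequence(old_html), tag_sequence(new_html), autojunk=False
    )
    return matcher.ratio()


def is_stale(old_html, new_html, threshold=0.8):
    """Flag a sample whose structure drifted below a hypothetical threshold."""
    return structural_similarity(old_html, new_html) < threshold
```

Running this over each pair of successive snapshots would give a per-page drift score over time. Whether tag-level similarity actually predicts ruleset staleness is an open question; a real analysis would more likely re-run the ruleset against the re-frozen samples and compare accuracy.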

erikrose commented 5 years ago

I do have Vlad collecting a page for each of the top N Tranco sites, for use as a source of negative samples.
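
For anyone wanting to reproduce something similar, here is a rough sketch of picking the top N domains from a Tranco list. It assumes the list is a headerless CSV of `rank,domain` rows; `top_n_domains` is a hypothetical helper, and the thread does not describe how the pages are actually being collected.

```python
import csv
import io


def top_n_domains(csv_text, n):
    """Return the first n domains from Tranco-style 'rank,domain' rows."""
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(domains) >= n:
            break
        if len(row) < 2:
            continue  # skip blank or malformed lines
        domains.append(row[1].strip())
    return domains


# Placeholder data, not real Tranco rankings.
sample = "1,example.com\n2,example.org\n3,example.net\n"
print(top_n_domains(sample, 2))
```

Each returned domain would then be visited and frozen once to yield a candidate negative sample.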

danielhertenstein commented 5 years ago

Hooray, Vlad!