Create Grand Unified Corpus or not

erikrose commented 5 years ago

From time to time, we've kicked around the idea of making a big, unified corpus, possibly even labeled, reusable for new projects. We talked about it at a team meeting, and Daniel and I spent another hour exploring the design space. Here are our notes. Feel free to edit.

The intent of this ticket is to store these notes until we're ready to act on them. The ticket will be complete when we either build something like this or decide not to.

A note about my bulleting conventions: + is a pro, - is a con, and the absence of a sigil represents a neutral bullet.

Things we need to do with samples

Vectorize
Evaluate (debug)
Examine source and rendered page for rule creation and remediation
Version with the rulesets and rubrics?
- How important is it really to be able to go back in time and reproduce training results? Use cases:
  - It turns out fathom-train was buggy.
  - We change accuracy metrics and want new-style numbers on old data and code. (Why?)
  - We adjust the rubric and want to version label changes with it. (Why?)
  - We want to understand how the page may change over time; i.e. when is our ruleset stale?
Make available for reuse
Add/modify labels

Who is "we"?

First-party: Fathom team
Second-party: other Mozilla teams using Fathom
Third-party: contractors doing corpus collection

Storage alternatives

Separate repos, as now
- +Tooling already written
- -Hard to know what corpora are available for reuse
  - Could be ameliorated with a central list of rubrics with hyperlinks to repos (simplest thing that could possibly work)
- +No servers to run
- -Using samples for multiple rulesets leads to redundant storage of those samples
A super-repo (or server or bucket or whatever), and copy samples to individual repos
A super-repo (etc.), and decouple labels into separate files versioned in individual repos
- +No namespacing issues (when same-named tags are used with different meanings on different projects)
- +Saves space
- -Requires more tooling to be written, taught, and learnt
- "Label files" could be tuple-store-like triples of (sample ID or URL, Simmer selector or dotted path or something, label).
- Samples could be divided into testing, training, and validation chunks by listing them in different files. Some of our tooling would have to be changed to accommodate that.
Monolithic app. We could write a giant app that gives you a full-text search of existing rubrics. Pick an interesting rubric, and you can navigate from there to a gallery of applicable samples. Select a bunch and throw them in your "shopping cart" or whatever. Then you can come back after doing some ruleset development and say "Gimme 10 more in my training bucket". The app would then either spit out vector files for you or else imbibe your ruleset and run the training itself, giving you back only coeffs.
- -Lots of tooling to write
- +Hopefully easier to use
- -Probably too soon to be confident enough in people's needs to invest in this
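
To make the decoupled-labels alternative a bit more concrete, here is a minimal sketch of label files as (sample, selector, label) triples, plus per-bucket sample lists. The tab-separated layout, the example rows, and every name in it (Label, read_labels, read_split, labels_for_split, the sample-001 style IDs) are assumptions for illustration, not existing Fathom tooling.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class Label(NamedTuple):
    sample: str    # sample ID or URL (hypothetical ID scheme)
    selector: str  # selector or dotted path locating the element
    label: str     # tag name, meaningful only within one project's rubric


def read_labels(text: str) -> list[Label]:
    """Parse tab-separated triples, skipping blank lines and # comments."""
    rows = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Label(*row) for row in rows if row and not row[0].startswith("#")]


def read_split(text: str) -> set[str]:
    """Parse a bucket file (training, validation, or testing): one sample per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def labels_for_split(labels: list[Label], split: set[str]) -> dict[str, list[Label]]:
    """Collect the labels for the samples in one bucket, keyed by sample."""
    grouped = defaultdict(list)
    for item in labels:
        if item.sample in split:
            grouped[item.sample].append(item)
    return dict(grouped)


# Hypothetical file contents; in practice these would live in a project repo.
LABELS_TSV = (
    "# sample\tselector\tlabel\n"
    "sample-001\t#price\tprice\n"
    "sample-001\th1.title\ttitle\n"
    "sample-002\tspan.cost\tprice\n"
)
TRAINING_TXT = "sample-001\n"

training_labels = labels_for_split(read_labels(LABELS_TSV), read_split(TRAINING_TXT))
```

Because the labels and the bucket lists are plain text files, each project could version them alongside its ruleset while the samples themselves stay in the shared store.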

biancadanforth commented 5 years ago

Sharing of corpora is by far the strongest reason why having a centralized repository would be helpful, as virtually all of our applications want to run on every webpage. I also think it would significantly lower the barrier for folks who want to use Fathom in their projects.

I added a bullet to your initial post under "Version with the rulesets and rubrics?":

We want to understand how the page may change over time; i.e. when is our ruleset stale?

I really like this option: "A super-repo (etc.), and decouple labels into separate files". If we were the centralized keepers of these "clean" samples, we could periodically crawl the web, visit and re-freeze these pages and do some analysis on how frequently and in what ways they change over time. This would help us provide some guidance to folks who want to maintain a Fathom ruleset in a production environment.
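
As a rough sketch of what that analysis could look like, the snippet below compares two frozen snapshots of the same page by their sequence of opening tags, so that text-only edits don't count as change. The choice of metric, and the names TagCollector, tag_sequence, and structural_similarity, are assumptions for illustration; nothing like this exists in Fathom today.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    """Return the opening-tag names of an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(old_html: str, new_html: str) -> float:
    """Score two snapshots from 0.0 (unrelated) to 1.0 (same tag sequence)."""
    matcher = SequenceMatcher(None, tag_sequence(old_html), tag_sequence(new_html))
    return matcher.ratio()
```

Tracking this score across successive re-freezes of a page would give one crude signal of how quickly its structure drifts. It ignores attributes and text, so a real staleness check would probably also need to re-run the ruleset against the new snapshot.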

erikrose commented 5 years ago

I do have Vlad collecting a page for each of the top N Tranco sites, for use as a source of negative samples.

danielhertenstein commented 5 years ago

Hooray, Vlad!
