Open erikrose opened 5 years ago
Sharing corpora is by far the strongest argument for a centralized repository, as virtually all of our applications want to run on every webpage. I also think it would significantly lower the barrier for folks who want to use Fathom in their projects.
I added a bullet on your initial post under "Version with the rulesets and rubrics?"
- We want to understand how the page may change over time; i.e. when is our ruleset stale?
I really like this option: "A super-repo (etc.), and decouple labels into separate files". If we were the centralized keepers of these "clean" samples, we could periodically crawl the web, revisit and re-freeze these pages, and analyze how frequently and in what ways they change over time. This would help us provide some guidance to folks who want to maintain a Fathom ruleset in a production environment.
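To make the "how much did this page change?" idea a bit more concrete, here is a rough sketch of one possible measure, not something we have built. It compares the sequence of opening tags in two frozen snapshots of the same page, using only the Python standard library. The names `tag_sequence`, `structural_similarity`, and `is_stale`, and the 0.8 threshold, are all made up for illustration; a real analysis would probably need to look at the specific elements a ruleset targets.

```python
import difflib
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect opening tag names in document order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(old_html, new_html):
    """Return a ratio in [0, 1]; 1.0 means the two tag sequences are identical."""
    matcher = difflib.SequenceMatcher(
        None, tag_sequence(old_html), tag_sequence(new_html), autojunk=False
    )
    return matcher.ratio()


def is_stale(old_html, new_html, threshold=0.8):
    """Hypothetical staleness check: flag a page whose structure drifted past the threshold."""
    return structural_similarity(old_html, new_html) < threshold
```

Running something like this over each re-freeze of a sample would give a per-page drift number over time, which could then be compared against when a ruleset's accuracy actually drops.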
I do have Vlad collecting a page for each of the top N Tranco sites, for use as a source of negative samples.
Hooray, Vlad!
From time to time, we've kicked around the idea of making a big, unified corpus, possibly even labeled, reusable for new projects. We talked about it at a team meeting, and Daniel and I spent another hour exploring the design space. Here are our notes. Feel free to edit.
The intent of this ticket is to store these notes until we're ready to act on them. The ticket will be complete when we either build something like this or decide not to.
A note about my bulleting conventions: + is a pro, - is a con, and the absence of a sigil represents a neutral bullet.
Things we need to do with samples
Who is "we"?
Storage alternatives