r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Scrape examples from foodista #69

blester125 closed 5 months ago

blester125 commented 5 months ago

This PR scrapes training data from foodista, a shared collection of recipes and information about cooking tools, techniques, and ingredients that is distributed under the CC-BY 3.0 license.

Data is collected in 4 steps:

  1. An index of pages is built from the sitemap.
  2. All pages are downloaded.
  3. The pages are converted to dolma examples, with the raw HTML stored under the "text" key.
  4. The HTML is parsed as part of a dolma processor.
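
For reviewers who want a feel for the flow, the four steps above can be sketched roughly as follows. This is a standard-library illustration only, not the code in this PR: the function names (`build_index`, `download`, `to_example`, `parse_example`), the record fields other than "text", and the plain-text extraction rule are all assumptions made for the example, and the real step 4 runs inside a dolma processor rather than as a bare function.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def build_index(sitemap_xml: str) -> list[str]:
    """Step 1: collect page URLs from the <loc> entries of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def download(url: str) -> str:
    """Step 2: fetch one page (no retries or rate limiting in this sketch)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def to_example(url: str, html: str) -> dict:
    """Step 3: wrap the raw HTML in a record, keeping it under "text".

    Every field other than "text" is a hypothetical choice for this sketch.
    """
    return {"id": url, "text": html, "source": "foodista", "metadata": {"url": url}}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def parse_example(example: dict) -> dict:
    """Step 4: replace the raw HTML in "text" with extracted plain text."""
    extractor = _TextExtractor()
    extractor.feed(example["text"])
    extractor.close()
    return {**example, "text": "\n".join(extractor.parts)}
```

Keeping the raw HTML in step 3 and deferring parsing to step 4 means the parser can be changed and re-run without downloading the pages again.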

closes https://github.com/r-three/licensed-pile/issues/11