This PR scrapes training data from Foodista, a shared collection of recipes and information about cooking tools, techniques, and ingredients that is distributed under the CC-BY 3.0 license.
Data is collected in the following steps:
1. An index of pages is built from the sitemap.
2. All pages are downloaded.
3. The pages are converted to dolma examples, with the raw HTML stored under the "text" key.
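
For reviewers who want a feel for the flow, here is a minimal sketch of the steps listed above. It is not the code in this PR: the function names (`build_index`, `fetch`, `to_dolma`), the `example.com` URLs, and the exact record fields other than "text" are assumptions made for illustration, and the sitemap namespace is the standard sitemaps.org one.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Standard sitemaps.org namespace; assumed to be what the site's sitemap uses.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def build_index(sitemap_xml: str) -> list[str]:
    """Collect every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def fetch(url: str) -> str:
    """Download one page and return its raw HTML (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def to_dolma(url: str, html: str) -> dict:
    """Wrap a page as a dolma-style record with the raw HTML under "text".

    The fields other than "text" are illustrative assumptions.
    """
    return {
        "id": url,
        "text": html,
        "source": "foodista",
        "metadata": {"url": url, "license": "CC-BY 3.0"},
    }


# Tiny in-memory sitemap so the sketch runs without network access.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/recipe/1</loc></url>
  <url><loc>https://example.com/tool/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for page_url in build_index(EXAMPLE_SITEMAP):
        # A real run would call fetch(page_url); a stub page is used here.
        record = to_dolma(page_url, "<html><body>placeholder</body></html>")
        print(json.dumps(record))
```

Keeping the raw HTML in "text" at this stage means the download only has to happen once; any later HTML-to-text cleanup can be rerun over the saved records.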
Closes https://github.com/r-three/licensed-pile/issues/11