togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.57k stars, 350 forks

fixed some errors in Makefile for lm preparation #36

Open feifeibear opened 1 year ago

feifeibear commented 1 year ago
  1. Install sentencepiece from the GitHub repo. I cannot run the .zip version on macOS.
  2. Create some necessary directories during make.
  3. Reuse the cached wiki json.gz if it has already been downloaded.
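
Items 2 and 3 above can be illustrated with a minimal Python sketch. It is not the Makefile change from this PR: the helper names `ensure_dirs` and `fetch_if_missing`, the `.part` temporary suffix, and the injectable `download` parameter are all hypothetical, chosen only to show the "create directories first, skip the download when the file already exists" pattern.

```python
import os
import urllib.request
from typing import Callable, Iterable


def ensure_dirs(paths: Iterable[str]) -> None:
    """Create each directory if it is missing (same idea as `mkdir -p`)."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


def fetch_if_missing(
    url: str,
    dest: str,
    download: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the cached file was reused.
    `download` is a hypothetical hook so the function can be tried offline.
    """
    if os.path.exists(dest):
        # Cached copy found: skip the download entirely.
        return False

    # Make sure the target directory exists before writing into it.
    ensure_dirs([os.path.dirname(dest) or "."])

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete cached file.
    tmp = dest + ".part"
    download(url, tmp)
    os.replace(tmp, dest)
    return True
```

Calling `fetch_if_missing(url, dest)` a second time with the same `dest` returns `False` and does not touch the network, which is the behaviour item 3 describes.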

feifeibear commented 1 year ago

@mauriceweber could you review this PR?

mauriceweber commented 1 year ago

Hi @feifeibear, thanks a lot for your PR! I will look into it :)