meta-llama / llama3

The official Meta Llama 3 GitHub site

List the "publicly available sources" 15T dataset list from Llama 3 #39

Status: Open. bennmann opened 7 months ago

bennmann commented 7 months ago

Llama 3 is not reproducible in any meaningful capacity without a list of the dataset sources.

Please release a list of the sources.

grothedev commented 7 months ago

Related question: why train only on publicly available data from the internet? If you want quality language and good knowledge, wouldn't you want to train on things like textbooks, historical documents, scientific research papers, and the like, things you could get in a library? I'm talking about classic, fundamental knowledge. Training on classical philosophy would probably improve reasoning skills, and training on the OG programming textbooks would be very good for programming.