sambanova / generative_data_prep

Apache License 2.0

Creation of jsonl files #33

Closed snova-darshang closed 3 months ago

snova-darshang commented 1 year ago

https://github.com/sambanova/generative_data_prep#input-format

The input format instructions could be improved:

A. Can we provide examples of the expected format for common ML use cases: 1. pretraining, 2. fine-tuning, and 3. inference?

B. Are there suggested methods for cleaning up raw data/text, or pointers to common practices, that could be highlighted?

C. Can we explain the restrictions on the contents of "prompt" and "completion", such as maximum or minimum input length and what they should or should not contain?

snova-zoltanc commented 1 year ago

Thank you so much for your feedback! We will add more documentation and explanations around recommended input data.

github-actions[bot] commented 3 months ago

This Issue is stale because it has been open for 6 months with no activity. Remove stale label or comment or this issue will be closed in 30 days.

snova-zoltanc commented 3 months ago

Documentation has been updated in PR https://github.com/sambanova/generative_data_prep/pull/90

Suggested methods for cleaning up raw data/text, or pointers to common practices, are out of scope for this repository.
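
For anyone landing on this thread looking for how to create a jsonl file, here is a minimal sketch in Python. It assumes each line of the file is a JSON object with string fields "prompt" and "completion", which is what the question above implies; the README's input format section and the PR linked above are the authoritative description of the schema. The helper names `write_jsonl` and `read_jsonl`, the file name, and the example records are hypothetical, and the empty-prompt record is only an assumption about how prompt-less text might be represented.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the "prompt"/"completion" keys are taken from the issue text.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    # Assumption: text with no prompt is stored with an empty "prompt" string.
    {"prompt": "", "completion": "Free-form text\nspanning two lines."},
]

# Round-trip through a temporary file to check the output parses line by line.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")
    write_jsonl(records, path)
    loaded = read_jsonl(path)
```

Writing through `json.dumps` rather than formatting strings by hand avoids the most common jsonl mistake, which is an unescaped quote or newline inside the text breaking the one-object-per-line layout.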