
Generate instruction datasets for fine-tuning purposes.


# Speakleash Training and Tuning Datasets

This repository provides tools for generating datasets for training and tuning Large Language Models (LLMs).

## Overview

Datasets are divided into three main categories:

- **[instructions](https://github.com/speakleash/speakleash-instruct-creator/tree/main/instructions)**
- **[conversations](https://github.com/speakleash/speakleash-instruct-creator/tree/main/conversations)**
- **[functions](https://github.com/speakleash/speakleash-instruct-creator/tree/main/functions)**

Each category consists of the following types of content:

- **automated**
- **manual**
- **samples**

### Content type descriptions

- #### automated
  Datasets are fully generated by scripts. Downloading data, generating datasets, and saving them is handled entirely by scripts.
- #### manual
  Only part of the dataset generation process is automated; further human intervention is required to complete the datasets.
- #### samples
  Examples of the generated datasets (up to 3 records).

## Usage

Each category (instructions, conversations, functions) has its own directory, containing subdirectories for automated, manual, and sample datasets. Inside each subdirectory you will find examples and explanations of how each type of dataset should be structured.

## Generated dataset files

### Instructions
Released instructions version: **2024_03_07_v0_0_13**. Download links:
- All generated instructions in one JSONL file: [speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl](https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl)
- All generated instructions in one JSONL file (Alpaca format): [speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl](https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl)
- All generated instructions in one Parquet file (Alpaca format): [speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.parquet](https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.parquet)
- All generated instruction JSON files packed into one zip file: [instructions_not_merged_2024_03_07_v0_0_13.zip](https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/instructions_not_merged_2024_03_07_v0_0_13.zip)

Or using terminal commands:
- For Linux: `wget https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
- For Windows: `curl -O https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
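Once downloaded, a JSONL file holds one JSON object per line and can be streamed record by record without loading the whole dataset into memory. A minimal Python sketch; the demo builds a tiny stand-in file, and the `instruction`/`input`/`output` field names follow the Alpaca convention — inspect the downloaded file to confirm its actual schema:

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo on a tiny stand-in file; point `path` at the downloaded
# speakleash_pl_instructions_*.jsonl instead. Field names assume
# the Alpaca convention and are illustrative only.
sample = [
    {"instruction": "Przetlumacz zdanie.", "input": "Hello", "output": "Czesc"},
    {"instruction": "Dodaj liczby.", "input": "2 2", "output": "4"},
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as fh:
    for rec in sample:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = list(read_jsonl(path))
```

Because the generator reads line by line, memory use stays flat even for the full multi-file release.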
## Contribution

To contribute, clone this repository and add new scripts (e.g., `allegro-summarization.py`) to the chosen directory (instructions, conversations, functions). If you identify additional types of training datasets that should be included, please contribute by creating an issue in the repository. New sections covering the proposed types will be added based on feedback and discussion.

#### Further details on how to create instructions:

https://docs.google.com/document/d/1GZXCLx_Wb2QnAaqPPp0USHhmtBcrNSWCYwHU6r9RSLM/edit

## Working with the code

#### Datasets

Internal datasets from the `Speakleash` package are downloaded separately to the `data_speakleash` directory. This is a temporary solution required by the current version of the `Speakleash` package: the `manifests` files are downloaded automatically to the same directory as the datasets, so the two directories were separated for better readability. This behaviour is intentional, but changes are planned, as described in this [issue](https://github.com/speakleash/speakleash/issues/10).

#### Workflow directories

Instruction files are generated in the `output` directory. External datasets are downloaded to the `data` directory.

#### Output

To produce one final instructions JSON file, merge the generated files using the `merge_files.py` script. The merged file is created in the `instructions_merged_and_stats` directory, along with statistics files describing the instruction data. To update the instruction samples, run the `generate_samples.py` script; it generates JSON files with three records each.

## Important Information

- `sentiment_detection.py` -> requires a HuggingFace token.
- `orca_math_create_english_docx.py` with `orca_math_create_json_from_docx.py` -> the results of these scripts need to be translated in an external service, so they are not included in `merge_files.py`. More information inside these scripts.
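Conceptually, the merge step performed by `merge_files.py` concatenates the per-script JSON files into a single output. The sketch below is an illustration under the assumption that each script writes a JSON array of records; it is not the repository's actual implementation, and the stand-in file names are hypothetical:

```python
import glob
import json
import os
import tempfile

def merge_json_files(src_dir, dst_path):
    """Concatenate all JSON arrays found in src_dir into one merged JSON file."""
    merged = []
    for name in sorted(glob.glob(os.path.join(src_dir, "*.json"))):
        with open(name, encoding="utf-8") as fh:
            merged.extend(json.load(fh))
    with open(dst_path, "w", encoding="utf-8") as fh:
        json.dump(merged, fh, ensure_ascii=False, indent=2)
    return merged

# Demo with two hypothetical per-script files; in the repository the
# inputs live in `output/` and the merged result (plus statistics) in
# `instructions_merged_and_stats/`.
src = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(src, f"script_{i}.json"), "w", encoding="utf-8") as fh:
        json.dump([{"instruction": f"task {i}", "output": f"answer {i}"}], fh)

merged = merge_json_files(src, os.path.join(tempfile.mkdtemp(), "merged.json"))
```

Sorting the file names makes the merge order deterministic across runs, which keeps the merged file and its statistics reproducible.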
- `speakleash_forums_questions.py` -> if the installed requirements do not work, follow the steps in this documentation: [StyloMetrix](https://github.com/ZILiAT-NASK/StyloMetrix)
- If you are facing problems with dependencies, install the following libraries manually (*a temporary solution, but it works*):
  - `pip install http://mozart.ipipan.waw.pl/~rtuora/spacy/pl_nask-0.0.7.tar.gz`
  - `pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_md-3.7.0/pl_core_news_md-3.7.0-py3-none-any.whl`