speakleash / speakleash-instruct-creator

Generate instruction datasets for fine-tuning purposes.

New Instructions Dataset: Text Summarization #45

Open IgorTest19 opened 3 months ago

IgorTest19 commented 3 months ago

Create a new instruction dataset of text summaries. A simple example:

{
  "instruction": "Podsumuj poniższy tekst",
  "input": "Rano było dosyć chłodno, ale bardzo szybko się rozgrzało i zrobiło się naprawdę gorąco. W południe upał był niemal nie do zniesienia, szczególnie w mieście gdzie nawet lekki wiaterek nie przynosił ulgi.",
  "output": "Opis szybkiej zmiany pogody od chłodnego poranka do znacznego wzrostu temperatury w południe."
},

It would be beneficial to include metadata fields, such as:

"source_name": "The name of the resource used to create the dataset, if any."
"source_url": "The URL of the source dataset, if any."
"source_description": "A short description of the source dataset: what it is about, why it was created, and who its authors are."
"script_name": "The name of the script that generated the dataset, if it is reusable and you want to share it by committing it to our repository."
"status": "If the instruction has already been manually verified, set the status to 'ok'. If not, leave the field as an empty string or None."
"updated_by": "If the instruction has already been manually verified, leave your name/nickname in this field. It will help us give thanks :)"
"id": "A numeric identifier for the dataset entry."

Any other metadata fields that carry useful information are welcome.
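
For illustration only, a single entry combining the summarization example with the metadata fields listed above could be assembled roughly like this. The helper name `make_record`, the choice of an empty string for missing fields, and the sample `source_name` value are assumptions made for this sketch, not something the repository prescribes:

```python
import json

# Optional metadata fields named in this issue ("id" is handled separately).
OPTIONAL_FIELDS = (
    "source_name",
    "source_url",
    "source_description",
    "script_name",
    "status",
    "updated_by",
)


def make_record(entry_id, text, summary, **metadata):
    """Build one summarization entry; hypothetical helper, not part of the repo."""
    unknown = set(metadata) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown metadata fields: {sorted(unknown)}")

    record = {
        "instruction": "Podsumuj poniższy tekst",
        "input": text,
        "output": summary,
    }
    # Assumption: fields that are not provided are stored as empty strings.
    for field in OPTIONAL_FIELDS:
        record[field] = metadata.get(field, "")
    record["id"] = entry_id
    return record


record = make_record(
    1,
    "Rano było dosyć chłodno, ale bardzo szybko się rozgrzało.",
    "Opis szybkiej zmiany pogody.",
    source_name="example-source",  # placeholder value
)

# ensure_ascii=False keeps Polish characters readable in the output file.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Leaving `status` and `updated_by` empty here matches the suggestion that they are filled in only after manual verification.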

jansowa commented 3 months ago

Several types of new datasets are worth seeking out (listed from most to least important):

  1. ready-made Polish texts with summaries
  2. short documents whose summaries will be created automatically
  3. English-language texts with summaries, which we can translate automatically

The quality of the third option is likely to be the lowest, so it is not worth spending too much time on such an approach.