speakleash / speakleash-instruct-creator

Generate instruction datasets for fine-tuning purposes.

Law based instructions #50

Open IgorTest19 opened 4 months ago

IgorTest19 commented 4 months ago

Create instructions set based on the Carbon Border Adjustment Mechanism (CBAM) Questions and Answers document: https://taxation-customs.ec.europa.eu/document/download/013fa763-5dce-4726-a204-69fec04d5ce2_en?filename=CBAM_Questions%20and%20Answers.pdf

Example of the main fields for the instructions type (the fields depend on the type of set being implemented: instructions, functions, conversations, etc.):

{
  "instruction": "Why is the EU putting in place a Carbon Border Adjustment Mechanism?",
  "input": "Text on which the answer is based, if any; otherwise leave an empty string",
  "output": "The EU is at the forefront of international efforts..."
},
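
A minimal sketch of how one such entry could be checked and serialized as a single JSONL line. The helper name `is_valid_instruction` and the rule that `instruction` and `output` must be non-empty are assumptions made for illustration, not part of the repository:

```python
import json

# The three main fields of the "instructions" type described above.
MAIN_FIELDS = ("instruction", "input", "output")


def is_valid_instruction(record: dict) -> bool:
    """Hypothetical check: all main fields are strings, and
    'instruction' and 'output' are not blank ('input' may be empty)."""
    if not all(isinstance(record.get(field), str) for field in MAIN_FIELDS):
        return False
    return bool(record["instruction"].strip()) and bool(record["output"].strip())


record = {
    "instruction": "Why is the EU putting in place a Carbon Border Adjustment Mechanism?",
    "input": "",  # no supporting text, so an empty string
    "output": "The EU is at the forefront of international efforts...",
}

# One JSON object per line is the JSONL convention; ensure_ascii=False
# keeps non-ASCII characters readable if the set is translated later.
line = json.dumps(record, ensure_ascii=False)
```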

It would be beneficial to include metadata fields, such as:

"source_name": "The name of the resource used for the dataset creation, if any was used."
"source_url": "The URL of the source dataset, if any was used."
"source_description": "A short description of the source dataset: what it is about, the purpose of its creation, its authors."
"script_name": "The name of the script generating the dataset, if it is reusable and you want to share it with us by committing it to our repository."
"status": "If the instruction has already been manually verified, you can set the status to 'ok'. If not, leave the field as an empty string or None."
"updated_by": "If the instruction has already been manually verified, leave your name/nickname in this field. It will help us give thanks :)"
"id": "Numeric identifier for the dataset entry."

Any other metadata fields containing useful information are welcome.
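
As a rough sketch, the metadata fields listed above could be attached to an entry like this. The helper `with_metadata` and the choice of empty strings as defaults are assumptions for illustration; only the field names come from this issue:

```python
# Metadata fields named in this issue, each defaulting to an empty string.
METADATA_DEFAULTS = {
    "source_name": "",
    "source_url": "",
    "source_description": "",
    "script_name": "",
    "status": "",
    "updated_by": "",
}


def with_metadata(record: dict, entry_id: int, **metadata: str) -> dict:
    """Hypothetical helper: return a copy of record with metadata and id added."""
    unknown = set(metadata) - set(METADATA_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown metadata fields: {sorted(unknown)}")
    # Later keys win: defaults first, then the values passed by the caller.
    return {**record, **METADATA_DEFAULTS, **metadata, "id": entry_id}


entry = with_metadata(
    {
        "instruction": "Why is the EU putting in place a Carbon Border Adjustment Mechanism?",
        "input": "",
        "output": "The EU is at the forefront of international efforts...",
    },
    entry_id=1,
    source_name="Carbon Border Adjustment Mechanism (CBAM) Questions and Answers",
)
```

Leaving `status` and `updated_by` empty here matches the convention above for entries that have not been manually verified yet.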

PinusSilvestris commented 4 months ago

Draft uploaded: https://d6t0.c15.e2-2.dev/speakleash-exchange-pub/CBAN_Q_and_A.jsonl

IgorTest19 commented 4 months ago

The file containing the instructions has been checked and looks good. We may consider adding metadata fields such as status, updated_by and id, which would be helpful. The currently generated instructions will be manually verified, and we may consider translating this instruction set.
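
One possible way to add the suggested fields to an existing JSONL draft, shown only as a sketch. The function name `add_review_fields`, the sequential numbering starting at 1, and the sample lines are assumptions for illustration and do not reflect the contents of the uploaded file:

```python
import json


def add_review_fields(lines, start_id=1):
    """Hypothetical helper: add status, updated_by and id to each JSONL line
    that lacks them, without overwriting values that are already present."""
    enriched = []
    next_id = start_id
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        record.setdefault("status", "")      # empty until manually verified
        record.setdefault("updated_by", "")  # filled in by the reviewer
        record.setdefault("id", next_id)
        next_id += 1
        enriched.append(json.dumps(record, ensure_ascii=False))
    return enriched


# Made-up sample lines standing in for the draft file.
sample = [
    '{"instruction": "Q1", "input": "", "output": "A1"}',
    "",
    '{"instruction": "Q2", "input": "", "output": "A2", "status": "ok"}',
]
result = [json.loads(line) for line in add_review_fields(sample)]
```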