speakleash / speakleash-instruct-creator

Generate instruction datasets for fine-tuning purposes.

Verification of scripts with text summarizations #46

Open jansowa opened 5 months ago

jansowa commented 5 months ago

We currently have three scripts that create instructions for text summarization:

  1. https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/polish-news-summarization.py
  2. https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/polish-summaries-corpus.py
  3. https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/allegro-summarization.py

They should be reviewed for errors, formatting, conformance to the correct JSON format, etc.

jansowa commented 3 months ago

The scripts have been moved.

  1. Polish news summarization: https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions/automated/polish-news-summarization.py
    • The text can contain more than 30 (!) whitespace characters in a row. This should be easy to fix with a simple regex.
  2. Polish summaries corpus: https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions/automated/polish-summaries-corpus.py
    • The "instruct" field has the value "Streść podany tekst" ("Summarize the given text") in every entry. It should be more varied.
    • Each article occurs ~13.7 times on average (each time with a different summary). We have not yet decided how to handle this.
  3. Allegro summarization
    • As above, the "instruct" field should be more varied.
    • Some residue from HTML elements remains, such as "box:pin", "box:imagePins", "box:offerCarousel", and other tokens starting with "box:". These should be removed.
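
The fixes listed above could be sketched roughly as follows. This is only a minimal sketch, not the code from the repository: the function names, the "input" key, and every prompt variant except "Streść podany tekst" are hypothetical, and capping the number of summaries per article is just one possible way to handle the duplicates.

```python
import random
import re
from collections import defaultdict

# Residue tokens such as "box:pin" or "box:imagePins".
BOX_RESIDUE = re.compile(r"\bbox:\w+")
# Any run of two or more whitespace characters.
WHITESPACE_RUN = re.compile(r"\s{2,}")

# Hypothetical prompt variants; only the first one appears in the datasets.
INSTRUCTION_VARIANTS = [
    "Streść podany tekst",
    "Napisz krótkie streszczenie poniższego tekstu",
    "Podsumuj poniższy artykuł w kilku zdaniach",
]


def clean_text(text: str) -> str:
    """Drop "box:" residue, then collapse whitespace runs to one space."""
    text = BOX_RESIDUE.sub(" ", text)
    return WHITESPACE_RUN.sub(" ", text).strip()


def pick_instruction(rng: random.Random) -> str:
    """Pick one prompt variant so the "instruct" field is not constant."""
    return rng.choice(INSTRUCTION_VARIANTS)


def cap_duplicates(entries, max_per_article=3, text_key="input"):
    """Keep at most max_per_article entries per identical article text.

    text_key is an assumed field name for the article body.
    """
    seen = defaultdict(int)
    kept = []
    for entry in entries:
        key = entry[text_key]
        if seen[key] < max_per_article:
            seen[key] += 1
            kept.append(entry)
    return kept
```

Note that `clean_text` also collapses paragraph breaks (for example "\n\n") into a single space. If paragraph structure matters, the pattern could be narrowed to `[ \t]{2,}`.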

jansowa commented 3 months ago

I am working on some of these problems at https://github.com/jansowa/speakleash-instruct-creator/tree/46-summarization-scripts-verification