The "instruct" field has the same value, "Streść podany tekst" ("Summarize the given text"), in every entry. The instructions should be more varied.
Each article occurs ~13.7 times on average, each time with a different summary. We have not yet decided how to handle this.
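One possible way to vary the "instruct" value noted above is to draw it from a small pool of paraphrases. This is only a sketch: the template wordings, the "input" and "output" field names, and the helper names are assumptions for illustration and are not taken from the existing scripts.

```python
import random

# Hypothetical paraphrases of the original instruction (illustrative wording only).
INSTRUCTION_TEMPLATES = [
    "Streść podany tekst",
    "Napisz krótkie streszczenie poniższego tekstu",
    "Podsumuj poniższy artykuł w kilku zdaniach",
    "Przedstaw najważniejsze informacje z podanego tekstu",
]


def pick_instruction(rng: random.Random) -> str:
    """Return one instruction template chosen at random."""
    return rng.choice(INSTRUCTION_TEMPLATES)


def build_entry(text: str, summary: str, rng: random.Random) -> dict:
    """Build one instruction entry; 'input' and 'output' are assumed field names."""
    return {"instruct": pick_instruction(rng), "input": text, "output": summary}


# A seeded generator keeps the output reproducible between runs.
rng = random.Random(42)
entries = [build_entry(f"artykuł {i}", f"streszczenie {i}", rng) for i in range(20)]
```

Since the same article appears with several different summaries, drawing a fresh instruction for each (article, summary) pair would also make the repeated articles look less uniform.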
Allegro summarization
As above: the "instruct" field should be more varied.
There is some residue from certain HTML elements, such as "box:pin", "box:imagePins", "box:offerCarousel" and others starting with "box:". It should be removed.
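The "box:" residue could be stripped with a regular expression along these lines. This is a minimal sketch that assumes the residue always appears as a standalone token of the form "box:" followed by word characters; the function name and the sample sentence are made up for illustration.

```python
import re

# Assumption: each residue token is "box:" followed by letters, digits or underscores.
BOX_TOKEN = re.compile(r"\bbox:\w+")


def strip_box_residue(text: str) -> str:
    """Remove 'box:...' tokens and collapse the whitespace left behind."""
    cleaned = BOX_TOKEN.sub(" ", text)
    # Note: this also collapses newlines into single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "Nowy telefon box:pin w promocji box:imagePins box:offerCarousel już dziś"
print(strip_box_residue(sample))  # Nowy telefon w promocji już dziś
```

If paragraph breaks in the Allegro texts need to be preserved, the whitespace step would have to be narrowed to spaces and tabs only.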
We currently have three scripts that create text-summarization instructions:
https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/polish-news-summarization.py
https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/polish-summaries-corpus.py
https://github.com/speakleash/speakleash-instruct-creator/blob/main/instructions_scripts/allegro-summarization.py
They should be reviewed for errors, formatting, conformance to the expected JSON format, etc.
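Part of that review could be automated with a small check over each script's output. The sketch below assumes the output is a JSON list of objects with non-empty string fields "instruct", "input" and "output"; apart from "instruct", these key names and the function name are assumptions and should be adjusted to the project's actual format.

```python
import json

# Assumed schema: only "instruct" is confirmed above; the other keys are hypothetical.
REQUIRED_KEYS = {"instruct", "input", "output"}


def validate_entries(raw: str) -> list:
    """Return a list of human-readable problems found in a JSON string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value is not a list"]

    problems = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        # Check the required keys that are present, in a stable order.
        for key in sorted(REQUIRED_KEYS & entry.keys()):
            value = entry[key]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: '{key}' is empty or not a string")
    return problems
```

Calling `validate_entries` on the contents of each generated file and printing the returned list would give a quick first pass before a manual read-through.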