speakleash / speakleash-instruct-creator

Generate instruction datasets for fine-tuning purposes.

Instructions fixes #21

Open IgorTest19 opened 8 months ago

IgorTest19 commented 8 months ago

1) polish-summaries-corpus.py - fill the source field with a TODO label (done in PR#23)
2) speakleash-categorization.py - change the duplicated information in the output (done in PR#23)
3) speakleash_forums_questions.py - remove duplicated text in the source_name field (done in PR#23)
4) speakleash_forums_questions.py - remove ordinal numbers from input fields. Examples:

"input": "3) czy starasz się o dziecko?",
"input": "4) czy planujesz jakieś zmiany?",

5) plwiki_random_word_pos.py - fix the input field error in the script (done in PR#35)
6) polish-news-summarization.py - modify the script to remove instructions with None in the input (done in PR#28)
7) ipipan_polqa_questions.py - add deduplication and fix multiple outputs (done in PR#30)
8) poquad_text_extraction.py - add deduplication (done in PR#32)
9) human_annotators_common_errors.py - deduplicate examples by answers, format answers for further processing (done in PR#33)
10) human_expert_gec_dataset.py - deduplicate examples by answers, format answers for further processing (done in PR#33)
-) Examine other scripts for fixes
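
For item 4 (still open) and the answer-based deduplication in items 9 and 10, a minimal sketch of what the cleanup could look like. This is not the code from the scripts or the linked PRs. The helper names `strip_ordinal` and `deduplicate_by_output`, and the assumption that each instruction is a dict with string values under "input" and "output", are hypothetical.

```python
import re

# Leading ordinal such as "3) " or "12. " at the start of the text.
# Limited to 1-3 digits and requires whitespace after the marker,
# so values like "3.5 kg" are left alone (assumption: ordinals are short).
_ORDINAL_PREFIX = re.compile(r"^\s*\d{1,3}\s*[).]\s+")


def strip_ordinal(text: str) -> str:
    """Remove a single leading ordinal marker from an input string."""
    return _ORDINAL_PREFIX.sub("", text, count=1)


def deduplicate_by_output(records, key="output"):
    """Keep the first record for each distinct value under `key`.

    Assumes the values are hashable (e.g. strings); the "output"
    key name is an assumption, not taken from the scripts.
    """
    seen = set()
    unique = []
    for rec in records:
        value = rec.get(key)
        if value in seen:
            continue
        seen.add(value)
        unique.append(rec)
    return unique


if __name__ == "__main__":
    print(strip_ordinal("3) czy starasz się o dziecko?"))
    print(strip_ordinal("4) czy planujesz jakieś zmiany?"))
```

The pattern only touches a marker at the very start of the string, so numbers that appear later in a question stay as they are.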