Synthesizing a paragraph-joining dataset with ChatGPT

voidf commented 1 year ago

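Asking ChatGPT to output line-break indices directly often yields long, meaningless strings of digits, so the prompt asks it to return the joined paragraphs instead.

The prompt tuned so far is: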

I need your help to solve a breakline elimination problem,
given some text exported from PDF, 
some breaklines may split the text as meaningful paragraphs but others could separate them unexpectly,
in this case, you should join adjacent lines if they can form a meaningful paragraph and replace the breakline symbols as spaces,
leave the indexing information and some lines that can not form a paragragh as it is.
Leave the breaklines that can split the text as meaningful paragraphs.
The input may contains a whole line of pagination infos and indexing infos,
you should not join them to the adjacent paragraphs.
You should only determine the breaklines should be keep or replaced,
and leave other text as it is.
Please do not add more word to the input text, 
do not answer any other word except the task output,
do not add any characters to the end of the task output.
Here is the input text:

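Asking for the joined paragraphs causes a few problems:

ChatGPT filters out some noise, so its output does not match the input exactly.
ChatGPT tends to complete an unfinished paragraph at the end of a request, so when the unfinished result from one request (including ChatGPT's completion) is carried into the next request, parts of the text are duplicated.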

Problem to solve at this stage:

Recover the indices of the line breaks that were eliminated in the output paragraphs.

voidf commented 1 year ago

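Finished the conversion from raw-text annotation to line-break indices; this can also be used for human annotation. (It uses global alignment via longest common subsequence.) Single-file implementation: https://github.com/voidf/parallel_corpus_mnbvc/blob/main/alignment/get_labeled_index.py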
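For readers who want the idea without opening the script, here is a minimal sketch of LCS-based index recovery. It is not the code in get_labeled_index.py: the function names `lcs_pairs` and `removed_breaks`, the word-level tokenization, and the 0-based "break after line k" convention are all assumptions made for illustration. Words of the input and output are aligned by a longest common subsequence, and a break is treated as removed when the matched words on both sides of it land in the same output paragraph.

```python
from typing import Dict, List, Set, Tuple


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to read off one optimal alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def removed_breaks(input_lines: List[str], output_paragraphs: List[str]) -> Set[int]:
    """Return every k such that the break after input_lines[k] was joined away."""
    in_words: List[str] = []
    in_line: List[int] = []
    for k, line in enumerate(input_lines):
        for word in line.split():
            in_words.append(word)
            in_line.append(k)
    out_words: List[str] = []
    out_para: List[int] = []
    for p, para in enumerate(output_paragraphs):
        for word in para.split():
            out_words.append(word)
            out_para.append(p)
    # For each input line, remember the output paragraph of its first
    # and last matched word.
    first: Dict[int, int] = {}
    last: Dict[int, int] = {}
    for i, j in lcs_pairs(in_words, out_words):
        k = in_line[i]
        first.setdefault(k, out_para[j])
        last[k] = out_para[j]
    removed: Set[int] = set()
    for k in range(len(input_lines) - 1):
        if k in last and (k + 1) in first and last[k] == first[k + 1]:
            removed.add(k)
    return removed
```

Because unmatched words are simply skipped by the alignment, this sketch tolerates both of the problems listed earlier in the thread (dropped noise and extra completed text), as long as enough words on each side of a break still match. The quadratic table is fine for short inputs but would need a more economical alignment for long documents.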

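The gpt-3.5 output still has some problems, which can be corrected through human annotation. The dataset has been uploaded to hf: https://huggingface.co/datasets/bot-yaya/EN_PARAGRAPH_GPT_JOINED

The human-annotated dataset has been uploaded to hf: https://huggingface.co/datasets/bot-yaya/EN_PARAGRAPH_HUMAN_JOINED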

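The datasets above can be downloaded with this script and restored into paragraph-joined files so that a human can inspect the result: https://github.com/voidf/parallel_corpus_mnbvc/blob/main/alignment/download_and_visualize.py . These files can be edited, converted back into index form with get_labeled_index.py, and uploaded to hf with the https://github.com/voidf/parallel_corpus_mnbvc/blob/main/alignment/push_idx_to_hf.py script.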
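The reverse direction, going from indices back to readable paragraphs, can be sketched as below. This is only an illustration and not the logic of download_and_visualize.py: the helper name `join_lines` is hypothetical, and it assumes the labels are 0-based indices k meaning "the break after line k was removed", which may differ from the actual dataset schema. Joined lines are separated by a single space, as the prompt requests.

```python
from typing import Iterable, List


def join_lines(lines: List[str], removed_breaks: Iterable[int]) -> List[str]:
    """Rebuild paragraphs by joining across every removed break (assumed 0-based)."""
    removed = set(removed_breaks)
    paragraphs: List[str] = []
    parts: List[str] = []
    for k, line in enumerate(lines):
        parts.append(line)
        # A break that was kept closes the current paragraph.
        if k not in removed:
            paragraphs.append(" ".join(parts))
            parts = []
    # Flush any remainder, e.g. if the last index was (incorrectly) marked removed.
    if parts:
        paragraphs.append(" ".join(parts))
    return paragraphs
```

For example, `join_lines(["Page 1", "This is a", "broken paragraph.", "Next one"], [1])` joins only the second and third lines and leaves the other two as their own paragraphs.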

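The gpt script now runs stably in initial trials: https://github.com/voidf/parallel_corpus_mnbvc/blob/main/alignment/join_use_chatgpt.py . Paragraph annotations for about 100 articles have been collected so far. In practice, parallel requests tend to overload OpenAI and trigger server errors, so requests are sent serially for now.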
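A serial request loop of this kind might look like the sketch below. It does not reproduce join_use_chatgpt.py: `run_serially` and its parameters are hypothetical, `call_model` stands in for whatever function actually sends one chunk to the API, and the retry with exponential backoff is an addition of this sketch that the comment above does not mention.

```python
import time
from typing import Callable, List


def run_serially(
    chunks: List[str],
    call_model: Callable[[str], str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Send chunks one at a time, retrying a failed chunk with exponential backoff."""
    results: List[str] = []
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            try:
                results.append(call_model(chunk))
                break
            except Exception:
                # Broad catch for brevity; real code should catch only the
                # client's transient server or rate-limit errors.
                if attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in normal use the default `time.sleep` applies.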

liyongsea commented 1 year ago

solved by https://github.com/liyongsea/parallel_corpus_mnbvc/pull/31

liyongsea / parallel_corpus_mnbvc

Synthesizing a paragraph-joining dataset with ChatGPT #18