liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0
8 stars 5 forks

Manually reverse-engineered dataset #19

Closed liyongsea closed 11 months ago

liyongsea commented 1 year ago
Wzixiao commented 1 year ago
The following compares the 'manually' labeled data with 'gpt':

              precision  recall  f1-score  support
False              0.95    0.97      0.96    24100
True               0.89    0.82      0.86     7109
Accuracy                             0.94    31209
Macro avg          0.92    0.90      0.91    31209
Weighted avg       0.94    0.94      0.94    31209

The following compares the 'manually' labeled data with 'PuncturationAndCapitalLetterDetector':

              precision  recall  f1-score  support
False              0.95    0.80      0.87    24655
True               0.53    0.84      0.65     6554
Accuracy                             0.81    31209
Macro avg          0.74    0.82      0.76    31209
Weighted avg       0.86    0.81      0.82    31209

The following compares 'gpt' with 'PuncturationAndCapitalLetterDetector':

              precision  recall  f1-score  support
False              0.89    0.76      0.82    72993
True               0.56    0.76      0.65    28726
Accuracy                             0.76   101719
Macro avg          0.73    0.76      0.73   101719
Weighted avg       0.80    0.76      0.77   101719
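
For readers who want to reproduce figures of this kind, here is a minimal sketch of how per-class precision, recall, F1, support, accuracy, and the macro and weighted averages can be computed from two parallel lists of boolean labels. It is not the code used in this thread: the function name `binary_report` and the toy label lists are hypothetical, and the sketch assumes each label marks whether a line break is a real paragraph break.

```python
from collections import Counter


def binary_report(y_true, y_pred):
    """Compute per-class metrics plus accuracy, macro and weighted averages.

    Hypothetical helper for illustration; y_true and y_pred are equally long
    sequences of booleans (reference labels and detector labels).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    # Count each (reference, predicted) combination once.
    pairs = Counter(zip(y_true, y_pred))
    total = len(y_true)
    keys = ("precision", "recall", "f1")

    rows = {}
    for label in (False, True):
        hits = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        support = sum(n for (t, _), n in pairs.items() if t == label)
        precision = hits / predicted if predicted else 0.0
        recall = hits / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[label] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": support}

    accuracy = (pairs[(False, False)] + pairs[(True, True)]) / total
    # Macro average: plain mean over the two classes.
    macro = {k: (rows[False][k] + rows[True][k]) / 2 for k in keys}
    # Weighted average: mean over the classes, weighted by support.
    weighted = {
        k: sum(rows[c][k] * rows[c]["support"] for c in (False, True)) / total
        for k in keys
    }
    return {"rows": rows, "accuracy": accuracy,
            "macro": macro, "weighted": weighted}


# Toy data (made up for this example, not taken from the dataset).
manual = [True, False, False, True, False, False, True, False]
detector = [True, False, True, True, False, False, False, False]

report = binary_report(manual, detector)
for label, row in report["rows"].items():
    print(label, {k: round(v, 2) for k, v in row.items()})
print("accuracy", round(report["accuracy"], 2))
```

The same averaging explains the summary rows in the tables above. In the first table, for example, the macro-average precision is (0.95 + 0.89) / 2 = 0.92, while the weighted average, (0.95 × 24100 + 0.89 × 7109) / 31209 ≈ 0.94, is pulled toward the larger False class.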
Wzixiao commented 1 year ago

A reverse-engineered dataset built from books downloaded from https://www.gutenberg.org/. The dataset is 2 GB in size (for demonstration only, not an official dataset).

Dataset without manual corrections: https://huggingface.co/datasets/ranWang/books_paragraph

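*The section below only demonstrates feasibility, because some \n characters need to be corrected by hand; the dataset above has not been manually corrected.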

Below are the scores produced by "GptBatchDetector" for two texts after about 2% of their line breaks were corrected by hand:

              precision  recall  f1-score  support
False              0.99    0.97      0.98      263
True               0.93    0.97      0.95      110
accuracy                             0.97      373
macro avg          0.96    0.97      0.96      373
weighted avg       0.97    0.97      0.97      373

              precision  recall  f1-score  support
False              0.88    0.91      0.89     2589
True               0.81    0.75      0.78     1317
accuracy                             0.86     3906
macro avg          0.84    0.83      0.84     3906
weighted avg       0.85    0.86      0.85     3906
Wzixiao commented 1 year ago

First version: https://huggingface.co/datasets/ranWang/books_paragraph_test

The image below shows the accuracy obtained by running "GPTBatchDetector(token_limit=256)" on 10 of the files.

[image: 20230614213939-1346x446]