Manually reverse-engineered dataset

liyongsea commented 1 year ago

Wzixiao commented 1 year ago

fix: human-annotated dataset (19 files; the amount of data has increased 7.6-fold)

The following compares 'manually' with 'gpt':

	precision	recall	f1-score	support
False	0.95	0.97	0.96	24100
True	0.89	0.82	0.86	7109
Accuracy			0.94	31209
Macro avg	0.92	0.90	0.91	31209
Weighted avg	0.94	0.94	0.94	31209

The following compares 'manually' with 'PuncturationAndCapitalLetterDetector':

	precision	recall	f1-score	support
False	0.95	0.80	0.87	24655
True	0.53	0.84	0.65	6554
Accuracy			0.81	31209
Macro avg	0.74	0.82	0.76	31209
Weighted avg	0.86	0.81	0.82	31209

The following compares 'gpt' with 'PuncturationAndCapitalLetterDetector':

	precision	recall	f1-score	support
False	0.89	0.76	0.82	72993
True	0.56	0.76	0.65	28726
Accuracy			0.76	101719
Macro avg	0.73	0.76	0.73	101719
Weighted avg	0.80	0.76	0.77	101719
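
For readers who want to reproduce tables of this shape, here is a minimal sketch of how per-class precision, recall, F1 and support can be computed from two boolean label sequences. It is not the code used in this PR: the function name `binary_report` and the toy labels are hypothetical, and it assumes the first-named source in each comparison (for example 'manually') is treated as the reference.

```python
from collections import Counter


def binary_report(reference, predicted):
    """Per-class precision, recall, F1 and support for boolean labels.

    `reference` is assumed to be the trusted labelling and `predicted`
    the labelling under evaluation. Both names are illustrative.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label sequences must be non-empty and equally long")

    # Count each (reference, predicted) combination once.
    pairs = Counter(zip(reference, predicted))
    report = {}
    for cls in (False, True):
        tp = pairs[(cls, cls)]        # labelled cls by both
        fp = pairs[(not cls, cls)]    # predicted cls, reference disagrees
        fn = pairs[(cls, not cls)]    # reference cls, prediction disagrees
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn}
    report["accuracy"] = (pairs[(False, False)] + pairs[(True, True)]) / len(reference)
    return report


# Toy labels for illustration only, not taken from the dataset.
manual = [True, False, False, True, False, False]
detector = [True, False, True, True, False, False]
report = binary_report(manual, detector)
for key, value in report.items():
    print(key, value)
```

The support of a class is the number of reference items carrying that label, which is why the two support values in each table add up to the total on the accuracy row.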

Wzixiao commented 1 year ago

A reverse-engineered dataset built from books downloaded from https://www.gutenberg.org/; the dataset is 2 GB in size (for demonstration only, not an official dataset)

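*The part below only demonstrates feasibility, because some \n have to be corrected manually; the dataset above has not been manually corrected.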

Below are the metrics produced by "GptBatchDetector" for two texts after roughly 2% of their line breaks were manually corrected:

	precision	recall	f1-score	support
False	0.99	0.97	0.98	263
True	0.93	0.97	0.95	110
accuracy			0.97	373
macro avg	0.96	0.97	0.96	373
weighted avg	0.97	0.97	0.97	373

	precision	recall	f1-score	support
False	0.88	0.91	0.89	2589
True	0.81	0.75	0.78	1317
accuracy			0.86	3906
macro avg	0.84	0.83	0.84	3906
weighted avg	0.85	0.86	0.85	3906
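
As a sanity check on how the "macro avg" and "weighted avg" rows relate to the per-class rows, the sketch below recomputes both averages for the precision column of the first of the two tables above (0.99 and 0.93, with supports 263 and 110). The helper names are hypothetical. Because the inputs are the rounded values printed in the table, a recomputation like this only approximates the reported figures and can differ in the last digit for other columns or tables.

```python
def macro_avg(values):
    """Unweighted mean over classes: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean over classes weighted by each class's support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Precision and support for the False and True rows of the 373-item table.
precision = [0.99, 0.93]
supports = [263, 110]

macro_p = macro_avg(precision)
weighted_p = weighted_avg(precision, supports)
print(round(macro_p, 2), round(weighted_p, 2))
```

The weighted average is pulled toward the False row because that class has more than twice the support of the True class.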

Wzixiao commented 1 year ago

The image below shows the accuracy obtained by running "GPTBatchDetector(token_limit=256)" on 10 of the files: [image: 20230614213939-1346x446]

liyongsea / parallel_corpus_mnbvc