Closed liyongsea closed 11 months ago
Comparison of 'manually' with 'gpt':

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.95 | 0.97 | 0.96 | 24100 |
| True | 0.89 | 0.82 | 0.86 | 7109 |
| Accuracy | | | 0.94 | 31209 |
| Macro avg | 0.92 | 0.90 | 0.91 | 31209 |
| Weighted avg | 0.94 | 0.94 | 0.94 | 31209 |
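
For readers unfamiliar with this report layout (it appears to follow the format of scikit-learn's `classification_report`), here is a minimal sketch of how the per-class numbers are defined for boolean labels. The function name `binary_report` and the sample labels are hypothetical and are not taken from this issue; which side of each comparison serves as the reference labels is also an assumption.

```python
from collections import Counter


def binary_report(y_true, y_pred):
    """Return per-class precision, recall, F1 and support, plus accuracy.

    y_true holds the reference labels and y_pred the labels being evaluated.
    Both are equal-length sequences of booleans (hypothetical inputs).
    """
    # Count every (reference, predicted) pair once.
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for cls in (False, True):
        tp = pairs[(cls, cls)]        # reference is cls, predicted cls
        fp = pairs[(not cls, cls)]    # reference is the other class, predicted cls
        fn = pairs[(cls, not cls)]    # reference is cls, predicted the other class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,       # number of reference items in this class
        }
    total = len(y_true)
    correct = pairs[(True, True)] + pairs[(False, False)]
    accuracy = correct / total if total else 0.0
    return report, accuracy


# Tiny made-up example, only to show the call shape.
y_true = [True, False, False, True, False]
y_pred = [True, False, True, False, False]
report, accuracy = binary_report(y_true, y_pred)
print(report[True], accuracy)
```

The support column counts reference items per class, so it depends on which label set is treated as the reference.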
Comparison of 'manually' with 'PuncturationAndCapitalLetterDetector':

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.95 | 0.80 | 0.87 | 24655 |
| True | 0.53 | 0.84 | 0.65 | 6554 |
| Accuracy | | | 0.81 | 31209 |
| Macro avg | 0.74 | 0.82 | 0.76 | 31209 |
| Weighted avg | 0.86 | 0.81 | 0.82 | 31209 |
Comparison of 'gpt' with 'PuncturationAndCapitalLetterDetector':

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.89 | 0.76 | 0.82 | 72993 |
| True | 0.56 | 0.76 | 0.65 | 28726 |
| Accuracy | | | 0.76 | 101719 |
| Macro avg | 0.73 | 0.76 | 0.73 | 101719 |
| Weighted avg | 0.80 | 0.76 | 0.77 | 101719 |
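A reverse dataset built from books downloaded from https://www.gutenberg.org/, 2 GB in size (for demonstration only, not an official dataset).
Dataset without manual corrections: https://huggingface.co/datasets/ranWang/books_paragraph
*The part below only demonstrates feasibility, because some \n characters need manual correction; the dataset above has not been manually corrected.
Below are the scores produced by "GptBatchDetector" on two texts after manually correcting about 2% of their line breaks: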

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.99 | 0.97 | 0.98 | 263 |
| True | 0.93 | 0.97 | 0.95 | 110 |
| accuracy | | | 0.97 | 373 |
| macro avg | 0.96 | 0.97 | 0.96 | 373 |
| weighted avg | 0.97 | 0.97 | 0.97 | 373 |

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.88 | 0.91 | 0.89 | 2589 |
| True | 0.81 | 0.75 | 0.78 | 1317 |
| accuracy | | | 0.86 | 3906 |
| macro avg | 0.84 | 0.83 | 0.84 | 3906 |
| weighted avg | 0.85 | 0.86 | 0.85 | 3906 |
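
As a reading aid, the sketch below shows how the "macro avg" and "weighted avg" rows relate to the per-class rows, using the precision and support values from the first of the two tables above (263 False and 110 True items). The helper names `macro_avg` and `weighted_avg` are hypothetical. Because the inputs are the rounded values shown in the table, the results only approximate averages computed from unrounded scores.

```python
def macro_avg(values):
    """Unweighted mean over classes: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean over classes, weighted by how many reference items each class has."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Precision and support for the False and True rows (rounded table values).
precision = [0.99, 0.93]
support = [263, 110]

macro = macro_avg(precision)
weighted = weighted_avg(precision, support)
print(f"macro precision: {macro:.2f}, weighted precision: {weighted:.2f}")
```

The weighted average leans toward the False class because it has more than twice the support of the True class.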
First version: https://huggingface.co/datasets/ranWang/books_paragraph_test
The image below shows the accuracy obtained by running "GPTBatchDetector(token_limit=256)" on 10 of the files.