liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0

Refactor preprocess script #43

Closed voidf closed 1 year ago

liyongsea commented 1 year ago

Display it in a notebook.

Wzixiao commented 1 year ago

In the "ranWang/un_pdf_random_preprocessed" dataset (15293 records), what has been filtered out so far is summarized in the table below (covering 5 languages).

| Reason | Reason count | Lines count | Frequency |
| --- | --- | --- | --- |
| line_freq | 1817497 | 1817599 | 0.0387 |
| overall_tk_freq | 372799 | 372874 | 0.0079 |
| annotation block | 133609 | 899462 | 0.0192 |
| tk_freq | 165450 | 165451 | 0.0035 |
| likely page number | 522006 | 522076 | 0.0111 |
| total | 3011361 | 3777462 | 0.0804 |

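The overall proportion of filtered lines is 0.08042826121888653. The overall proportion of filtered files is 0.3121766902073681 (these are files with only one or two pages, which are all noise and contain almost no data).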
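As a sanity check on the numbers above, here is a minimal sketch that recomputes the totals and the per-reason frequencies. It assumes that Frequency means "filtered lines for this reason divided by all lines in the dataset". The thread does not state this, but it is consistent with the total frequency matching the overall filtered-line ratio. The variable names are made up for illustration and are not taken from the preprocess script.

```python
# Per-reason counts copied from the table: (reason count, lines count).
filtered = {
    "line_freq": (1817497, 1817599),
    "overall_tk_freq": (372799, 372874),
    "annotation block": (133609, 899462),
    "tk_freq": (165450, 165451),
    "likely page number": (522006, 522076),
}

# Overall filtered-line ratio as reported in the comment.
FILTERED_LINE_RATIO = 0.08042826121888653

total_reason_count = sum(reasons for reasons, _ in filtered.values())
total_lines_count = sum(lines for _, lines in filtered.values())

# Assumption: ratio = filtered lines / all lines, so the dataset size
# in lines can be backed out from the reported ratio.
implied_dataset_lines = total_lines_count / FILTERED_LINE_RATIO

# Per-reason frequency under the same assumption, rounded like the table.
frequencies = {
    reason: round(lines / implied_dataset_lines, 4)
    for reason, (_, lines) in filtered.items()
}

print("total reason count:", total_reason_count)
print("total lines count:", total_lines_count)
for reason, freq in frequencies.items():
    print(f"{reason}: {freq}")
```

If the assumption holds, the printed frequencies should agree with the Frequency column to four decimal places.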