docx -> txt requirements - Githubissues

liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project

Apache License 2.0


docx -> txt requirements #55

Closed voidf closed 7 months ago

voidf commented 10 months ago

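I went back through the script that used to force alignment on text converted from PDFs: https://github.com/liyongsea/parallel_corpus_mnbvc/blob/main/alignment/script/preprocess.py. Here are a few denoising requirements for the newly converted docx files: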

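1. Drop headers, footers, page numbers, and document numbers. Footers include the block of footnotes and citations for that page; where the matching superscript reference marks appear in the body text, they need to be removed as well.
2. Drop the redundant parts of the first page, such as images, the document number, and fixed-format boilerplate. For example, text like [The meeting was held at such-and-such time in such-and-such place] or [The speaker at the meeting was so-and-so, from such-and-such country] should be dropped as whole paragraphs.
3. Drop the table of contents, if it can be identified.
4. The superscript reference marks mentioned in item 1 usually share a uniform hyperlink or formatting. When they appear inside paragraph text and break up the sentence, as in [Such-and-such meeting1 proposed such-and-such document2, which mentions...], the digits 1 and 2 must be removed from the paragraph.
5. As far as possible, provide contiguous, paragraph-level text blocks with their context intact, so that the downstream steps can align them.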
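
To make the requests above concrete, here is a minimal sketch of items 1, 3, 4 and 5 using only the Python standard library. It is not the project's actual converter. The function name `docx_to_paragraphs` and the two helpers are hypothetical, and it assumes the files are well-formed WordprocessingML in which headers, footers and footnote bodies live in their own package parts (so reading only `word/document.xml` skips them), reference marks are either footnote/endnote reference runs or runs formatted as superscript, and table-of-contents paragraphs use a style whose id starts with `TOC`. Documents produced by a PDF converter may not follow these conventions, and the first-page boilerplate from item 2 would need separate, corpus-specific rules.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by every element in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _is_reference_mark(run):
    """True for footnote/endnote reference runs and for superscript runs."""
    if run.find("w:footnoteReference", NS) is not None:
        return True
    if run.find("w:endnoteReference", NS) is not None:
        return True
    vert = run.find("w:rPr/w:vertAlign", NS)
    return vert is not None and vert.get(f"{{{W}}}val") == "superscript"


def _is_toc_paragraph(par):
    """Assumption: TOC entries carry a paragraph style id starting with 'TOC'."""
    style = par.find("w:pPr/w:pStyle", NS)
    return style is not None and style.get(f"{{{W}}}val", "").upper().startswith("TOC")


def docx_to_paragraphs(docx_bytes):
    """Return the body paragraphs of a .docx file as a list of clean strings."""
    # Only the main document part is read; header, footer and footnote
    # parts are separate files in the archive and are never opened.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for par in root.iter(f"{{{W}}}p"):
        if _is_toc_paragraph(par):
            continue
        pieces = []
        for run in par.iter(f"{{{W}}}r"):
            if _is_reference_mark(run):
                continue  # drop the "1" and "2" that would split the sentence
            pieces.extend(t.text or "" for t in run.findall("w:t", NS))
        text = "".join(pieces).strip()
        if text:  # skip empty paragraphs so the output stays contiguous
            paragraphs.append(text)
    return paragraphs
```

Each returned string is one whole paragraph, so writing them out one per line gives the aligner paragraph-level blocks. Note that dropping every superscript run would also remove legitimate superscripts such as ordinals or exponents; a stricter variant could keep only the footnote/endnote reference check.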

voidf commented 10 months ago

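Supplementary package: https://www.aliyundrive.com/s/zK3PeY9yyom Main package: https://www.aliyundrive.com/s/vnrHzcpiRF6 The main package is larger than 4 GB and cannot be packed as a self-extracting archive, so it has to be converted with this tool: https://github.com/ciscolxh/aliyunshare/tree/main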