voidf closed this issue 10 months ago
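Supplementary package: https://www.aliyundrive.com/s/zK3PeY9yyom
Main package: https://www.aliyundrive.com/s/vnrHzcpiRF6
The main package is larger than 4 GB, so it cannot be packed as a self-extracting archive; it needs to be converted with this tool: https://github.com/ciscolxh/aliyunshare/tree/main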
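I went back through the script I used earlier to force an alignment on the text extracted from PDFs: https://github.com/liyongsea/parallel_corpus_mnbvc/blob/main/alignment/script/preprocess.py. Based on that, here are a few denoising requirements for the newly converted docx files: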