Closed by liyongsea 1 year ago
https://digitallibrary.un.org/?ln=en currently contains
The goal is to download all of the documents, convert them to plain text, and upload the result to Hugging Face.
method
related data
Update 0427
The PDF-to-text script is missing.
New data source: https://documents.un.org
new issue: https://github.com/liyongsea/parallel_corpus_mnbvc/issues/48
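For the missing PDF-to-text step noted in Update 0427, here is a minimal sketch of what such a script could look like. It is not the project's actual script. The function names `extract_with_pypdf` and `convert_tree`, the folder layout, and the choice of the third-party `pypdf` package are all assumptions made for illustration. The extractor is passed in as a parameter so a different PDF backend can be swapped in.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def extract_with_pypdf(pdf_path: Path) -> str:
    """Extract text page by page. Assumes the third-party `pypdf` package is installed."""
    from pypdf import PdfReader  # imported lazily so convert_tree works with other extractors

    reader = PdfReader(str(pdf_path))
    # Pages without a text layer (e.g. scans) may yield no text, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def convert_tree(
    src_dir: Path,
    dst_dir: Path,
    extract: Callable[[Path], str] = extract_with_pypdf,
) -> list[Path]:
    """Write one .txt per .pdf found under src_dir, mirroring the folder layout in dst_dir."""
    written = []
    for pdf_path in sorted(src_dir.rglob("*.pdf")):
        out_path = dst_dir / pdf_path.relative_to(src_dir).with_suffix(".txt")
        if out_path.exists():
            continue  # already converted on an earlier run
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_tree(Path("pdf"), Path("txt"))` (hypothetical folder names) would convert every downloaded PDF once and skip files that already have a text counterpart, so an interrupted run can be resumed.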
Update 0413 - download data after 2000
method
related data