liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0

[UN corpus] Download PDFs from the UN Digital Library, convert them to text, and upload to Hugging Face #4

Closed liyongsea closed 1 year ago

liyongsea commented 1 year ago

https://digitallibrary.un.org/?ln=en currently contains

The goal is to download all the documents, convert them to text, and upload them to Hugging Face.

Update 0413 - download data after 2000

method

related data

liyongsea commented 1 year ago
Wzixiao commented 1 year ago

Update 0427

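  1. The links, sorted by date, are available in the dataset at "https://huggingface.co/datasets/ranWang/UN_PDF_RECORD_SET"
  2. The URL information for documents after 2000 is in the same dataset
  3. All PDF files have been downloaded
  4. Text extraction from all PDFs is complete
  5. Batched dataset upload is complete
  6. Upload finished; the latest dataset is at "https://huggingface.co/datasets/ranWang/UN_Historical_PDF_Article_Text_Corpus"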
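
For anyone reproducing steps 1, 2 and 5, here is a rough sketch of the record selection and batching logic. It is not the script used for this issue. The record fields `url` and `date` are assumed names (the actual columns of UN_PDF_RECORD_SET are not shown in this thread), and "after 2000" is assumed to include the year 2000 itself.

```python
from datetime import date
from itertools import islice


def select_records(records, first_year=2000):
    """Keep records dated on or after 1 January of first_year, oldest first.

    Each record is assumed to be a dict with an ISO "date" string
    (YYYY-MM-DD) and a "url" string; both field names are hypothetical.
    """
    cutoff = date(first_year, 1, 1)
    kept = [r for r in records if date.fromisoformat(r["date"]) >= cutoff]
    return sorted(kept, key=lambda r: date.fromisoformat(r["date"]))


def batched(items, size):
    """Yield consecutive lists of at most `size` items, e.g. for batched uploads."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk
```

Each batch from `batched(select_records(records), size)` could then be downloaded, converted and uploaded on its own, so a failed batch can be retried without redoing the rest.
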
liyongsea commented 1 year ago

The PDF-to-text script is missing.

liyongsea commented 1 year ago
Wzixiao commented 1 year ago

New data source: https://documents.un.org

New issue: https://github.com/liyongsea/parallel_corpus_mnbvc/issues/48