liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0
11 stars 6 forks

[Official Linux China (Linux中国) dataset] #76

Open voidf opened 4 months ago

voidf commented 4 months ago

https://huggingface.co/datasets/linux-cn/archive needs someone to write a crawler.

voidf commented 4 months ago

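We already have the Chinese text. What is still missing is the original English text, which needs to be crawled (?)

We still need to find out where the links are.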

voidf commented 2 months ago

Handing this over to 阿伟.

voidf commented 1 month ago

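Everything that could be crawled has been crawled. The data still needs cleaning and validation: https://huggingface.co/datasets/LxYxvv/linux-cn-archive. Someone else needs to be assigned to organize it into a parallel corpus.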

voidf commented 1 month ago