liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0

feat[en]: rule-based English paragraph join #6

Closed voidf closed 9 months ago

voidf commented 1 year ago

Some rules written for English text. The features implemented are:

voidf commented 1 year ago
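  1. Denoising (inspect by eye, then write rules)
  2. Segmentation (Chinese and English handled separately) (look for an existing library!)
  3. Have ChatGPT produce an aligned sample to evaluate against
  4. Hand the result to bertalign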
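
A rough sketch of what steps 1 and 2 could look like for English, to make the idea concrete. This is not the code from this repo: the function names (`denoise`, `join_paragraphs`, `split_sentences`) and every rule in them are assumptions made up for illustration. Steps 3 and 4 (the ChatGPT reference sample and bertalign) are left out on purpose.

```python
import re


def denoise(lines):
    """Step 1 (hypothetical rules): normalize whitespace, drop empty
    lines and lines that contain only digits (likely page numbers)."""
    cleaned = []
    for line in lines:
        line = re.sub(r"\s+", " ", line).strip()
        if not line or re.fullmatch(r"\d+", line):
            continue
        cleaned.append(line)
    return cleaned


def join_paragraphs(lines):
    """Rule-based paragraph join (hypothetical rule): if the previous
    line does not end with sentence-final punctuation, treat the
    current line as its continuation."""
    paragraphs = []
    for line in lines:
        if paragraphs and not re.search(r"[.!?:;]$", paragraphs[-1]):
            prev = paragraphs[-1]
            if prev.endswith("-"):
                # Assume a word was hyphenated across the line break.
                paragraphs[-1] = prev[:-1] + line
            else:
                paragraphs[-1] = prev + " " + line
        else:
            paragraphs.append(line)
    return paragraphs


def split_sentences(paragraph):
    """Step 2, naive English segmentation: split after . ! ? when the
    next token starts with an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [p for p in parts if p]


if __name__ == "__main__":
    raw = [
        "The committee adopted the resolu-",
        "tion without a vote.  ",
        "12",
        "",
        "It then turned to",
        "the next agenda item. The meeting ended.",
    ]
    for paragraph in join_paragraphs(denoise(raw)):
        print(split_sentences(paragraph))
```

The hyphen rule will wrongly merge real compounds such as "rule-" / "based", and the sentence splitter breaks on abbreviations like "Mr. Smith", which is a good reason to look for an existing segmentation library as step 2 suggests.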
voidf commented 1 year ago

Articles that might be useful: