Closed: meettyj closed this issue 5 years ago
Also, do you merge all of the data (Europarl, ParaCrawl, Common Crawl, ...) into one file (like train.lower.en) in the para folder?
Your paper says: "We filtered all the corpora except src-pe pairs with basic rules to guarantee the quality. A high-quality sentence pair should both start with a Unicode letter character, the lengths of them are equal to or less than 70, and the length ratio of the source sentence and the target one should be bounded by 1/3 and 3."
I was wondering whether you could make the filtering code public so we can make better use of it. Thanks a lot!
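#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

biling = sys.argv[1]  # input file: one sentence pair per line, source and target separated by a tab

def clean():
    with open(biling) as f:
        for line in f:
            line = line.strip()
            fields = line.split('\t')
            if len(fields) != 2:  # skip malformed lines instead of raising ValueError
                continue
            src, tgt = fields
            len_src = len(src.split())  # lengths are counted in whitespace-separated tokens
            len_tgt = len(tgt.split())
            if len_src == 0 or len_tgt == 0:  # avoid division by zero on an empty side
                continue
            if len_src > 70 or len_tgt > 70:
                continue
            len_ratio = float(len_src) / float(len_tgt)
            if 0.33333 <= len_ratio <= 3.0:
                print(line)

clean()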
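Note that the script above covers the length and ratio rules but not the "start with a Unicode letter character" rule quoted from the paper. Below is a minimal sketch of how all three rules could be combined. It is not the authors' code: the names `keep_pair` and `filter_lines` are made up for illustration, and it assumes that length means whitespace-separated tokens (as in the script above) and that "Unicode letter" can be approximated by Python's `str.isalpha()` on the first character.

```python
def keep_pair(src, tgt, max_len=70, max_ratio=3.0):
    """Return True if the pair passes the three rules quoted from the paper.

    Hypothetical helper; the thresholds default to the values in the quote.
    """
    src, tgt = src.strip(), tgt.strip()
    # Rule 1: both sides start with a letter (str.isalpha is Unicode-aware).
    if not src or not tgt or not (src[0].isalpha() and tgt[0].isalpha()):
        return False
    len_src, len_tgt = len(src.split()), len(tgt.split())
    # Rule 2: each side has at most max_len whitespace-separated tokens.
    if len_src > max_len or len_tgt > max_len:
        return False
    # Rule 3: ratio within [1/max_ratio, max_ratio], checked without
    # division so an exact 1:3 pair is kept and nothing divides by zero.
    return len_src <= max_ratio * len_tgt and len_tgt <= max_ratio * len_src


def filter_lines(lines):
    """Yield the tab-separated pairs from `lines` that pass keep_pair."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and keep_pair(parts[0], parts[1]):
            yield "\t".join(parts)


# Small in-memory demo; for a real corpus, pass an open file object instead.
sample = ["Hello world\tHallo Welt\n", "123 go\tlos\n", "a\tb c d e\n"]
for kept in filter_lines(sample):
    print(kept)
```

To run it over a file, something like `filter_lines(open(path, encoding="utf-8"))` should work, since the function only needs an iterable of lines. Comparing with multiplication instead of `0.33333` sidesteps the rounding question at the 1/3 boundary.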