lovecambi / qebrain

machine translation and quality estimation
BSD 2-Clause "Simplified" License
34 stars 18 forks

Can you opensource the code of filtering data? #8

Closed meettyj closed 5 years ago

meettyj commented 5 years ago

Your paper says: "We filtered all the corpora except src-pe pairs with basic rules to guarantee the quality. A high-quality sentence pair should both start with a Unicode letter character, the lengths of them are equal to or less than 70, and the length ratio of the source sentence and the target one should be bounded by 1/3 and 3."

Could you make the filtering code public so that we can make better use of it? Thanks a lot!

meettyj commented 5 years ago

Also, did you merge all of the data (Europarl, ParaCrawl, Common Crawl, ...) into one file (like train.lower.en) in the para folder?

lovecambi commented 5 years ago

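#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

biling = sys.argv[1]  # input file: one tab-delimited parallel sentence pair per line


def clean():
    with open(biling) as f:
        for line in f:
            line = line.strip()

            # Skip blank or malformed lines that are not exactly "source<TAB>target".
            fields = line.split('\t')
            if len(fields) != 2:
                continue
            src, tgt = fields

            # Lengths are counted in whitespace-separated tokens.
            len_src = len(src.split())
            len_tgt = len(tgt.split())

            # Drop pairs where either side is longer than 70 tokens.
            if len_src > 70 or len_tgt > 70:
                continue

            # Keep pairs whose source/target length ratio lies between 1/3 and 3.
            len_ratio = float(len_src) / float(len_tgt)
            if len_ratio <= 3.0 and len_ratio >= 0.33333:
                print(line)


clean()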
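
The script above covers the length and length-ratio rules, but not the first rule quoted from the paper (both sentences start with a Unicode letter). A minimal sketch of how all three rules might be combined is shown below. The helper names `starts_with_letter` and `keep_pair` are hypothetical and not part of the repository, and counting length in whitespace-separated tokens is an assumption carried over from the script, not something the paper states.

```python
import unicodedata


def starts_with_letter(sentence):
    """Return True if the first non-space character is a Unicode letter."""
    sentence = sentence.strip()
    # Unicode general categories for letters all start with "L" (Lu, Ll, Lt, Lm, Lo).
    return bool(sentence) and unicodedata.category(sentence[0]).startswith("L")


def keep_pair(src, tgt, max_len=70, max_ratio=3):
    """Apply the three rules quoted from the paper to one sentence pair."""
    # Rule 1: both sides must start with a Unicode letter.
    if not (starts_with_letter(src) and starts_with_letter(tgt)):
        return False
    # Rule 2: each side has at most max_len tokens (assumed: whitespace tokens).
    len_src, len_tgt = len(src.split()), len(tgt.split())
    if len_src > max_len or len_tgt > max_len:
        return False
    # Rule 3: 1/max_ratio <= len_src/len_tgt <= max_ratio, written without division.
    return len_src <= max_ratio * len_tgt and len_tgt <= max_ratio * len_src
```

For example, `keep_pair("Hello world", "Hallo Welt")` is accepted, while `keep_pair("1 apple", "Ein Apfel")` is rejected because the source side starts with a digit.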