hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
https://pypi.org/project/opuscleaner/

Batch filter chain management #37

Open jelmervdl opened 1 year ago

jelmervdl commented 1 year ago

If you have many datasets, I suspect you'd want to apply the same filter steps to quite a few of them.

It would be helpful if we could provide some support for that: either copying over the filter configuration from a different dataset, or copying one over to another dataset. Or some other way? Maybe diffing filter step configurations at some point?
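Concretely, since each dataset's chain lives in a <name>.filters.json file whose 'files' key points at the source and target files (see the script further down in this thread), copying could be as small as the sketch below. The helper name and retargeting behaviour are just my assumption, not existing OpusCleaner code:

import json

def copy_filter_chain(src_json: str, dst_json: str, dst_files: list) -> None:
    """Copy the filter steps of one dataset's .filters.json to another,
    retargeting the 'files' entry at the destination dataset."""
    with open(src_json, 'r', encoding='utf-8') as fh:
        config = json.load(fh)
    config['files'] = dst_files  # point the copied chain at the new data files
    with open(dst_json, 'w', encoding='utf-8') as fh:
        json.dump(config, fh, ensure_ascii=False, indent=2)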

I don't know yet what this should look like. I'm searching for an analogy, or a different sort of program that does something similar. I'd expect a 3D program to have something like this for managing different materials, maybe? Or Adobe Lightroom and photo editing effects?

jelmervdl commented 1 year ago

Kenneth mentioned that some rules are applicable to all languages in a dataset, while others are language-specific. Some way of capturing this concept of an "axis" in the filter chain management would be great.
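To sketch what that axis could look like: each step might carry an optional language scope, with unscoped steps applying to every pair. The "languages" key and the helper below are purely illustrative, not part of the current filters.json format:

# Hypothetical: steps without a "languages" key apply to every language pair,
# scoped steps only to pairs involving one of the listed languages.
chain = [
    {"filter": "deduplicate", "parameters": {}},
    {"filter": "fix_quotes", "parameters": {}, "languages": ["fr"]},
]

def steps_for(pair: str, chain: list) -> list:
    """Select the steps that apply to a language pair like 'en-fr'."""
    langs = set(pair.split('-'))
    return [step for step in chain
            if "languages" not in step or langs & set(step["languages"])]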

XapaJIaMnu commented 1 year ago

I wrote a quick program that does that.

#!/usr/bin/env python3
"""Given a filter, replicates that filter for the rest of the datasets"""
import sys
import os
import json
from copy import deepcopy
from typing import Dict, Tuple

if len(sys.argv) != 3:
    print("Usage:", sys.argv[0], "path_to_filters.json path_to_dataset_dir")
    sys.exit(1)

# Map each filter file name to the (src, trg) data files it applies to
files: Dict[str, Tuple[str, str]] = {}
# Get all files and remove the ones from the list that have filters already applied
all_files = [file for file in os.listdir(sys.argv[2]) if file[0] != '.' and file != 'categories.json']
all_files.sort()
all_files_copy = list(all_files)  # copy so we can remove from all_files while iterating
for file in all_files_copy:
    if file.endswith('.filters.json'):
        # Reconstruct the names of the data files this filter belongs to
        filenamebase = file.split('filters.json')[0]  # keeps the trailing dot
        language_pair = file.split('.')[-3]
        src, trg = language_pair.split('-')
        srcfile = filenamebase + src + '.gz'
        trgfile = filenamebase + trg + '.gz'
        all_files.remove(srcfile)
        all_files.remove(trgfile)
        all_files.remove(file)
# Now put them in a convenient list
for file in all_files:
    # Derive the filter file name for this dataset
    filternamebase = ".".join(file.split('.')[:-2])  # strip the language code and .gz suffix
    filtername = filternamebase + '.filters.json'
    if filtername in files:
        continue
    src, trg = filternamebase.split('.')[-1].split('-')
    srcfile = filternamebase + '.' + src + '.gz'
    trgfile = filternamebase + '.' + trg + '.gz'
    files[filtername] = (srcfile, trgfile)

# Now generate the json
with open(sys.argv[1], 'r', encoding='utf-8') as readfile:
    schema = json.load(readfile)

for filtername, (srcfile, trgfile) in files.items():
    new_schema = deepcopy(schema)
    new_schema['files'] = [srcfile, trgfile]  # retarget the copied chain at this dataset
    with open(os.path.join(sys.argv[2], filtername), 'w', encoding='utf-8') as outfile:
        json.dump(new_schema, outfile, ensure_ascii=False, indent=2)
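Assuming it's saved as replicate_filters.py (the name is arbitrary), usage would be something like:

./replicate_filters.py data/train-parts/some-corpus.en-de.filters.json data/train-parts/

It writes a sibling .filters.json for every dataset in the directory that doesn't have one yet, with the copied chain pointing at that dataset's own .gz files.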

Maybe include it in utils?