Open jelmervdl opened 1 year ago
Kenneth mentioned that some rules apply to all languages in a dataset, while others are language-specific. It would be great to have some notion of an "axis" in the filter chain management to capture this.
I wrote a quick program for this: given one filter configuration, it replicates it for the rest of the datasets in a directory.
#!/usr/bin/env python3
"""Given a filter, replicates that filter for the rest of the datasets"""
import sys
import os
from copy import deepcopy
from typing import Dict, Tuple
import json

if len(sys.argv) != 3:
    print("Usage:", sys.argv[0], "path_to_filters.json path_to_dataset_dir", file=sys.stderr)
    sys.exit(1)

# Map each filter config to write onto its (source, target) data files
files: Dict[str, Tuple[str, str]] = dict()

# Get all files and remove the ones from the list that have filters already applied
all_files = [file for file in os.listdir(sys.argv[2]) if file[0] != '.' and file != 'categories.json']
all_files.sort()
for file in list(all_files):  # iterate over a copy, since all_files shrinks below
    if file.endswith('.filters.json'):
        # Reconstruct the data filenames: name.en-de.filters.json -> name.en-de.{en,de}.gz
        filenamebase = file.split('filters.json')[0]
        language_pair = file.split('.')[-3]
        src, trg = language_pair.split('-')
        srcfile = filenamebase + src + '.gz'
        trgfile = filenamebase + trg + '.gz'
        # Only remove what is actually listed; a data file may be missing
        for name in (srcfile, trgfile, file):
            if name in all_files:
                all_files.remove(name)

# Now put them in a convenient list
for file in all_files:
    if not file.endswith('.gz'):
        continue  # skip anything that is not a data file
    # Get the filters name
    filternamebase = ".".join(file.split('.')[:-2])  # strip the "<lang>.gz" suffix
    filtername = filternamebase + '.filters.json'
    if filtername in files:
        continue
    src, trg = filternamebase.split('.')[-1].split('-')
    srcfile = filternamebase + '.' + src + '.gz'
    trgfile = filternamebase + '.' + trg + '.gz'
    files[filtername] = (srcfile, trgfile)

# Now generate the json
with open(sys.argv[1], 'r', encoding='utf-8') as readfile:
    schema = json.load(readfile)

for filtername, (srcfile, trgfile) in files.items():
    new_schema = deepcopy(schema)
    new_schema['files'] = [srcfile, trgfile]
    with open(os.path.join(sys.argv[2], filtername), 'w', encoding='utf-8') as outfile:
        json.dump(new_schema, outfile, ensure_ascii=False, indent=2)
Maybe include it in utils?
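If it does go into utils, the filename logic is probably the part worth isolating, so it can be tested without touching the disk. A minimal sketch, assuming the `<name>.<src>-<trg>.<lang>.gz` naming that the script relies on; `missing_filter_configs` is a hypothetical name, not an existing function:

```python
from typing import Dict, Iterable, Tuple


def missing_filter_configs(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each filter config that still has to be written to its data files.

    Assumes data files are named `<name>.<src>-<trg>.<lang>.gz` and filter
    configs `<name>.<src>-<trg>.filters.json`.
    """
    names = set(filenames)
    todo: Dict[str, Tuple[str, str]] = {}
    for name in sorted(names):
        if not name.endswith(".gz"):
            continue  # not a data file
        base = ".".join(name.split(".")[:-2])  # drop `<lang>.gz`
        config = base + ".filters.json"
        if config in names or config in todo:
            continue  # already configured, or already scheduled
        src, trg = base.split(".")[-1].split("-")
        todo[config] = (f"{base}.{src}.gz", f"{base}.{trg}.gz")
    return todo


# Made-up directory listing for illustration
listing = [
    "categories.json",
    "news.en-de.en.gz", "news.en-de.de.gz", "news.en-de.filters.json",
    "wiki.en-fr.en.gz", "wiki.en-fr.fr.gz",
]
print(missing_filter_configs(listing))
```

Here only the `wiki.en-fr` pair lacks a configuration, so it is the only entry returned. Names that do not follow the convention (for example a pair without a hyphen) would raise a `ValueError`, which a real utility would need to handle.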
If you have many datasets, you'd probably want to apply the same filter steps to quite a few of them.
It would be helpful if we could provide some support for that: either copying over the filter configuration from a different dataset, or copying the current one to another dataset. Or some other way? Maybe diffing filter step configurations at some point?
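For the diffing idea, one rough sketch is to compare two configurations as text after dropping the `files` entry, which differs per dataset by design. This assumes nothing about the schema beyond the `files` key that the script writes; `diff_filter_configs`, the `filters` list, and the `max_length` step in the example are made up for illustration:

```python
import difflib
import json
from typing import Any, Dict, List


def diff_filter_configs(a: Dict[str, Any], b: Dict[str, Any],
                        a_name: str = "a", b_name: str = "b") -> List[str]:
    """Return a unified diff of two filter configs, ignoring their `files`."""
    def normalise(config: Dict[str, Any]) -> List[str]:
        # `files` always differs between datasets, so leave it out of the diff
        stripped = {key: value for key, value in config.items() if key != "files"}
        text = json.dumps(stripped, indent=2, sort_keys=True, ensure_ascii=False)
        return text.splitlines()

    return list(difflib.unified_diff(normalise(a), normalise(b),
                                     fromfile=a_name, tofile=b_name, lineterm=""))


# Hypothetical configs: only the `files` key is taken from the script
news = {"files": ["news.en-de.en.gz", "news.en-de.de.gz"],
        "filters": [{"filter": "max_length", "parameters": {"MAXLENGTH": 150}}]}
wiki = {"files": ["wiki.en-de.en.gz", "wiki.en-de.de.gz"],
        "filters": [{"filter": "max_length", "parameters": {"MAXLENGTH": 200}}]}

print("\n".join(diff_filter_configs(news, wiki, "news", "wiki")))
```

An empty result means the two datasets share the same steps, which would also be a cheap way to check whether a copied configuration has drifted since it was copied.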
I don't know yet what this should look like. I'm searching for an analogy, or a different sort of program that does something similar. I'd expect a 3D program to have something like this for managing different materials, maybe? Or Adobe Lightroom with its photo editing effects?