Closed jrogu closed 1 month ago
Hey @jrogu, thanks for this, this is a great starting point. I think I'll make a few changes before I merge this:
if version in ['latest', 'GIT_COMMIT_SHA']:
I can make these changes this evening or I am happy for you to do some of it :)
Hey, I merged the multiple JSONs for each dataset, so now there are only 3 JSONs, one per version.
I also fixed load_data and load_single to work with the added changes.
I'm not sure I understand exactly how you want the versioning with GIT_COMMIT_SHA to look, so I think I will leave that to you :)
Great work! I made some changes to do the following:
Would you be able to share the code you used to make these JSONs? Think it might be useful for future versions. Once we have enough versions, I can also investigate a diff system where we only store the tasks that changed since the last version.
Amazing final commit, @mxbi.
Here's the code you asked for. It assumes you have already downloaded the data folder from the repo; if you want the full code including getting the data, just ping me and I'll add it here as well.
import os
import json

training_dir = ''  # Path to the folder with training JSONs
evaluation_dir = ''  # Path to the folder with evaluation JSONs
output_file = ''  # Output path

def transform_json_files(directory):
    data = {}
    for filename in os.listdir(directory):
        if filename.endswith(".json"):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r') as file:
                file_id = os.path.splitext(filename)[0]
                file_data = json.load(file)
                data[file_id] = {"train": file_data.get("train", []),
                                 "test": file_data.get("test", [])}
    return data

def merge_json_files(training_dir, evaluation_dir, output_file):
    training_data = transform_json_files(training_dir)
    evaluation_data = transform_json_files(evaluation_dir)
    merged_data = {
        "train": training_data,
        "eval": evaluation_data
    }
    with open(output_file, 'w') as outfile:
        json.dump(merged_data, outfile, separators=(',', ':'))

merge_json_files(training_dir, evaluation_dir, output_file)
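For a quick sanity check of a file produced by the script above, something like the following could be used. It is only a sketch: the helper name summarize_merged is hypothetical, and it relies solely on the layout the script writes (top-level "train" and "eval" keys, each mapping a task ID to its "train" and "test" lists).

```python
import json

def summarize_merged(path):
    """Return {split: {task_id: (n_train_pairs, n_test_pairs)}} for a merged file.

    Hypothetical helper; it only assumes the layout written by
    merge_json_files: {"train": {...}, "eval": {...}}.
    """
    with open(path, "r") as f:
        merged = json.load(f)
    summary = {}
    for split in ("train", "eval"):
        tasks = merged.get(split, {})
        summary[split] = {
            task_id: (len(task.get("train", [])), len(task.get("test", [])))
            for task_id, task in tasks.items()
        }
    return summary
```

Comparing the number of task IDs per split against the number of source files is a cheap way to confirm nothing was dropped during the merge.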
The diff system would be amazing. Maybe a good idea would be storing the differences of each version with respect to the latest?
Thanks!
> Maybe a good idea would be storing the differences of each version with respect to the latest?
Yes, I agree. Basically, the way I think about it is we have the following requirements:
load_data(), load_single() (to a lesser extent).
Currently we have 1 and 2 (the entire package is a 1MB download and 5MB installed), so I think it is okay until we get a large number of versions. Diffing from latest gives us 1 and 2. Storing each task in a separate file gives us 2 and 3 (but not 1).
I'm trying to think of a decent scheme that gives us all three. It's not that important, though: it is very unlikely that anyone would notice the difference! :)
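The "diff from latest" idea discussed above could be sketched roughly as follows. This is not code from the package; the function names and the {"changed", "removed"} diff layout are assumptions made for illustration. Each version is treated as a dict mapping task ID to task data.

```python
def diff_against_latest(latest, other):
    """Return what is needed to rebuild `other` from `latest`.

    Hypothetical layout: tasks that are new or different in `other` are
    stored in full; tasks only present in `latest` are listed as removed.
    """
    changed = {
        task_id: task
        for task_id, task in other.items()
        if task_id not in latest or latest[task_id] != task
    }
    removed = sorted(task_id for task_id in latest if task_id not in other)
    return {"changed": changed, "removed": removed}

def apply_diff(latest, diff):
    """Rebuild the other version from `latest` plus a stored diff."""
    removed = set(diff["removed"])
    rebuilt = {tid: task for tid, task in latest.items() if tid not in removed}
    rebuilt.update(diff["changed"])
    return rebuilt
```

With this layout only the tasks that differ from latest are stored per version, so near-identical versions add little data. The trade-off is that loading a non-latest version requires reading latest first and then applying the diff.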
Hey @mxbi,
As discussed in #4, I have added the additional data you pointed out. There are currently 3 datasets.
Changes made:
load_data function now includes a version argument with options 'latest', 'kaggle', and 'original'.
load_single, chosen for its robustness.
Currently I added 3 whole datasets, but as you said, there are ways to make it more efficient since the datasets are almost the same.
Let me know what you think, or if there is anything to add or fix!
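To illustrate the version argument described above, here is a minimal sketch of how a version name might be validated and mapped to one of the 3 JSON files. The helper name resolve_version_path, the data_dir parameter, and the arc_<version>.json file naming are all assumptions for illustration, not the package's actual layout or API.

```python
import os

# The three options named in this PR.
SUPPORTED_VERSIONS = ("latest", "kaggle", "original")

def resolve_version_path(version="latest", data_dir="data"):
    """Validate `version` and return the path of its merged JSON file.

    Hypothetical helper: the file naming scheme below is an assumption.
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"Unknown version {version!r}; expected one of {SUPPORTED_VERSIONS}"
        )
    return os.path.join(data_dir, f"arc_{version}.json")
```

Failing early with a ValueError on an unknown version gives a clearer message than a missing-file error raised later during loading.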