Add additional datasets

jrogu commented 2 months ago

Hey @mxbi,

As discussed at #4, I have added additional data as you pointed out. There are currently 3 datasets.

Changes made:

load_data function now includes a version argument with options 'latest', 'kaggle', and 'original'.
Implemented retrieval of task using id from arcagi in load_single, chosen for its robustness.

Currently I added 3 whole datasets, but as you said, there are ways to make it more efficient since the datasets are almost the same.

Let me know what you think/ if there is anything to add/fix!

mxbi commented 2 months ago

Hey @jrogu, thanks for this, this is a great starting point. I think I'll make a few changes before I merge this:

Merging the many files into a single JSON for the dataset. Having many small files is not ideal for performance. (although it does make loading one task at a time faster, so there is a tradeoff here).
With the versions, allowing multiple different references to the same state. So someone can e.g. switch from latest to a specific hash to pin the version they're using:
```
if version in ['latest', 'GIT_COMMIT_SHA']:
```
Having an exception if the requested version doesn't exist.

I can make these changes this evening or I am happy for you to do some of it :)

jrogu commented 2 months ago

Hey, I merged multiple JSONs for each dataset, now there are only 3 JSONs, each for different version.

I also fixed load_data load_single to work with added changes.

I'm not sure if I understand exactly how do you want the versioning with GIT_COMMIT_SHA to look like, so I think I will leave it for you :)

mxbi commented 1 month ago

Great work! I made some changes to do the following:

Make the new JSON files appear in the python package when installed
Refactor the getting of the right json into a separate function
Added some alternative names for the versions (which will always be pinned, but one can specify a latest version).
Remove spaces from the original JSON to save 1MB

Would you be able to share the code you used to make these JSONs? Think it might be useful for future versions. Once we have enough versions, I can also investigate a diff system where we only store the tasks that changed since the last version.

jrogu commented 1 month ago

Amazing final commit, @mxbi.

Here's the code you asked for: (assuming already downloaded data folder from repo, if you want the full code including getting the data, just ping me and I'll add it here as well)

import os
import json

training_dir = '' # Path to the folder with training JSONs
evaluation_dir = '' # Path to the folder with evaluation JSONs
output_file = '' # Output path

def transform_json_files(directory):
    data = {}
    for filename in os.listdir(directory):
        if filename.endswith(".json"):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r') as file:
                file_id = os.path.splitext(filename)[0]
                file_data = json.load(file)
                data[file_id] = {"train": file_data.get("train", []),
                                 "test": file_data.get("test", [])}
    return data

def merge_json_files(training_dir, evaluation_dir, output_file):
    training_data = transform_json_files(training_dir)
    evaluation_data = transform_json_files(evaluation_dir)

    merged_data = {
        "train": training_data,
        "eval": evaluation_data
    }

    with open(output_file, 'w') as outfile:
        json.dump(merged_data, outfile, separators=(',', ':'))

merge_json_files(training_dir, evaluation_dir, output_file)

The diff system would be amazing. Maybe a good idea would be storing the differences of each version with respect to the latest?

mxbi commented 1 month ago

Thanks!

Maybe a good idea would be storing the differences of each version with respect to the latest?

Yes, I agree. Basically, the way I think about it is we have the following requirements:

Loading the whole dataset fast with load_data()
Keeping the package size small
Loading individual tasks fast with load_single() (to a lesser extent).

Currently we have 1 and 2 (the entire package is 1MB download and 5MB installed), so I think it is okay until we get a large number of versions. Diffing from latest gives us 1 and 2. Storing each task in a separate file gives us 2 and 3 (but not 1).

I'm trying to think of a decent scheme that gives us all three. Although it's not that important, it is very unlikely that anyone would notice the difference! :)

mxbi / arckit

Add additional datasets #6