mxbi / arckit

Tools for working with the Abstraction & Reasoning Corpus
Apache License 2.0

Add additional datasets #6

Closed: jrogu closed this 1 month ago

jrogu commented 2 months ago

Hey @mxbi,

As discussed in #4, I've added the additional data you pointed out. There are currently 3 datasets.

Changes made:

For now I've added all 3 datasets in full, but as you said, there are ways to make this more efficient since the datasets are almost identical.

Let me know what you think and whether there is anything to add or fix!

mxbi commented 2 months ago

Hey @jrogu, thanks for this; it's a great starting point. I think I'll make a few changes before I merge:

I can make these changes this evening, or I'm happy for you to do some of them :)

jrogu commented 2 months ago

Hey, I merged the multiple JSONs for each dataset, so now there are only 3 JSONs, one per version.

I also fixed load_data and load_single to work with these changes.

I'm not sure I understand exactly how you want the versioning with GIT_COMMIT_SHA to look, so I think I will leave that for you :)
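
For reference, loading should look roughly like this with the merged files (just a quick sketch; the version argument name and label are placeholders for whatever we settle on, and the task ID is only an example):

import arckit

# Load the default (latest) dataset version
train_set, eval_set = arckit.load_data()

# Load a specific dataset version -- the argument name and version
# label here are placeholders, not the final API
train_set_v1, eval_set_v1 = arckit.load_data(version='v1')

# Load a single task by ID (IDs come from the original JSON filenames)
task = arckit.load_single('007bbfb7')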

mxbi commented 1 month ago

Great work! I made some changes to do the following:

Would you be able to share the code you used to make these JSONs? Think it might be useful for future versions. Once we have enough versions, I can also investigate a diff system where we only store the tasks that changed since the last version.
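
For the diff side, here's roughly the kind of thing I have in mind, just as a sketch against the merged {"train": ..., "eval": ...} layout (nothing final):

import json

def diff_versions(old_path, new_path):
    """Return the tasks that were added or changed going from the merged
    JSON at old_path to the one at new_path, plus the IDs of removed tasks."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)

    diff = {"changed": {}, "removed": {}}
    for split in ("train", "eval"):
        old_split, new_split = old.get(split, {}), new.get(split, {})
        # Tasks that are new in new_path, or whose contents differ
        diff["changed"][split] = {
            task_id: task
            for task_id, task in new_split.items()
            if old_split.get(task_id) != task
        }
        # Tasks that exist in old_path but not in new_path
        diff["removed"][split] = sorted(set(old_split) - set(new_split))
    return diff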

jrogu commented 1 month ago

Amazing final commit, @mxbi.

Here's the code you asked for (this assumes the data folder from the repo has already been downloaded; if you want the full code including fetching the data, just ping me and I'll add it here as well):

import os
import json

training_dir = '' # Path to the folder with training JSONs
evaluation_dir = '' # Path to the folder with evaluation JSONs
output_file = '' # Output path

def transform_json_files(directory):
    """Read every .json task file in `directory`, keyed by filename (task ID)."""
    data = {}
    for filename in os.listdir(directory):
        if filename.endswith(".json"):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r') as file:
                file_id = os.path.splitext(filename)[0]  # task ID = filename without extension
                file_data = json.load(file)
                data[file_id] = {"train": file_data.get("train", []),
                                 "test": file_data.get("test", [])}
    return data

def merge_json_files(training_dir, evaluation_dir, output_file):
    """Combine the training and evaluation splits into a single merged JSON."""
    training_data = transform_json_files(training_dir)
    evaluation_data = transform_json_files(evaluation_dir)

    merged_data = {
        "train": training_data,
        "eval": evaluation_data
    }

    # Compact separators keep the output file as small as possible
    with open(output_file, 'w') as outfile:
        json.dump(merged_data, outfile, separators=(',', ':'))

merge_json_files(training_dir, evaluation_dir, output_file)
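
A quick way to sanity-check the output after running it:

with open(output_file) as f:
    merged = json.load(f)

# Number of tasks in each split of the merged file
print(len(merged["train"]), "training tasks")
print(len(merged["eval"]), "evaluation tasks")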

The diff system would be amazing. Maybe a good idea would be storing the differences of each version with respect to the latest?

mxbi commented 1 month ago

Thanks!

> Maybe a good idea would be storing the differences of each version with respect to the latest?

Yes, I agree. Basically, the way I think about it is we have the following requirements:

  1. Loading the whole dataset fast with load_data()
  2. Keeping the package size small
  3. Loading individual tasks fast with load_single() (to a lesser extent).

Currently we have 1 and 2 (the entire package is 1MB download and 5MB installed), so I think it is okay until we get a large number of versions. Diffing from latest gives us 1 and 2. Storing each task in a separate file gives us 2 and 3 (but not 1).

I'm trying to think of a decent scheme that gives us all three, although it's not that important: it's very unlikely that anyone would notice the difference! :)
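
For concreteness, here's a rough sketch of how loading could work under the diff-from-latest idea, where each older version ships as a small JSON of changed/removed tasks relative to the latest full file (the layout and names here are just illustrative, not a final design):

import json

def load_version(latest_path, diff_path=None):
    """Load a merged dataset JSON. With no diff this is a single file read
    (requirement 1); older versions are reconstructed by applying a small
    diff to the latest file, so the package only ships one full JSON plus
    diffs (requirement 2)."""
    with open(latest_path) as f:
        data = json.load(f)
    if diff_path is None:
        return data

    with open(diff_path) as f:
        diff = json.load(f)
    for split in ("train", "eval"):
        # The diff holds the older version's content where it differs from
        # the latest, plus the IDs of tasks the older version doesn't have
        data[split].update(diff.get("changed", {}).get(split, {}))
        for task_id in diff.get("removed", {}).get(split, []):
            data[split].pop(task_id, None)
    return data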