update load_data for new dataset

jrogu commented 2 months ago

As noted in #3, there is a mismatch between your data and the recently published Kaggle dataset. I started adapting your code to the new structure of the data provided by Kaggle.

Executing the data.py script directly returns the same output, so everything seems to be working (the first 10 train tasks are the same across both datasets).

Let me know if it seems correct; if so, I can continue to work on the rest of the code.

mxbi commented 2 months ago

Hey @jrogu I missed this PR! Thank you for submitting it. If you would be open to re-submitting it we could add this in.

I was picturing a new optional version argument for load_data() and load_task(), which would allow multiple ways of referencing the data:

latest (default), arcagi, aa922be204204ec148a1137fe6ed4d34ddde812b for whatever data is in https://github.com/fchollet/ARC-AGI at the moment.
kaggle, 79427e110a6d35bab1224f1c5238695eb8a3169a which points to https://github.com/fchollet/ARC-AGI/tree/79427e110a6d35bab1224f1c5238695eb8a3169a and I believe this is the Kaggle dataset.
original, arc, which points to the original data here.
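A minimal sketch of how such a version argument might be resolved, assuming a plain alias table. The names VERSION_ALIASES and resolve_version are hypothetical and not part of arckit, and the choice of the commit hash (or "arc") as the canonical key is an assumption made only for illustration:

```python
# Hypothetical alias table built from the names listed above.
# Each user-facing name maps to one canonical key.
VERSION_ALIASES = {
    "latest": "aa922be204204ec148a1137fe6ed4d34ddde812b",
    "arcagi": "aa922be204204ec148a1137fe6ed4d34ddde812b",
    "kaggle": "79427e110a6d35bab1224f1c5238695eb8a3169a",
    "original": "arc",
}


def resolve_version(version="latest"):
    """Return the canonical key for an alias, or for a canonical key itself."""
    if version in VERSION_ALIASES:
        return VERSION_ALIASES[version]
    if version in VERSION_ALIASES.values():
        # Already canonical (a commit hash or "arc").
        return version
    known = sorted(set(VERSION_ALIASES) | set(VERSION_ALIASES.values()))
    raise ValueError(f"Unknown data version {version!r}; expected one of {known}")
```

With a helper like this, load_data() and load_task() could accept version="kaggle" or the full commit hash and look up the same bundled JSON file. Since "latest" tracks whatever is in the upstream repository, its hash would need updating whenever the bundled data is refreshed.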

Note that the kaggle dataset does not include one recent fix: https://www.kaggle.com/competitions/arc-prize-2024/discussion/513114#2879917

Let me know what you think. I'm happy to update the code, but it would be good to get your contribution in!

jrogu commented 2 months ago

Hey @mxbi, thanks for the detailed answer. Your idea is much nicer and more robust, and I'm happy to help with a new PR using the logic you just described.

If I understand correctly, you want the code to access the data from remote repositories rather than store a downloaded copy in the library? If so, should the first request for the data save it locally for future access?

mxbi commented 2 months ago

A PR for this would be fantastic.

I think we can still keep the JSON in the repository (the idea is that someone downloading an ARC-specific package would be happy to have a few megabytes of ARC data on their system). We can have three versions for now.

If the number of versions grows large in the future, we could think about storing diffs or downloading non-default versions on demand. We can already save 50% by minifying the existing JSON, for example.
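A rough sketch of the minification idea, using only the standard library. The function name minify_json and the tiny sample task are illustrative assumptions, and the actual saving depends on how the existing files are formatted:

```python
import json


def minify_json(text):
    """Re-serialize JSON text with no whitespace between tokens."""
    return json.dumps(json.loads(text), separators=(",", ":"))


# Tiny task-shaped sample (made up for illustration), pretty-printed on purpose.
sample = {
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test": [{"input": [[0, 0], [1, 1]], "output": [[1, 1], [0, 0]]}],
}
pretty = json.dumps(sample, indent=4)
small = minify_json(pretty)

# The data is unchanged; only the whitespace is gone.
print(len(pretty), "->", len(small), "characters")
```

Because the grids are nested lists of single digits, most of the bytes in an indented file are whitespace, which is why minifying can help noticeably here.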

mxbi / arckit

update load_data for new dataset #4