mxbi / arckit

Tools for working with the Abstraction & Reasoning Corpus
Apache License 2.0

update load_data for new dataset #4

Closed jrogu closed 2 months ago

jrogu commented 2 months ago

As noted in #3, there is a mismatch between your data and the recently published Kaggle dataset. I started adapting your code to the new data structure provided by Kaggle.

Executing the data.py script directly returns the same output as before, so everything seems to be working (the first 10 train tasks are the same across both datasets).

Let me know if this looks correct. If so, I can continue working on the rest of the code.

mxbi commented 2 months ago

Hey @jrogu, I missed this PR! Thank you for submitting it. If you would be open to re-submitting it, we could add this in.

I was picturing a new optional version argument for load_data() and load_task(), which would allow multiple ways of referencing the data:

Note that the Kaggle dataset does not include the latest fix: https://www.kaggle.com/competitions/arc-prize-2024/discussion/513114#2879917

Let me know what you think. I'm happy to update the code myself, but it would be good to get your contribution in!

jrogu commented 2 months ago

Hey @mxbi, thanks for the detailed answer. Your idea is much nicer and more robust, and I'm happy to help with a new PR using the logic you just described.

If I understand correctly, you want the code to fetch the data from remote repositories rather than store a downloaded copy in the library? If so, should the first fetch save the data locally for future access?
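To make the question concrete, here is a minimal sketch of the "download once, then reuse" behaviour. It is only an illustration: `get_cached`, `fetch_url`, and the cache path are hypothetical names and are not part of arckit.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Download the raw bytes at `url` (hypothetical default fetcher)."""
    with urlopen(url) as resp:
        return resp.read()


def get_cached(url, cache_path, fetch=fetch_url):
    """Return the bytes for `url`, downloading only if no cached copy exists.

    `fetch` is injectable so the caching logic can be exercised offline.
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        # First access: download and store the file for future calls.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(fetch(url))
    return cache_path.read_bytes()
```

With this shape, only the first call for a given `cache_path` touches the network; later calls read the saved file.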

mxbi commented 2 months ago

A PR for this would be fantastic.

I think we can still keep the JSON in the repository (the idea is that someone downloading an ARC-specific package would be happy to have a few megabytes of ARC data on their system). We can have three versions for now.
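As a rough sketch of how a version argument could select between JSON files bundled in the repository: the version labels, file names, and the `load_raw_data` helper below are all assumptions for illustration, not arckit's actual API or the names proposed in this thread.

```python
import json
from pathlib import Path

# Hypothetical version labels and file names (not taken from arckit).
VERSION_FILES = {
    "original": "arc_original.json",
    "kaggle": "arc_kaggle.json",
    "latest": "arc_latest.json",
}
DEFAULT_VERSION = "latest"


def load_raw_data(version=DEFAULT_VERSION, data_dir="data"):
    """Load the bundled JSON file for the requested dataset version."""
    try:
        filename = VERSION_FILES[version]
    except KeyError:
        known = ", ".join(sorted(VERSION_FILES))
        raise ValueError(
            f"Unknown version {version!r}; expected one of: {known}"
        ) from None
    with open(Path(data_dir) / filename, encoding="utf-8") as f:
        return json.load(f)
```

Keeping the mapping in one dictionary means adding a fourth version later is a one-line change, and an unknown label fails with a message listing the valid ones.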

If the number of versions grows large in the future, we could think about storing diffs or downloading non-default versions on demand. We can already save 50% by minifying the existing JSON, for example.
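For reference, minifying can be done with the standard library alone. This is a minimal sketch; `minify_json` is a hypothetical helper name, and the actual saving depends on how the existing files are formatted.

```python
import json


def minify_json(text):
    """Re-serialize JSON text without any optional whitespace."""
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(json.loads(text), separators=(",", ":"))


# Small made-up grid in the pretty-printed style, for a size comparison.
pretty = json.dumps({"input": [[0, 1], [1, 0]]}, indent=4)
compact = minify_json(pretty)
print(len(pretty), "->", len(compact), "characters")
```

The parsed data is identical before and after, so loaders do not need to change.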