Publish datasets on HuggingFace

It would be nice if all of the datasets were published on HuggingFace alongside the models. The existing datasets can be used fine, but this would help with ease of use.

Relatedly, it might be good to also publish a precomputed dataset of state premises, i.e. the output of python retrieval/main.py predict on the pretrained model. For now, this is the only thing I've needed to clone the repo for, as everything else that I want to do can be accomplished with plain Transformers.

The idea that I'm looking to experiment with involves only finetuning the decoder, so I expect that I'll also end up with a dataset of precomputed state encodings to speed up training. For my purposes it would be nice if that were also available, but I expect that's too specific for you guys to have any reason to do.

Also, just to confirm, premise things aren't available for Lean4 right? It looks like only the Lean3 dataset has the annotated versions.

Thanks for the great work, I've had fun playing around with these models already and I think they're going to be very helpful for testing some ideas I've had for a while!

lean-dojo / ReProver

Publish datasets on HuggingFace #18