lean-dojo / ReProver

Retrieval-Augmented Theorem Provers for Lean
https://leandojo.org
MIT License
208 stars 44 forks source link

Publish datasets on HuggingFace #18

Closed SamPruden closed 1 year ago

SamPruden commented 1 year ago

It would be nice if all of the datasets were published on HuggingFace alongside the models. The existing datasets can be used fine, but this would help with ease of use.

Relatedly, it might be good to also publish a precomputed dataset of state premises, i.e. the output of python retrieval/main.py predict on the pretrained model. For now, this is the only thing I've needed to clone the repo for, as everything else that I want to do can be accomplished with plain Transformers.

The idea that I'm looking to experiment with involves only finetuning the decoder, so I expect that I'll also end up with a dataset of precomputed state encodings to speed up training. For my purposes it would be nice if that were also available, but I expect that's too specific for you guys to have any reason to do.

Also, just to confirm, premise things aren't available for Lean4 right? It looks like only the Lean3 dataset has the annotated versions.

Thanks for the great work, I've had fun playing around with these models already and I think they're going to be very helpful for testing some ideas I've had for a while!

SamPruden commented 1 year ago

My apologies, it appears this issue somehow got opened twice! I experienced a github error when posting, but apparently it submitted anyway without telling me, so I ended up posting twice.