lean-dojo / ReProver

Retrieval-Augmented Theorem Provers for Lean
https://leandojo.org
MIT License
218 stars 48 forks source link

Consider publishing datasets on HuggingFace #19

Closed SamPruden closed 1 year ago

SamPruden commented 1 year ago

It would be nice if all of the datasets were published on HuggingFace alongside the models. The existing datasets can be used fine, but this would help with ease of use.

Relatedly, it might be good to also publish a precomputed dataset of state premises. For now, this is the only thing I've needed to clone the repo for, as everything else that I want to do can be accomplished with plain Transformers.

The idea that I'm looking to experiment with involves only finetuning the decoder, so I expect that I'll also end up with a dataset of precomputed state encodings to speed up training. For my purposes it would be nice if that were also available, but I expect that's too specific for you guys to have any reason to do.

Also, just to confirm, premise things aren't available for Lean4 right? It looks like only the Lean3 dataset has the annotated versions.

Thanks for the great work, I've had fun playing around with these models already and I think they're going to be very helpful for testing some ideas I've had for a while!

yangky11 commented 1 year ago

Hi,

We don't plan to publish a precomputed dataset of premises due to the burden of maintenance.

Yes, premises are currently not available for Lean 4 but may become available in a few weeks.

SamPruden commented 1 year ago

Fair enough. I'm making one for my own use, so that may end up on HuggingFace if you have no objections.

I was also suggesting that your already published datasets be put on HuggingFace. Is that also something that you're not interested in? Would you object to somebody else posting them there if you don't?

yangky11 commented 1 year ago

We do not plan to put the dataset on Hugging Face. Please feel free to use the data in whatever way allowed by the license (CC-BY 2.0).