lean-dojo / LeanDojo

Tool for data extraction and interacting with Lean programmatically.
https://leandojo.org
MIT License
556 stars 87 forks source link

Unicode in data #209

Open timechess opened 2 weeks ago

timechess commented 2 weeks ago

I noticed that all the unicode characters in the dataset are ascii-encoded Seems like you didn't set ensure_ascii=False. I wonder if this will affect the performance of premise selection. After all, in practical applications, the unicode characters in the proof state are not ascii-encoded.

yangky11 commented 2 weeks ago

Thanks for noticing that. I'll try to re-run the experiments to see if the encoding makes a difference when I get a chance, though it's currently not our priority.