lean-dojo / LeanDojo

Tool for data extraction and interacting with Lean programmatically.
https://leandojo.org
MIT License
536 stars 80 forks source link

Add support for data extraction in Lean 4 & Fix known bugs #28

Closed Peiyang-Song closed 1 year ago

Peiyang-Song commented 1 year ago

Overview

This pull request adds support for data extraction in Lean 4, with all premise-related features enabled, and fixes known bugs related to data extraction functionalities. Below is a brief summary.

Premise Info Extraction

ExtractData.lean previously only extracts tactic states and ASTs from a Lean 4 file. This PR adds premise extraction, which is useful in building a dataset especially for retrieval-augmented LMs. Several important technical details:

  1. For each premise, this PR extracts its full names, premise positions of its occurrence, original module (source file) of its definition, and positions of its definition in its module.

  2. Previously, we were working in the IO Monad throughout the whole file. This PR switches to the more powerful monad (Meta Monad), because it gives access to environments. This is necessary because extracting premise info (and specifically its source information) requires to jump among files globally. One thing to keep in mind: in cases of future developments when we would like to change outputs, remember to locally downgrade Meta Monad to IO Monad (there is an example of how to do this in the main function of this file, after this PR).

  3. Extracting premise information takes much more time than the previous features we implemented. This is expected because jumping among files causes each file to be visited for more than once. The total runtime can increase from ~1h (w\o premise extraction) to ~6h (w\ premise extraction) on 32 CPUs.

  4. This PR sets maxHeartbeats to be 0, which is dangerous. This is temporary for development purpose.

Post-processing

Post-processing is done in Python with the main goal being matching information extracted by ExtractData.lean with the AST pretty printed, so that many more features can be supported by just traversing parts of AST.

  1. This PR adds function get_premise_full_names for Lean 4, under TracedTheorem class in traced_data.py. Named by use.

  2. This PR adds function get_premise_definitions for Lean 4, under TracedFile class in traced_data.py. Named by use.

  3. Main implementation is in _post_process_lean4 in traced_data.py and corresponding changes in lean4/node.py. One thing to keep in mind: the position encoding obtained from Python and from Lean4 are different: [start, end) in Python versus (start, end] in Lean4. A simple fix for this problem is just adding 1 to both start and end positions.

Fix of known bugs

In addition to premise-related features, we also fixed several known bugs / unimplemented features about data extraction in Lean 4

  1. This PR adds codes to support get_annotated_tactics feature for Lean 4, where we pair each raw tactic with an annotated version that elaborates on more information related to the premises being used inside the tactic. This is invaluable for training ML models for premise selection. Main implementation is under TracedTactic class in traced_data.py.

  2. This PR extracts import information from any Lean 4 file and then builds dependency graphs to store such information inside a traced repo. Main implementation is under TracedFile class in traced_data.py and _post_process_lean4 and build_dependency_graph. Also, this PR makes corresponding implementations in lean4/node.py. One thing to keep in mind: the nodes in the dependency graph was previously Path type from the Python Pathlib. This PR changes it to str type because of a discovered bug originated from different types of paths that Pathlib supports.

Implementation of scripts

This PR adds generate-benchmark-lean4.ipynb and demo-lean4.ipynb, showcasing new features for data extraction in Lean 4. Details are omitted due to discussions in those files themselves.

Minor

  1. This PR adds a directory simptest under tests, containing simple testing files made during development.

  2. This PR also solves some minor bugs / typos that are trivial.