Closed josojo closed 1 year ago
That's great, thanks! I've been a bit swamped with LeanInfer recently and will take a look when I get a chance.
@CodiumAI-Agent /review
๐ฏ Main theme: Improving the loading speed of traced repositories by implementing a recursive loading strategy.
๐ PR summary: This PR aims to improve the loading speed of traced repositories by loading only the necessary traced files. It introduces a new environment variable to control whether to use this recursive loading strategy. The changes are mainly in the 'traced_data.py' and 'constants.py' files.
๐ Type of PR: Enhancement
๐งช Relevant tests added: No
๐ Security concerns: No security concerns found
๐ก General suggestions: The PR is well-structured and the changes are clear. The addition of the environment variable to control the loading strategy is a good approach. However, it would be beneficial to add some tests to ensure that the new loading strategy works as expected and does not introduce any bugs.
๐ค Code feedback:
relevant file: src/lean_dojo/data_extraction/traced_data.py
suggestion: Consider handling the case where the XML file does not exist when loading traced dependencies recursively. This can be done by checking if the file exists before trying to load it. [important]
relevant line: '+ new_traced_file = TracedFile.from_xml(root_dir, xml_to_index, repo)'
relevant file: src/lean_dojo/data_extraction/traced_data.py
suggestion: It would be better to use os.path.join() for constructing file paths. This would ensure that the code is platform independent. [medium]
relevant line: '+ str(root_dir) + "/" + dep_path_str.replace("/src/", "/lib/")'
relevant file: src/lean_dojo/data_extraction/traced_data.py
suggestion: The assertion len(json_files) == self.traced_files_graph.number_of_nodes() is skipped when LOAD_TRACED_DEPENDENCIES_RECURSIVELY is True. If this assertion is important, consider finding a way to keep it for both cases. [medium]
relevant line: '+ if not LOAD_TRACED_DEPENDENCIES_RECURSIVELY:'
Tag me in a comment '@CodiumAI-Agent' and add one of the following commands: /review [-i]: Request a review of your Pull Request. For an incremental review, which only considers changes since the last review, include the '-i' option. /describe: Modify the PR title and description based on the contents of the PR. /improve [--extended]: Suggest improvements to the code in the PR. Extended mode employs several calls, and provides a more thorough feedback. /ask \<QUESTION>: Pose a question about the PR. /update_changelog: Update the changelog based on the PR's contents.
To edit any configuration parameter from configuration.toml, add --config_path=new_value For example: /review --pr_reviewer.extra_instructions="focus on the file: ..." To list the possible configuration parameters, use the /config command.
Merging to dev
and will update main
after testing.
@josojo Please see the comment in the code. Since it has been merged into dev
, I can make necessary changes in the code, but I want to check with you first.
@josojo Could you please take a look at https://github.com/lean-dojo/LeanDojo/pull/63 to see if it's reasonable? Thanks!
Currently, loading traced repos ( by running TracedRepo.load_from_disk) of simple repos like the https://github.com/yangky11/lean4-example takes quite some time on my machine.
I reduced the time by a factor of nearly 10x by loading only the needed traced files. This means we first only read the traced files of the main repo and then only the traced libraries that we really depend on when building the dependency graph.
This PR is quite important for me as it allows me to iterate much quicker on the code, as simple tests take much shorter. If we use the same tracing technique, then the unit tests would run on small machines at reasonable times. But this is for a future PR.
Since this recursive loading is not guaranteed to be faster(it's not so easy to parallelize), I put the recursive loading behind a FLAG.