skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

Enhance Python Code Text Splitting #23

Open skyl opened 1 week ago

Objective

Improve how Python code is split into chunks by experimenting with more sophisticated methods, such as splitting on the AST (Abstract Syntax Tree), or by reconfiguring the existing tools for better results.

Background

py/packages/corpora_ai/split.py currently uses the PythonCodeTextSplitter from the langchain_text_splitters library to split Python code. However, this may not be optimal: it tends to split code indiscriminately, so chunks may not line up with logical units of the code.
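To make the concern concrete, here is a minimal sketch of what "indiscriminate" splitting can look like. It does not use or reproduce PythonCodeTextSplitter; `naive_chunks` is a hypothetical stand-in that cuts purely by character count, and `parses` is a hypothetical helper that checks whether a chunk is valid Python on its own.

```python
import ast

SAMPLE = '''\
def add(a, b):
    total = a + b
    return total


def sub(a, b):
    return a - b
'''


def naive_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring syntax (stand-in only)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def parses(chunk: str) -> bool:
    """Return True if the chunk is syntactically valid Python on its own."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:
        return False


chunks = naive_chunks(SAMPLE, 40)
# Chunks that cut through a statement typically fail to parse.
broken = [c for c in chunks if not parses(c)]
```

A chunk that happens to parse is not necessarily a meaningful unit, so a check like `parses` is probably only a lower bar when comparing splitters.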

Task

  1. Research Alternatives:

    • Explore options for utilizing AST-based splitting to handle Python syntax more effectively.
    • Investigate other third-party libraries that offer advanced code splitting capabilities.
  2. Configuration:

    • Review the current configuration of PythonCodeTextSplitter and identify potential enhancements or settings that optimize its performance with Python code.
  3. Implementation:

    • Experiment with different text splitting mechanisms for Python code using AST or reconfigured existing methods.
    • Ensure the new method integrates seamlessly with the existing codebase.
  4. Testing and Comparison:

    • Develop test cases to validate the new splitting method against diverse Python code snippets.
    • Compare the results with the current method to evaluate improvements in clarity and logic separation.
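As a starting point for the AST experiment, the sketch below splits a module into one chunk per top-level statement using only the standard library `ast` module. The name `split_python_by_ast` is hypothetical and does not exist in corpora; the sketch assumes Python 3.8+ (for `end_lineno`), conventional line endings, and source that parses cleanly. It does not enforce a maximum chunk size, so a large class would still come out as one chunk.

```python
import ast


def split_python_by_ast(source: str) -> list[str]:
    """Split source into one chunk per top-level statement.

    Blank lines, comments, and decorators that precede a statement are
    attached to that statement's chunk, so joining the chunks gives
    back the original source.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks: list[str] = []
    prev_end = 0  # number of lines already assigned to a chunk

    for node in tree.body:
        end = node.end_lineno  # 1-based and inclusive, so usable as a slice end
        if end <= prev_end:
            # Statement shares a line with the previous one (e.g. "a = 1; b = 2").
            continue
        chunks.append("".join(lines[prev_end:end]))
        prev_end = end

    # Trailing comments or blank lines after the last statement.
    if prev_end < len(lines):
        tail = "".join(lines[prev_end:])
        if chunks:
            chunks[-1] += tail
        else:
            chunks.append(tail)
    return chunks


SAMPLE = '''\
import os

# Helper used below.
def add(a, b):
    return a + b


class Greeter:
    def hello(self):
        return "hi"
'''

ast_chunks = split_python_by_ast(SAMPLE)
```

Because every chunk ends on a statement boundary, each one should parse on its own, which gives the comparison step a simple property to test against the current splitter. Handling oversized chunks (for example by recursing into a class body) and files with syntax errors is left open here.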

Acceptance Criteria