mouimet-infinisoft / ibrain-cli

Python Codebase Vectorize AST 3 Levels Perspectives #12

Open mouimet-infinisoft opened 3 months ago

mouimet-infinisoft commented 3 months ago

Your approach to vectorizing the codebase using Abstract Syntax Trees (AST) from three different perspectives is an excellent idea. It provides a holistic view of the codebase, covering structural, functional, and implementation aspects. Here's a more detailed plan and implementation based on these three perspectives:

Step-by-Step Plan

  1. Setting Up the Environment

    Ensure you have the necessary packages installed (`ast` is part of the Python standard library, so it needs no pip install):

    pip install transformers torch supabase radon
  2. Perspective 1: File and Folder Structure

    This perspective involves analyzing the overall structure of the codebase, including files, folders, and imports.

    import os
    import ast
    from transformers import AutoTokenizer, AutoModel
    import torch

    # Initialize tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def analyze_structure(root_dir):
        """Walk the codebase and collect file paths plus each file's imports."""
        structure = []
        for subdir, _, files in os.walk(root_dir):
            for file in files:
                if file.endswith('.py'):
                    file_path = os.path.join(subdir, file)
                    structure.append(file_path)
                    with open(file_path, 'r', encoding='utf-8') as f:
                        tree = ast.parse(f.read(), filename=file_path)
                    # ast.dump() yields a readable string; bare nodes would
                    # stringify as unhelpful object reprs
                    imports = [ast.dump(node) for node in ast.walk(tree)
                               if isinstance(node, (ast.Import, ast.ImportFrom))]
                    structure.append(f"Imports in {file_path}: {imports}")
        return structure

    def structure_to_vector(structure):
        """Embed the textual structure summary via mean pooling of the last hidden state."""
        text_representation = "\n".join(map(str, structure))
        inputs = tokenizer(text_representation, return_tensors='pt', truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        vector = outputs.last_hidden_state.mean(dim=1).numpy()
        return vector

    root_dir = 'path/to/codebase'
    structure = analyze_structure(root_dir)
    structure_vector = structure_to_vector(structure)
  3. Perspective 2: Variables, Functions, and Classes

    This perspective focuses on the internal components of the code such as functions, classes, variables, and their relationships.

    def analyze_code_components(file_path):
        """Collect function, class, and variable names from a file's AST."""
        with open(file_path, 'r', encoding='utf-8') as f:
            tree = ast.parse(f.read(), filename=file_path)
        # Extract names rather than raw nodes so the text fed to the
        # tokenizer is meaningful instead of object reprs
        functions = [node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
        classes = [node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
        variables = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]
        return functions, classes, variables

    def components_to_vector(components):
        """Embed the component metadata the same way as the structure vector."""
        components_metadata = "\n".join(map(str, components))
        inputs = tokenizer(components_metadata, return_tensors='pt', truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        vector = outputs.last_hidden_state.mean(dim=1).numpy()
        return vector

    file_path = 'path/to/file.py'
    functions, classes, variables = analyze_code_components(file_path)
    components_vector = components_to_vector([functions, classes, variables])
  4. Perspective 3: Implementation Details

    This perspective analyzes control flow, cyclomatic complexity, and dependencies within the code.

    from radon.complexity import cc_visit
    from radon.metrics import h_visit

    def analyze_complexity(file_path):
        """Compute cyclomatic complexity and Halstead metrics with radon."""
        with open(file_path, 'r', encoding='utf-8') as f:
            code = f.read()
        complexity = cc_visit(code)        # per-function/class cyclomatic complexity
        halstead_metrics = h_visit(code)   # Halstead volume, difficulty, effort, etc.
        return complexity, halstead_metrics

    def complexity_to_vector(complexity_metrics):
        """Embed the stringified metrics like the other perspectives."""
        complexity_metadata = "\n".join(map(str, complexity_metrics))
        inputs = tokenizer(complexity_metadata, return_tensors='pt', truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        vector = outputs.last_hidden_state.mean(dim=1).numpy()
        return vector

    complexity, halstead_metrics = analyze_complexity(file_path)
    complexity_vector = complexity_to_vector([complexity, halstead_metrics])
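
    To run all three perspectives over a repository, the pieces above can be tied together in a small driver loop. This is a minimal sketch assuming the functions defined in steps 2 to 4; the helper name vectorize_file is illustrative:

    def vectorize_file(file_path):
        """Produce the per-file perspective vectors (components and complexity)."""
        functions, classes, variables = analyze_code_components(file_path)
        complexity, halstead_metrics = analyze_complexity(file_path)
        return {
            'components': components_to_vector([functions, classes, variables]),
            'complexity': complexity_to_vector([complexity, halstead_metrics]),
        }

    # One repo-wide structure vector, plus per-file vectors
    structure_vector = structure_to_vector(analyze_structure(root_dir))
    per_file_vectors = {}
    for subdir, _, files in os.walk(root_dir):
        for file in files:
            if file.endswith('.py'):
                path = os.path.join(subdir, file)
                per_file_vectors[path] = vectorize_file(path)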

Integrating with Supabase and pg_vector

  1. Set Up pg_vector on Supabase

    Ensure pg_vector is enabled on your Supabase instance:

    CREATE EXTENSION IF NOT EXISTS vector;
  2. Create a Table for Vectors

    Create a table to store the code vectors:

    CREATE TABLE code_vectors (
       id SERIAL PRIMARY KEY,
       file_path TEXT,
       perspective TEXT,
       vector vector(768) -- Adjust the dimension based on the vector size from your model
    );
  3. Store Vectors in Database

    Insert the vectors into the database using Supabase:

    import supabase

    url = 'https://your-supabase-url.supabase.co'
    key = 'your-supabase-key'
    client = supabase.create_client(url, key)

    def store_vector(file_path, perspective, vector):
        data = {
            'file_path': file_path,
            'perspective': perspective,
            # The model outputs shape (1, 768); flatten to a plain list
            # so pg_vector receives a single 768-dimensional vector
            'vector': vector.flatten().tolist()
        }
        response = client.table('code_vectors').insert(data).execute()
        return response

    # Store vectors for each perspective
    store_vector('path/to/file.py', 'structure', structure_vector)
    store_vector('path/to/file.py', 'components', components_vector)
    store_vector('path/to/file.py', 'complexity', complexity_vector)
  4. Search Using Vectors

    Implement search functionality using pg_vector's distance operator for similarity queries (for large tables, an ivfflat or hnsw index on the vector column speeds this up):

    SELECT * FROM code_vectors
    ORDER BY vector <-> '[your_vector_representation]'
    LIMIT 5;
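
    To run the same similarity query from Python, one option is to wrap the SELECT above in a Postgres function and call it through the Supabase client's rpc method. A minimal sketch, assuming a server-side function named match_code_vectors (hypothetical; none of the steps above create it, and its parameter names are illustrative):

    def search_similar(query_vector, perspective, limit=5):
        # 'match_code_vectors' is an assumed Postgres function that
        # wraps the ORDER BY vector <-> ... query shown above
        response = client.rpc('match_code_vectors', {
            'query_vector': query_vector.flatten().tolist(),
            'query_perspective': perspective,
            'match_limit': limit,
        }).execute()
        return response.data

    # e.g. find code whose structure resembles the current file's
    similar = search_similar(structure_vector, 'structure')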

Conclusion

Using AST to vectorize code from three perspectives (file and folder structure; variables, functions, and classes; and implementation details) provides a robust way to understand and analyze a codebase. Each perspective captures a different dimension of the code, enhancing the AI's contextual awareness. Storing these vectors with pg_vector in Supabase allows for efficient similarity searches, supporting powerful, context-aware suggestions and collaboration.

Further Enhancements

  1. Chunking Code: For large files, chunk the code into smaller segments before vectorization so each segment stays within the model's input token limit (512 tokens for BERT) and retrieval stays relevant; a sketch follows this list.
  2. Model Selection: Use models fine-tuned for code understanding, like OpenAI Codex or specialized BERT variants for code.
  3. Real-Time Updates: Implement a mechanism to update vectors in the database whenever the codebase changes.
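
As a sketch for the first enhancement, code can be split into overlapping token windows using the tokenizer already in use. The 510-token ceiling leaves room for BERT's special tokens, and the 64-token overlap is an illustrative choice:

    def chunk_code(code, max_tokens=510, overlap=64):
        """Split source text into overlapping token windows that fit the model."""
        token_ids = tokenizer.encode(code, add_special_tokens=False)
        chunks = []
        step = max_tokens - overlap
        for start in range(0, len(token_ids), step):
            window = token_ids[start:start + max_tokens]
            chunks.append(tokenizer.decode(window))
            if start + max_tokens >= len(token_ids):
                break
        return chunks

    # Each chunk is then embedded (and can be stored) as its own row
    with open('path/to/file.py', encoding='utf-8') as f:
        chunks = chunk_code(f.read())
    chunk_vectors = [structure_to_vector([c]) for c in chunks]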

By following this plan, you'll be able to build a sophisticated system for contextual code understanding and querying.