stakwork / sphinx-nav-fiber


[Lambda Function] Building a Knowledge Graph from a Go Codebase Using ASTs #2204

Open tomsmith8 opened 2 months ago

tomsmith8 commented 2 months ago

Task Overview

Codebase: https://github.com/stakwork/sphinx-tribes

Language: Go

Develop a tool or script, packageable as an AWS Lambda function, that analyzes an entire Go codebase and extracts its structural elements into JSON.

The tool should parse the code using Abstract Syntax Trees (ASTs) and generate nodes and edges in a predefined JSON format. The goal is to accurately extract various components of the codebase, including:

(Include any additional elements that ASTs can accurately extract and are valuable for understanding the codebase)

Strategy

  1. Setup and Tooling

Use Go's standard library packages (go/parser, go/ast, go/token) to parse the codebase, or use tree-sitter.

  2. Parsing the Codebase

Iterate over all relevant .go files in the codebase. Parse each file into an AST for detailed analysis.
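A minimal sketch of this step using only the standard library. The names `wantFile` and `ParseTree` are hypothetical, and skipping `_test.go` files, `vendor`, and hidden directories is an assumption about what "relevant" means here, not something the task specifies.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// wantFile reports whether path looks like a Go source file worth parsing.
// Excluding _test.go files is an assumption; adjust to taste.
func wantFile(path string) bool {
	return strings.HasSuffix(path, ".go") && !strings.HasSuffix(path, "_test.go")
}

// ParseTree walks root and parses every wanted file into an AST.
// The returned map is keyed by file path; the FileSet resolves positions.
func ParseTree(root string) (*token.FileSet, map[string]*ast.File, error) {
	fset := token.NewFileSet()
	files := make(map[string]*ast.File)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Skip vendored and hidden directories (assumed not relevant).
			name := d.Name()
			if path != root && (name == "vendor" || strings.HasPrefix(name, ".")) {
				return filepath.SkipDir
			}
			return nil
		}
		if !wantFile(path) {
			return nil
		}
		f, perr := parser.ParseFile(fset, path, nil, parser.ParseComments)
		if perr != nil {
			return fmt.Errorf("parse %s: %w", path, perr)
		}
		files[path] = f
		return nil
	})
	return fset, files, err
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	_, files, err := ParseTree(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("parsed %d files\n", len(files))
}
```

Failing on the first parse error is a design choice; a more forgiving variant could collect errors and continue with the files that did parse.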

  3. Extracting Code Elements

Traverse the ASTs to identify and extract the required node types. Capture details such as names, parameters, return types, variable declarations, and type definitions.
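As a rough illustration of the traversal, the sketch below pulls function and method declarations out of a single file, recording name, receiver, parameter types, return types, and line number. `FuncInfo`, `fieldTypes`, and `ExtractFuncs` are hypothetical names; types are rendered syntactically with `types.ExprString`, with no type checking.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// FuncInfo holds the details captured for one function or method.
type FuncInfo struct {
	Name     string
	Receiver string // empty for plain functions
	Params   []string
	Results  []string
	Line     int
}

// fieldTypes flattens a field list into one type string per parameter,
// so "a, b int" yields two "int" entries.
func fieldTypes(fl *ast.FieldList) []string {
	var out []string
	if fl == nil {
		return out
	}
	for _, f := range fl.List {
		t := types.ExprString(f.Type)
		n := len(f.Names)
		if n == 0 {
			n = 1 // unnamed parameter or result
		}
		for i := 0; i < n; i++ {
			out = append(out, t)
		}
	}
	return out
}

// ExtractFuncs parses src and returns its top-level function declarations.
func ExtractFuncs(filename, src string) ([]FuncInfo, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var funcs []FuncInfo
	for _, decl := range file.Decls {
		fd, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		info := FuncInfo{
			Name:    fd.Name.Name,
			Params:  fieldTypes(fd.Type.Params),
			Results: fieldTypes(fd.Type.Results),
			Line:    fset.Position(fd.Pos()).Line,
		}
		if fd.Recv != nil && len(fd.Recv.List) > 0 {
			info.Receiver = types.ExprString(fd.Recv.List[0].Type)
		}
		funcs = append(funcs, info)
	}
	return funcs, nil
}

func main() {
	src := "package shop\n\ntype Order struct{}\n\nfunc (o *Order) Total(rate float64) (float64, error) { return 0, nil }\n"
	funcs, err := ExtractFuncs("order.go", src)
	if err != nil {
		panic(err)
	}
	for _, f := range funcs {
		fmt.Printf("%+v\n", f)
	}
}
```

Type definitions and variable declarations could be handled the same way by switching on `*ast.GenDecl` and its `TypeSpec` and `ValueSpec` entries.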

  4. Building Nodes and Edges

Define nodes for each extracted element. Establish edges representing relationships between nodes (e.g., function calls, method receivers, type embeddings). Use a predefined JSON structure for consistency.

  5. Generating the Knowledge Graph

Compile the nodes and edges into a coherent knowledge graph. Output the graph in the specified JSON format.

  6. Validation and Testing

Verify the accuracy of the extracted data. Ensure that the knowledge graph correctly represents the codebase structure.
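One cheap structural check, sketched below under assumptions, is that every edge endpoint refers to a node that was actually emitted. `NodeRef`, `EdgeRef`, and `DanglingEdges` are hypothetical names, and identifying a node by its type plus name is a simplification: a real graph would probably need the package or file as part of the key.

```go
package main

import "fmt"

// NodeRef identifies a node by type and name (assumed to be unique here).
type NodeRef struct {
	Type string
	Name string
}

// EdgeRef is an edge between two node references.
type EdgeRef struct {
	Type   string
	Source NodeRef
	Target NodeRef
}

// DanglingEdges returns the edges whose source or target is not in nodes.
func DanglingEdges(nodes []NodeRef, edges []EdgeRef) []EdgeRef {
	known := make(map[NodeRef]bool, len(nodes))
	for _, n := range nodes {
		known[n] = true
	}
	var bad []EdgeRef
	for _, e := range edges {
		if !known[e.Source] || !known[e.Target] {
			bad = append(bad, e)
		}
	}
	return bad
}

func main() {
	calc := NodeRef{Type: "Function", Name: "CalculateTotal"}
	disc := NodeRef{Type: "Function", Name: "ApplyDiscount"}
	edges := []EdgeRef{{Type: "CALLS", Source: calc, Target: disc}}
	fmt.Println("dangling edges:", len(DanglingEdges([]NodeRef{calc, disc}, edges)))
}
```

This only catches internal inconsistency. Checking that the graph matches the codebase would still need spot checks against known functions and calls in sphinx-tribes.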

Nodes and Edges Structure (JSON)

Node Structure

Each node will have the following structure:

{
  "node_type": "<NodeType>",
  "node_data": {
    "name": "<Name>"
  }
}

Edge Structure

Each edge will have the following structure:

{
  "edge": {
    "edge_type": "<EdgeType>"
  },
  "source": {
    "node_type": "<SourceNodeType>",
    "node_data": {
      "name": "<SourceName>"
    }
  },
  "target": {
    "node_type": "<TargetNodeType>",
    "node_data": {
      "name": "<TargetName>"
    }
  }
}
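The node and edge shapes above map fairly directly onto Go structs with JSON tags. The sketch below is one possible encoding; the type names (`Node`, `NodeData`, `EdgeMeta`, `Edge`) and the helper `EncodeEdge` are hypothetical, and marking `package`, `file`, and `line_number` as optional via `omitempty` is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeData carries a node's attributes; only "name" is always present.
type NodeData struct {
	Name       string `json:"name"`
	Package    string `json:"package,omitempty"`
	File       string `json:"file,omitempty"`
	LineNumber int    `json:"line_number,omitempty"`
}

// Node matches the "Node Structure" JSON shape.
type Node struct {
	NodeType string   `json:"node_type"`
	NodeData NodeData `json:"node_data"`
}

// EdgeMeta holds the edge's own attributes.
type EdgeMeta struct {
	EdgeType string `json:"edge_type"`
}

// Edge matches the "Edge Structure" JSON shape.
type Edge struct {
	Edge   EdgeMeta `json:"edge"`
	Source Node     `json:"source"`
	Target Node     `json:"target"`
}

// EncodeEdge renders an edge as compact JSON.
func EncodeEdge(e Edge) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	e := Edge{
		Edge:   EdgeMeta{EdgeType: "CALLS"},
		Source: Node{NodeType: "Function", NodeData: NodeData{Name: "CalculateTotal"}},
		Target: Node{NodeType: "Function", NodeData: NodeData{Name: "ApplyDiscount"}},
	}
	out, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because `encoding/json` emits struct fields in declaration order, the key order in the output follows the structs above.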

Example Nodes

Node

{
  "node_type": "Function",
  "node_data": {
    "name": "CalculateTotal",
    "package": "github.com/example/project/helpers",
    "file": "helpers/order_helpers.go",
    "line_number": 42
  }
}

Example Edges

Function -> CALLS -> Function

{
  "edge": {
    "edge_type": "CALLS"
  },
  "source": {
    "node_type": "Function",
    "node_data": {
      "name": "CalculateTotal"
    }
  },
  "target": {
    "node_type": "Function",
    "node_data": {
      "name": "ApplyDiscount"
    }
  }
}
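A CALLS edge like the one above could be derived by walking each function body for call expressions. The sketch below is purely syntactic and makes a strong simplifying assumption: it only records calls to plain functions declared in the same file, so method calls, cross-package calls, and builtins are ignored. Resolving those properly would likely need type information from go/types. `CallEdge` and `ExtractCalls` are hypothetical names.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// CallEdge records that Caller invokes Callee.
type CallEdge struct {
	Caller string
	Callee string
}

// ExtractCalls returns caller/callee pairs for calls to functions
// declared in the same file (a deliberate simplification).
func ExtractCalls(filename, src string) ([]CallEdge, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	// First pass: collect the names of top-level, non-method functions.
	declared := make(map[string]bool)
	for _, d := range file.Decls {
		if fd, ok := d.(*ast.FuncDecl); ok && fd.Recv == nil {
			declared[fd.Name.Name] = true
		}
	}
	// Second pass: inspect each body for calls to those names.
	var edges []CallEdge
	for _, d := range file.Decls {
		fd, ok := d.(*ast.FuncDecl)
		if !ok || fd.Body == nil {
			continue
		}
		ast.Inspect(fd.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if id, ok := call.Fun.(*ast.Ident); ok && declared[id.Name] {
				edges = append(edges, CallEdge{Caller: fd.Name.Name, Callee: id.Name})
			}
			return true
		})
	}
	return edges, nil
}

func main() {
	src := `package helpers

func ApplyDiscount(total float64) float64 { return total * 0.9 }

func CalculateTotal(prices []float64) float64 {
	sum := 0.0
	for _, p := range prices {
		sum += p
	}
	return ApplyDiscount(sum)
}
`
	edges, err := ExtractCalls("order_helpers.go", src)
	if err != nil {
		panic(err)
	}
	for _, e := range edges {
		fmt.Printf("%s -> CALLS -> %s\n", e.Caller, e.Callee)
	}
}
```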

Required Output

Deliverables

Acceptance Criteria

tomsmith8 commented 2 months ago

Bounty posted: https://community.sphinx.chat/bounty/2442

tomsmith8 commented 2 months ago

Since the codebase is in Go and the task is to parse Go code, it's most efficient to write the parsing logic in Go. AWS Lambda supports custom runtimes and container images, so we can package the Go application as a Docker image for Lambda deployment.

If anyone has experience with this, let me know.