seantiz / dryfold-cli

A tool to help me plan C++ codebase migration ahead of time. Dryfold breaks the work down into kanban-board tasks.
Apache License 2.0

fix: Classes being processed as "type unknown" #5

Closed seantiz closed 1 week ago

seantiz commented 2 weeks ago

This is way beyond an edge case. We have to debug it and then back the fix up with some bulk processing of C++ libraries.

seantiz commented 2 weeks ago

Some ways we've brought the edge cases within scope when working on the TypeScript version:

analyseCodeTasks()

We changed this to initialise a layerInfo object from the get-go. It represents a unit of "tasks", which it is analyseCodeTasks()'s job to shape and return.

Utility Layer Variable and Matching Pattern

We added the isUtilityVariable to our code task analyser.

    tree.rootNode.text.match(/runtime|utils|helpers|shims|parser/i) ||
    tree.rootNode.text.includes('type Token') ||  // Is Token a parser Type definitively in Typescript?
    tree.rootNode.descendantsOfType('export_statement')
        .some(node => node.text.match(/function\s+(get|is|has|create|parse|tokenize)/)) || // Added parse/tokenize
    tree.rootNode.descendantsOfType('type_alias_declaration')

Core Layer Variable and Matching Pattern

I feel we're still on very shaky ground with this one! Needs review.

    const isCoreModule = (
        tree.rootNode.text.includes('extends APIResource') ||
        tree.rootNode.text.includes('import { APIResource }') ||
        tree.rootNode.text.match(/Messages|Streams|Resources/i) ||
        tree.rootNode.descendantsOfType('export_statement')
            .some(node => node.text.match(/class\s+(Message|Stream|Resource|Client)/))
    );

Differentiating Between Parsing Modules and Utility Modules (Plus layerInfo {} object)

We filter for anything that is a utility module but not a core module and push new layerInfo params to those modules.
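The filtering described above could be sketched roughly as follows. This is a minimal sketch, not the code in the repo: it swaps the tree-sitter node queries for plain regexes over the raw source string so that it runs on its own, and the `LayerInfo` shape, the `reasons` field and the `classifyLayer` name are all hypothetical.

```typescript
// Hypothetical shape: the real layerInfo fields are not shown in this thread.
interface LayerInfo {
    layer: 'core' | 'utility' | 'unknown';
    reasons: string[];
}

// Regex-only stand-ins for the tree-sitter checks quoted above (an assumption
// made so the sketch is self-contained; the real code inspects syntax nodes).
const UTILITY_TEXT = /runtime|utils|helpers|shims|parser/i;
const UTILITY_EXPORT = /export\s+function\s+(get|is|has|create|parse|tokenize)/;
const CORE_TEXT = /extends APIResource|import \{ APIResource \}/;
const CORE_NAMES = /Messages|Streams|Resources/i;
const CORE_EXPORT = /export\s+(default\s+)?class\s+(Message|Stream|Resource|Client)/;

function classifyLayer(source: string): LayerInfo {
    // Start from a fully initialised object so every path returns the same shape.
    const info: LayerInfo = { layer: 'unknown', reasons: [] };

    const isCore =
        CORE_TEXT.test(source) || CORE_NAMES.test(source) || CORE_EXPORT.test(source);
    const isUtility =
        UTILITY_TEXT.test(source) ||
        source.includes('type Token') ||
        UTILITY_EXPORT.test(source);

    if (isCore) {
        info.layer = 'core';
        info.reasons.push('matched a core pattern');
    } else if (isUtility) {
        // Utility but not core: the case the filter above is after.
        info.layer = 'utility';
        info.reasons.push('matched a utility pattern');
    }
    return info;
}
```

Checking core first means a file that matches both sets of patterns is treated as core, which mirrors the "utility but not core" filter. Anything that matches neither keeps the initial "unknown" value.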

seantiz commented 2 weeks ago

There's a chance that some part of our parsing logic can't handle potentially malformed code from .ts source files.

While processing the Anthropic SDK's TS library, we saw the string 'tree-sitter' being passed as the input value of a URL object. There's still more tracing to do to understand where and why this is happening.

Chrome DevTools

    Block
        href: <value unavailable>
        raiseException: true
    Local
        this: URL
        arguments: Arguments ['tree-sitter', callee: (...), Symbol(Symbol.iterator): ƒ]
        base: undefined
        input: "tree-sitter"
        parseSymbol: undefined

seantiz commented 2 weeks ago

In the end I couldn't find any evidence of this being true. Whenever I arbitrarily removed 120 LOC from core.ts, tree-sitter parsed everything without explicit errors.

It seems there's a hard character limit on how much you can pass into the tree-sitter parser, so we're now chunking larger files. As a compromise for now (not ideal), the layer type of chunked files is set to "unknown".
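For reference, line-based chunking along these lines might look like the sketch below. The `chunkSource` name and the `maxChars` parameter are hypothetical, and the actual limit is not stated anywhere in this thread, so the caller has to supply it.

```typescript
// Split source text into chunks of at most maxChars characters, breaking only
// at line boundaries. maxChars is a placeholder: the real limit is not known here.
function chunkSource(source: string, maxChars: number): string[] {
    // Keep each line's trailing newline so the chunks concatenate back to the input.
    const lines = source.match(/[^\n]*\n|[^\n]+$/g) || [];
    const chunks: string[] = [];
    let current = '';

    for (const line of lines) {
        // Flush the current chunk if adding this line would exceed the limit.
        if (current.length > 0 && current.length + line.length > maxChars) {
            chunks.push(current);
            current = '';
        }
        // A single line longer than maxChars still ends up as its own oversized chunk.
        current += line;
    }
    if (current.length > 0) chunks.push(current);
    return chunks;
}
```

Because the boundaries ignore syntax, a class or function can be split across two chunks. That is one reason layer detection on chunked files is unreliable and why falling back to "unknown" is only a stopgap.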

seantiz commented 1 week ago

Reopened because we have a lot of modules with layer value "unknown" after refactoring in v0.0.7.

seantiz commented 1 week ago

I think we're getting "unknown" layer returns exclusively from the .dot file logic:

    import * as path from 'path';

    export function createDot(moduleMap: Map<string, DesignValues>) {
        let dot = 'digraph Dependencies {\n';
        dot += '  node [shape=box];\n';

        // Track files with known and unknown layers
        const unknownLayers = new Set<string>();
        const knownNodes = new Set<string>();

        // First pass - collect nodes with known layers
        for (const [file, data] of moduleMap) {
            const nodeName = path.basename(file);
            const className = nodeName.replace('.h', '');

            // Falls back to 'unknown' if there is no entry for this class
            // or the entry has no type
            const layer = (() => {
                const relationships = data.moduleRelationships;
                if (!relationships || !relationships[className]) {
                    unknownLayers.add(nodeName);
                    return 'unknown';
                }
                return relationships[className].type || 'unknown';
            })();

            if (layer !== 'unknown') {
                knownNodes.add(nodeName);
                dot += `  "${nodeName}" [label="${nodeName}", layer="${layer}"];\n`;
            }
        }
        // (excerpt ends here; the rest of createDot() is not shown)

Which means the core problem is how we detect moduleRelationships.
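To see which branch of that lookup is producing "unknown" for a given file, a small diagnostic along these lines might help. It is a sketch under assumptions: `DesignValuesLike` models only the fields the lookup reads (the full DesignValues type is not shown in this thread), `explainUnknownLayer` is a hypothetical helper, and the basename is taken with a plain split on `/` rather than `path.basename` so the snippet has no imports.

```typescript
// Minimal stand-in for DesignValues: only the fields read by the layer lookup.
interface DesignValuesLike {
    moduleRelationships?: Record<string, { type?: string }>;
}

type UnknownReason = 'no-relationships' | 'no-entry-for-class' | 'entry-has-no-type';

// Returns why createDot()'s lookup would yield 'unknown', or null if it would not.
function explainUnknownLayer(file: string, data: DesignValuesLike): UnknownReason | null {
    // Same derivation as createDot(): basename, then strip '.h'.
    // (POSIX-style paths assumed for the split.)
    const nodeName = file.split('/').pop() || file;
    const className = nodeName.replace('.h', '');

    const relationships = data.moduleRelationships;
    if (!relationships) return 'no-relationships';

    const entry = relationships[className];
    if (!entry) return 'no-entry-for-class';
    if (!entry.type) return 'entry-has-no-type';
    return null;
}
```

Logging the returned reason per file would show whether the relationships are missing altogether, are keyed under a different name than the one derived from the filename, or are present but never given a type.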

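Going by the lookup in createDot(), modules are being marked as "unknown" when either:

- data.moduleRelationships is missing or has no entry for the class name, or
- the entry exists but has no type.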

The debug logs show we ARE capturing include relationships:

    [DEBUG] lib/osx/poppler-0.66/include/poppler/Object.h includes Array.h
    [DEBUG] lib/osx/poppler-0.66/include/poppler/Object.h includes Dict.h
    [DEBUG] lib/osx/poppler-0.66/include/poppler/Object.h includes Stream.h

But these relationships aren't being translated into moduleRelationships in the DesignValues structure.
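One way to close that gap is to fold the captured include edges into a relationships structure keyed by class name, so that the createDot() lookup finds an entry. The sketch below is illustrative only: `ModuleRelationship`, `buildRelationships` and especially the rule used to pick a `type` are hypothetical and are not the pattern matching that ended up in the repo.

```typescript
// Hypothetical entry shape; only `type` is known to be read by createDot().
interface ModuleRelationship {
    type: string;
    includes: string[];
    includedBy: string[];
}

// Derive the lookup key the same way createDot() does: basename minus '.h'.
function classNameOf(header: string): string {
    const base = header.split('/').pop() || header;
    return base.replace('.h', '');
}

// edges are [includingFile, includedFile] pairs, e.g. taken from the debug logs.
function buildRelationships(edges: Array<[string, string]>): Map<string, ModuleRelationship> {
    const rel = new Map<string, ModuleRelationship>();
    const ensure = (header: string): ModuleRelationship => {
        const key = classNameOf(header);
        let entry = rel.get(key);
        if (!entry) {
            entry = { type: 'unknown', includes: [], includedBy: [] };
            rel.set(key, entry);
        }
        return entry;
    };

    for (const [from, to] of edges) {
        ensure(from).includes.push(to);
        ensure(to).includedBy.push(from);
    }

    // Illustrative rule only (an assumption, not the project's heuristic):
    // headers that include others are labelled 'core', pure leaves 'utility'.
    for (const entry of rel.values()) {
        if (entry.includes.length > 0) entry.type = 'core';
        else if (entry.includedBy.length > 0) entry.type = 'utility';
    }
    return rel;
}
```

A Map is used here instead of a plain object so that class names such as `constructor` cannot collide with inherited properties. It would need converting to whatever shape DesignValues.moduleRelationships actually expects.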

seantiz commented 1 week ago

Closed - see 0155cab and v.0.8.0 changes for the extended pattern matching based on module relationships and filenames