github / stack-graphs

Rust implementation of stack graphs
https://docs.rs/stack-graphs/*/stack_graphs/
Apache License 2.0
771 stars 141 forks source link

Error-Resilient Indexing: Ignoring Failing Nodes to Build Partial Stack Graphs #428

Open nohehf opened 7 months ago

nohehf commented 7 months ago

Hello,

I've been facing some issues when indexing source code using the tree-sitter-stack-graphs-python crate, particularly with failed to build stack graph errors linked to issues #426 & #403.

I'm aware these might be resolved in future updates, and I'm also considering diving into the .tsg DSL as I become more familiar with this project. Tho I believe that given the complexity of various codebases, similar errors could arise and the future, hence the idea of and having a way to safely index, ignoring potential errors.

I noticed that GitHub's code navigation still functions correctly on the very same files that are throwing errors. This leads me to believe there might be a workaround to build the stack graph despite these errors.

Is there a known method to bypass such errors during the graph building process, so the entire codebase isn't compromised by issues in single nodes? If such a feature exists or could be implemented, it would greatly enhance the tool's resilience.

Looking forward to your insights or suggestions.

PS: thank you for this amazing project and related talks / articles, it's been really enjoyable to learn about this

hendrikvanantwerpen commented 7 months ago

Recently https://github.com/github/stack-graphs/pull/419 was merged, which ensures non-failing files are indexed even if others are not. Is that what you are after?

nohehf commented 7 months ago

Oh ok I did notice this after posting thank you. I guess the CLI output might not reflect this (seemed like a full crash to me). That is indeed, what I was looking for, but do you think it could be more granular than per file ?

Let's say I have the following main.py file.

foo = "bar"

# this line will fail the indexing
baz = lambda a: a * 2

qux = foo

It does fail on the baz assignment:

> tree-sitter-stack-graphs-python index main.py
/Users/nohehf/tmp/py/main.py: failed to build stack graph
    0: Error executing statement edge @name.def -> @param.param_name at (822, 3)
       src/stack-graphs.tsg:822:3:
       822 |   edge @name.def -> @param.param_name
           |   ^
       in stanza
       src/stack-graphs.tsg:818:1:
       818 | (parameter/identifier) @param @name
           | ^
       matching (identifier) node
       /Users/nohehf/tmp/py/main.py:4:14:
       4 | baz = lambda a: a * 2
         |              ^
    1: Evaluating edge sink
    2: Undefined scoped variable [syntax node identifier (4, 14)].param_name

Error: failed to build stack graph

And it seems like the file has not be indexed at all tree-sitter-stack-graphs-python status main.py outputs nothing.

As I mentioned, GitHub code navigation still works in this file ? Do they run additional / custom .tsg files ? Or have custom error recovery techniques ?

Do you think it would be possible to only fail on portions of the file ? Like foo and qux would be indexed in the stack graph, and only the problematic / failing node baz is ignored ?

PS: Thanks for the quick answer :))

hendrikvanantwerpen commented 7 months ago

As I mentioned, GitHub code navigation still works in this file? Do they run additional / custom .tsg files? Or have custom error recovery techniques?

If stack graph data is unavailable, either because the file failed or the language is not supported yet, we fall back on simple name matching using tags. This is based on tags queries that are found in many grammar repos, which don't take scoping into account.

[...] do you think it could be more granular than per file?

Perhaps. It would require a more invasive change in tree-sitter-graph, which is responsible for executing the TSG files. The simplest version would be to skip graph statements like edge and attr if it uses scoped variables that are undefined. But I would not like that to happen silently, because it is an indication that there's something missing or an error. They should at least be surfaced as warnings, and perhaps a flag could switch between relax and strict mode. This might be possible while keeping clear execution semantics for the TSG, but it would be some work.

nohehf commented 7 months ago

If stack graph data is unavailable, either because the file failed or the language is not supported yet, we fall back on simple name matching using tags. This is based on tags queries that are found in many grammar repos, which don't take scoping into account.

That does make sense, thank you.

This is based on tags queries that are found in many grammar repos

Are you referring to tools like ctags or related tools / techniques ?

Perhaps. It would require a more invasive change in tree-sitter-graph, which is responsible for executing the TSG files. The simplest version would be to skip graph statements like edge and attr if it uses scoped variables that are undefined. But I would not like that to happen silently, because it is an indication that there's something missing or an error. They should at least be surfaced as warnings, and perhaps a flag could switch between relax and strict mode. This might be possible while keeping clear execution semantics for the TSG, but it would be some work.

Yes I 100% agree that this should not be silent (as skipping files). I would like to get my hands on this if I have some time, but I believe I need to familiarize myself quite a bit more with the repo.

Thank you for your answers on the matter, I think this issue could be flagged as a feature idea / proposal, no bug here :))

hendrikvanantwerpen commented 7 months ago

I would like to get my hands on this if I have some time, but I believe I need to familiarize myself quite a bit more with the repo.

Let me know if you want to do this, then I can point you to the right places and discuss a bit what would be a good approach. I suggest to start with a new issue on tree-sitter-graph to track the effort and discuss overall approach. I think it is easiest to break it up in two parts, (i) implementing a way to return warnings from an execution run, and (ii) changing the behavior around missing nodes and generate warnings for those cases.

nohehf commented 7 months ago

Thank you i'll make sure to ping you back here when I have some time. I first need to address #430 :)