github / semantic

Parsing, analyzing, and comparing source code across many languages

Scope graphing does not take into account a language’s standard library #165

Open patrickt opened 5 years ago

patrickt commented 5 years ago

As reported in #164, people find the lack of standard-library awareness unintuitive and limiting. This is hard, though, as I pointed out in the issue:

There are a number of suboptimal solutions: we could re-parse and re-graph the entire standard library of a given language before we start graphing user code, which would be prohibitively slow; we could hardcode details of each supported language’s stdlib into Semantic itself, which is bad for maintainability; we could try to do the work in advance and cache it, but getting caching right is really hard. And whatever method we choose needs to scale to all of our core languages and languages not yet integrated with Semantic.

Do note that just because the scope graphing mechanism is unaware of the standard library doesn’t mean that it ignores stdlib calls—they are tracked and graphed like any other call, they just lack position information. So this isn’t a “we’re missing stdlib support” problem, it’s a “how do we annotate stdlib calls with their position information (if any), and how do we get that position information from the stdlib?” problem.
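To illustrate the distinction above, here is a hypothetical sketch (in Python, not Semantic’s actual Haskell types): a scope-graph reference whose source position is optional, so a stdlib call can still be tracked and graphed even when there is no source file to point into. The `Reference` type and its fields are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical sketch, not Semantic's real representation: a reference
# node whose position is optional, so stdlib calls are still recorded
# even when no source position is available for them.
@dataclass
class Reference:
    name: str
    position: Optional[Tuple[int, int]] = None  # (line, column) if known

user_call = Reference("my_helper", position=(12, 4))  # user code: has a position
stdlib_call = Reference("puts")                       # stdlib: graphed, but positionless
```

The open question in this issue is precisely how to fill in `position` for references like `stdlib_call`.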

zfy0701 commented 5 years ago

It seems hardcoding is a good start for users. And I assume the file containing all the hardcoded details could potentially be generated by analyzing the standard library using Semantic itself.

patrickt commented 5 years ago

This is made problematic by the fact that runtime environments matter. It’s not clear, given a chunk of arbitrary JS, whether it’s intended to run in Node, or in a browser, or to be compiled to wasm.

zfy0701 commented 5 years ago

I would argue that for JS, if a package.json is present, you could always assume it's a Node environment with access to all the Node/WASM libraries.

It's possible that the code is actually meant to run in a browser, in which case any reference to Node.js will cause a runtime failure or a bundling failure. But before bundling, I think it's fine for the code to reference whatever it wants.
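The suggestion above amounts to a one-line heuristic. A hypothetical sketch (and, as the reply below argues, exactly the kind of assumption Semantic avoids baking in):

```python
import os

def guess_js_environment(project_root: str) -> str:
    """Hypothetical heuristic sketch of the suggestion above: treat the
    presence of package.json as evidence of a Node environment.
    Semantic does not actually do this."""
    if os.path.exists(os.path.join(project_root, "package.json")):
        return "node"
    return "browser"
```

Note the heuristic misclassifies the common case of browser apps built with webpack or browserify, which also ship a package.json.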

patrickt commented 5 years ago

I would argue that for JS, if a package.json is present, you could always assume it's a Node environment

These are the kind of assumptions we don’t want to make, I’m afraid. Semantic is not a tool for Node, it’s a tool for cross-language abstract interpretation: we shouldn’t give Node priority, especially given that the webpack and browserify tools allow deploying package.json-based applications to browsers. Privileging one JS environment over others isn’t the right choice, even if it would provide better information in the short-term.

In the large, every assumption that ties us to implementation details of a language’s runtime or packaging situation is at cross purposes with our goal for Semantic, which is to write a toolkit powerful enough to analyze arbitrary code across a range of programming languages without reimplementing those languages’ canonical interpreters and runtimes or irrevocably tying ourselves to them. Indeed, one of the reasons we really like abstract interpretation is that we don’t have to provide all runtime primitives: abstract interpretation is capable of skipping over the constructs it doesn’t understand and still returning useful values.

Other things that make this difficult/not amenable to a quick fix:

As you can see, there’s a lot to think about, so we don’t want to rush into an implementation that we might later regret. But thank you for your suggestions, and your enthusiasm! We look forward to having the spare cycles to take a swing at this problem.

robrix commented 5 years ago

Just to reinforce what @patrickt’s said, the goal is for the caller to determine the assumptions they want us to analyze under: e.g. this version of that language with these dependencies. We’re not quite there yet in general, but that’s the plan.

Longer-term (e.g. in a world post-#119), I’m hoping this will mean that we’ll have different stubs (at least) of standard libraries represented as data somewhere which callers can use, if they wish. (Generating these from sources would be nice, where feasible, but I haven’t put much thought into that yet.) Likewise, I’m hoping that we’ll be able to accommodate different language versions in a single AST/compiler, as indeed the parsers are currently designed. But regardless, we are trying to bake fewer assumptions into the system as time goes on, and instead allow callers to select them for themselves.
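To make “standard libraries represented as data” concrete, here is a hypothetical sketch of what a caller-supplied stub table might look like: a mapping the analysis could consult instead of re-parsing the stdlib. The names, shapes, and lookup policy are all illustrative assumptions, not a description of Semantic’s design.

```python
from typing import Optional

# Hypothetical caller-supplied stubs for (part of) the Ruby stdlib:
# name -> minimal metadata the analysis would need. Illustrative only.
RUBY_STDLIB_STUBS = {
    "puts": {"kind": "method", "arity": -1},  # -1: variadic
    "gets": {"kind": "method", "arity": 0},
}

def resolve(name: str, user_scope: dict, stubs: dict) -> Optional[dict]:
    """Look a name up in user code first, falling back to stdlib stubs."""
    if name in user_scope:
        return user_scope[name]
    return stubs.get(name)
```

Generating such tables from stdlib sources, where feasible, is the “generating these from sources” idea mentioned above.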

Separately, we might provide some heuristics in some cases, like how the .rb path extension gets mapped to Ruby; but I’m mentally planning to separate those into the driver instead of the library wherever possible. (See also #136.)
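A driver-side heuristic like the `.rb` → Ruby mapping mentioned above could be as simple as the following hypothetical sketch; keeping it in the driver rather than the library means callers can override or bypass it:

```python
import os
from typing import Optional

# Hypothetical driver-side extension table; entries beyond .rb are
# illustrative assumptions, not Semantic's actual mapping.
EXTENSION_TO_LANGUAGE = {
    ".rb": "Ruby",
    ".py": "Python",
    ".js": "JavaScript",
    ".go": "Go",
}

def guess_language(path: str) -> Optional[str]:
    """Map a file path to a language by extension; None if unknown."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_LANGUAGE.get(ext)
```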

zfy0701 commented 5 years ago

Completely understood. Another question I have, similar to this issue, is how you handle Java's dependencies, which are mostly bytecode. I'm not sure you want to implement that mechanism inside Semantic itself, so I think maybe it's simplest to leave some options to the user. Say, if I want to analyze a Ruby project that depends on the Ruby stdlib, I could provide a list of external symbols up front; if I want to analyze Java, I could generate external symbols from the bytecode and feed that information to the analyzer.
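The workflow proposed above can be sketched as follows. This is a hypothetical illustration, not Semantic's API: the caller supplies a set of external symbols (hand-written Ruby stubs, or names extracted from Java bytecode by some other tool), and the analyzer classifies each reference as local, external, or unresolved instead of treating everything unknown as undefined.

```python
# Hypothetical sketch: classify references against user-defined names
# plus a caller-supplied external-symbol list. All names are invented.
def classify_references(references: list,
                        defined: set,
                        external: set) -> dict:
    out = {}
    for name in references:
        if name in defined:
            out[name] = "local"       # defined in the code under analysis
        elif name in external:
            out[name] = "external"    # known stdlib/dependency symbol
        else:
            out[name] = "unresolved"  # genuinely unknown
    return out

result = classify_references(
    ["helper", "puts", "oops"],
    defined={"helper"},
    external={"puts", "gets"},  # e.g. from Ruby stubs or Java bytecode
)
```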

robrix commented 5 years ago

@zfy0701: Broadly yep, that’s the plan. Handling dependencies modularly is essential for performance, especially at scale, and accepting externally supplied symbol information is key to maintaining that separation.

Depending on the details of the analysis producing the symbols, it could be mechanically tricky, but that’s how we’ve been thinking about it 👍