Open odipar opened 7 months ago
This is due to how virtual dispatch is calculated and currently by design. Virtual dispatch in Java is incredibly complicated to approximate well, and one of the main challenges is finding the right trade-off between open-world and closed-world assumptions. An open-world assumption roughly states that anything matching the type can be a call target - this is necessary in many cases, since analysed Java code rarely exists in complete isolation. A closed-world assumption on the other hand states that a call target only is possible if we can observe data flow of an object of the corresponding type. There are many other details, but this is the rough overview needed to answer this question.
Now lambdas and lambda-like callbacks generally fall into the closed-world case by necessity: we can't dispatch every apply
call to every single lambda (that matches the type) - that's just wrong.
In your above test cases, flow1 is an ordinary lambda and flow3 is an anonymous class instantiation, which is sort of like a lambda. Both of those are common cases for what you expect to pass as a callback. Now flow2 uses a completely ordinary class declaration plus ordinary instantiation, which doesn't look like a lambda at all, so our virtual dispatch heuristics treat it differently and it therefore isn't considered as the source of a closed-world dispatch flow, and hence we don't discover the call edge.
We're always looking to improve our virtual dispatch heuristics, and maybe there's something we can tweak here, but it's not something that's easy nor likely to happen in the short term. But thank you for reporting this, I'll take a note for potential consideration for the next time we look at dispatch improvements.
Thank you for your elaborate reply. I learned a lot from your answer and understand the challenges wrt virtual dispatch. That said, I'm a bit worried about false negatives related to (SQL) code injection in conjunction with (nested, higher order) virtual dispatch patterns. But again, I understand there are limitations to what you can detect statically with codeql.
I hit on an issue while implementing a taint tracking use case. So I've prepared a minimal example that showcases the issue: Here is the java code:
I expect all flows to be in the query result when we taint source(x) with sink(x). However flow2 is not reported?
Here is the codeql query: