github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.53k stars 1.5k forks source link

explicit java Function<X,Y> implementation is not tainted? #15494

Open odipar opened 7 months ago

odipar commented 7 months ago

I hit on an issue while implementing a taint tracking use case. So I've prepared a minimal example that showcases the issue: Here is the java code:

import java.util.Optional;
import java.util.function.Function;

public class SourceToSinkBug {
    public class DoubleToString implements Function<Double, String> {
        public String apply(Double x) { return x.toString(); }
    } 
    public String flow0() { // toString() is tainted
        Double source0 = 1.0;
        String sink0 = source0.toString();
        return sink0;
    }

    public String flow1() { // Lambda is tainted
        Double source1 = 1.0;
        Optional<Double> opt1 = Optional.of(source1);
        Optional<String> map1 = opt1.map(x -> x.toString());
        String sink1 = map1.get();
        return sink1;
    }
    public String flow2() { // BUG?: DoubleToString *isn't* tainted?
        Double source2 = 2.0;
        Optional<Double> opt2 = Optional.of(source2);
        Optional<String> map2 = opt2.map(new DoubleToString());
        String sink2 = map2.get();
        return sink2;
    }
    public String flow3() { // Inline function is tainted
        Double source3 = 3.0;
        Optional<Double> opt3 = Optional.of(source3);
        Optional<String> map3 = opt3.map(
            new Function<Double,String>(){ public String apply(Double x) { return x.toString(); }});
        String sink3 = map3.get();
        return sink3;
    }
}

I expect all flows to be in the query result when we taint source(x) with sink(x). However flow2 is not reported?

Here is the codeql query:

import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.TaintTracking

class Source1 extends VarAccess {
  Source1() { this.getVariable().getName().matches("source%")}
}
class Sink1 extends VarAccess {
  Sink1() { this.getVariable().getName().matches("sink%") }
}

// source% to sink%
module Config implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { source.asExpr() instanceof Source1 }
  predicate isSink(DataFlow::Node sink) { sink.asExpr() instanceof Sink1 }
}

module MyFlow = TaintTracking::Global<Config>;

from DataFlow::Node source, DataFlow::Node sink
where MyFlow::flow(source, sink)
select source, sink, "source to sink"
aschackmull commented 7 months ago

This is due to how virtual dispatch is calculated and currently by design. Virtual dispatch in Java is incredibly complicated to approximate well, and one of the main challenges is finding the right trade-off between open-world and closed-world assumptions. An open-world assumption roughly states that anything matching the type can be a call target - this is necessary in many cases, since analysed Java code rarely exists in complete isolation. A closed-world assumption on the other hand states that a call target only is possible if we can observe data flow of an object of the corresponding type. There are many other details, but this is the rough overview needed to answer this question. Now lambdas and lambda-like callbacks generally fall into the closed-world case by necessity: we can't dispatch every apply call to every single lambda (that matches the type) - that's just wrong. In your above test cases, flow1 is an ordinary lambda and flow3 is an anonymous class instantiation, which is sort of like a lambda. Both of those are common cases for what you expect to pass as a callback. Now flow2 uses a completely ordinary class declaration plus ordinary instantiation, which doesn't look like a lambda at all, so our virtual dispatch heuristics treat it differently and it therefore isn't considered as the source of a closed-world dispatch flow, and hence we don't discover the call edge. We're always looking to improve our virtual dispatch heuristics, and maybe there's something we can tweak here, but it's not something that's easy nor likely to happen in the short term. But thank you for reporting this, I'll take a note for potential consideration for the next time we look at dispatch improvements.

odipar commented 7 months ago

Thank you for your elaborate reply. I learned a lot from your answer and understand the challenges wrt virtual dispatch. That said, I'm a bit worried about false negatives related to (SQL) code injection in conjunction with (nested, higher order) virtual dispatch patterns. But again, I understand there are limitations to what you can detect statically with codeql.