github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.51k stars 1.49k forks source link

BUGS report #9864

Closed flowingair closed 2 years ago

flowingair commented 2 years ago

In some situations,codeql will connect unrelated nodes and believe that there is a flow between them. for example,when using TaintTracking::Configuration or sinkNode.

Example:

Java file:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;

public class FakeSink extends HashMap {
    public FakeSink() {
    }

    public Object put(Object name, Object value) {
        try {
            return new URL((String) value);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } finally {
            return null;
        }
    }
}

import java.net.MalformedURLException;
import java.net.URL;

public class FakeSource {
    public FakeSource()
    {
        String data="";
        Sink sink = new Sink(data);
        try {
            new URL(sink.getData());
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }
}

import java.util.HashMap;
import java.util.Map;

public abstract class Path {
    private Map<String, String> initParams = new HashMap<String, String>();

    public void init(String featureName, Map<String, String> initParam) {
        this.initParams = initParam;
    }

    public Map<String, String> getInitParams() {
        return this.initParams;
    }
}

public class Sink {
    private String data;

    public Sink(String data) {
        this.data = data;
    }

    public String getData() {
        return data;
    }
}

public class Sink2 extends Path {
    public Sink2(String data) {
        getInitParams().put("name", data);
    }
}

ql codes

class Template extends TaintTracking::Configuration {
  Template() { this = "template" }

  override predicate isSource(DataFlow::Node source) {
    exists(Constructor con |
      con.getNumberOfParameters() = 1 and
      "Ljava/lang/String;" = con.getParameterType(0).getTypeDescriptor() and
      con.getAParameter() = source.asParameter()
    )
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(ConstructorCall cc |
      sink.asExpr() = cc and
      cc.getConstructor().hasQualifiedName("java.net", "URL", "URL")
    )
  }
}
result
1 data : String Sink.java:4:17
2 data : String Sink.java:5:21
3 this [post update] [data] : String Sink.java:5:9
4 new Sink(...) [data] : String FakeSource.java:8:21
5 sink [data] : String FakeSource.java:10:21
6 parameter this [data] : String Sink.java:8:19
7 this <.field> [data] : String Sink.java:9:16
8 data : String Sink.java:9:16
9 getData(...) : String FakeSource.java:10:21
10 new URL(...) FakeSource.java:10:13
1 data : String Sink2.java:2:18
2 data : String Sink2.java:2:18
3 (...)... : String FakeSink.java:11:28
4 new URL(...) FakeSink.java:11:20
edoardopirovano commented 2 years ago

Many thanks for getting in touch with this question! It looks to me like things are working as expected. I ran:

from DataFlow::PathNode source, DataFlow::PathNode sink, Template conf
where
  conf.hasFlowPath(source, sink)
select source, sink

using your definition of Template above and your example. I get four results, which I think correspond to your second table (I'm not sure what your first table is showing). These correspond to:

  1. A flow from the constructor of Sink to the sink.getData() usage in FakeSource. This looks like a valid flow from me: you've defined sources as the argument of a constructor that takes in a String (which is certainly the case for the constructor of Sink), and sinks as usages of data in the constructor of a URL (which the usage of sink.getData() in FakeSource is). It's a little confusing that the way you've named your classes the source and sink are flipped, but the analysis still looks valid.
  2. A flow from the constructor of Sink2 into the usage in FakeSink. Similarly to the above, the sources and sinks seem valid. Why the flow happens is a bit more subtle, but FakeSink extends HashMap, so it could in theory be a return value of getInitParams(). As it happens, we can see in this case that it will never be, because there's nowhere in the code we set initParams to something which is a FakeSink. Our analysis isn't smart enough to see that, but I don't think it's a bug as such, just a subtlety of some paths we could rule out but don't.
  3. There's a flow from the constructor of URL to itself, which seems valid since URL is a constructor with a single String argument which makes its argument a source by your definition, but it's also the constructor you've defined the parameters for as being sinks, so it's also a sink.
  4. Similarly for the other case of a constructor of URL.

If you'd like to elaborate on what you're trying to achieve, I'd be happy to further help you write a query that can do that.

flowingair commented 2 years ago

Thank you for your reply.

I am analyzing a huge project with codeql.But the result is too bad:codeql is misleading by those bugs and thus useful imformations are blocked by those wrong ones.I really believe that those bugs should be fixed or codeql is just another tiny toys for helloworld projects.

Both tables shows how the codeql is misled.

For the first bugs,just like what you know, codeql links unrelated nodes because it cant determine the return value from methods.This can be avoid by not using TaintTracking::Configuration and sinkNode. For the second bugs,in the first table,source is the constructor of Class Sink, sink is java.net.URL.There is no way that the source can flow to sink (the Class FakeSource have never been called).But codeql links the this from this.data to any ConstructorCall of Class Sink, and tells that the constructor of Class Sink have called the codes from FakeSource.

In a huge project with so many beans and polymorphism,its a disaster.In order to found something useful,I have to followed tens of thousands of flows.I really need a more accurate way to analyzes codes. what should i do?

edoardopirovano commented 2 years ago

I'm really sorry that CodeQL isn't working out the way you'd like it to for your analysis. I've outlined above why the flows we are seeing for your case seem valid to me. The fact that some of the code isn't reachable because FakeSource isn't called anywhere isn't, I think, important. The code is "vulnerable" (by the definition of vulnerable given in your Template) regardless of whether it happens to be reachable. If there is significant unreachable code in your code base, then that is a separate concern (and one that we also have queries to look for - see java/ql/src/DeadCode).

If there's specific paths within your code that you want to exclude from having data flow tracked through them, then you could use a sanitizer as described here, but it sounds like your concerns are more general with the approximations we have made in how we track data flow (such as the fact that we track type information but not actual values when determining what paths could be possible), so I'm not sure that will solve your issues.

If you have any specific questions I could answer, do let me know and I'll be happy to help you our further.

aschackmull commented 2 years ago

As for the path in FakeSource, I'll just echo what Edoardo wrote: We generally assume that code exists for a reason and therefore assume that FakeSource can be called in some way (e.g. reflection or by compiling as a library that will be included in another project).

As for the path ending in FakeSink, the FP part of the path really boils down to the virtual dispatch of put in the line getInitParams().put("name", data);. Virtual dispatch like this is insanely tricky to get right, and something that we're continually trying to improve. In this case, I think we ought to have this working out of the box, if the init method had been private, because that would have allowed us to infer a more precise type for initParams. On the other hand, since this case is specifically targeting a well-known interface such as Map and the false dispatch happens to end up in a custom Map implementation, then that might just be covered by improvements that we're currently considering (no timeline yet), so I'd expect that case to be fixed in the medium term (we'll likely shift to prioritise fixed models for Map rather than incidental custom Maps unless we see positive evidence that the dispatch is possible - i.e. make a slight shift in the open-world/closed-world assumptions that are at play here).

flowingair commented 2 years ago

Actually, There are codes that exists for no reason.Its no 80s.Java programer wont build the building from ground.We just glue everytings and somehow, it works.There would be dead codes from JDK, SDK, opensouce library and so on.No one dare to remove them incase the collasping of building.

flowingair commented 2 years ago

Had fix this by removing unnessary codes form below functions: predicate summaryLocalStep(Node pred, Node succ, boolean preservesValue) predicate simpleLocalFlowStep(Node node1, Node node2)

aschackmull commented 2 years ago

Fixed by https://github.com/github/codeql/pull/10416