github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License

Python: How to find type information for a specific variable or object #16961

Open R3x opened 2 months ago

R3x commented 2 months ago

Hello, I am writing Python queries for some libraries, and I am trying to find all the types in the program and group the APIs that use type X, type Y, etc.

But the current API doesn't seem to have a way to connect a type to, say, function parameters or anything of that sort. Does CodeQL support this functionality? Is it possible to get at least an imprecise list of possible types for a parameter of a function?

tausbn commented 2 months ago

We don't really have this sort of functionality in the CodeQL Python libraries. There's an old, unsupported, and bitrotted part of the libraries that does a "points-to" analysis (i.e. figures out what possible values a given program element may point to at runtime), but I tried it and it doesn't seem to work (and at any rate I wouldn't recommend using it).

With that in mind, I have two suggestions I can make.

If you want to access the types as they are specified in the source code, then the key you're looking for is the getAnnotation method on the Parameter class. Note that this will just give you the raw expression that is used as an annotation. We do not have any library support for actually interpreting this expression as a type.
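To make the "raw expression" point concrete, here is a small Python sketch. It uses Python's own ast module (Python 3.9+ for ast.unparse), not CodeQL, so it is only an analogy for what getAnnotation hands back. The function scale and the helper raw_annotations are made up for illustration.

```python
import ast

# Made-up example function: one string annotation, one plain annotation,
# and one parameter with no annotation at all.
source = '''
def scale(values: "list[float]", factor: float = 2.0, label=None):
    return [v * factor for v in values]
'''

def raw_annotations(code):
    """Map each parameter name to its annotation's source text, or None."""
    func = ast.parse(code).body[0]
    return {
        arg.arg: ast.unparse(arg.annotation) if arg.annotation is not None else None
        for arg in func.args.args
    }

annotations = raw_annotations(source)
for name, text in annotations.items():
    print(name, "->", text)
```

The annotation on values comes back as a string literal and the one on factor as a bare name. Nothing here resolves either of them to an actual type, which is the same gap you would have to fill yourself on the CodeQL side.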

As for inference, I think the best option would be to use API graphs, but I think this is likely to be very noisy. What I'm thinking is that you could do something like

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs

predicate has_type(DataFlow::ParameterNode n, string type) {
  exists(API::Node a |
    a.getAValueReachableFromSource() = n and
    a = API::moduleImport(_).getAMember*().getReturn*() and
    type = a.toString()
  )
}

API graphs basically work by approximating the set of possible access paths of values in the code. For instance, if you do

import foo

then wherever that foo flows corresponds to the API graph node given by API::moduleImport("foo"). The getMember method corresponds to attribute accesses, and getReturn to calls. Thus, the line

     a = API::moduleImport(_).getAMember*().getReturn*() and

restricts the set of API graph nodes to be just those that are (calls to) attributes of modules (which is perhaps an okay approximation of "type"). If you remove that line, you'll get a lot more results, some possibly somewhat nonsensical.

Note that this is limited by our ability to figure out what flows where in the code, and this can only ever be an approximation.
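As a concrete illustration of the access paths described above, consider this small Python snippet. The function name parse is made up. The comments show the API graph path I would expect for the value, following the moduleImport / getMember / getReturn correspondence. I have not run the query on this exact snippet, so treat the expectation as an assumption.

```python
import json

def parse(decoder, text):
    # `decoder` is the parameter whose "type" we would like to recover.
    return decoder.decode(text)

# json             ~ API::moduleImport("json")
# json.JSONDecoder ~ ....getMember("JSONDecoder")   (attribute access)
# json.JSONDecoder() ~ ....getReturn()              (call)
decoder = json.JSONDecoder()

# The call result flows into the `decoder` parameter of `parse`, so
# has_type would be expected to relate that parameter to the node above.
result = parse(decoder, '{"a": 1}')
print(result)
```

Because the value reaches parse through an ordinary call, this depends entirely on the data-flow approximation mentioned above. A more indirect route (for example, storing the decoder in a container first) may or may not be followed.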

R3x commented 2 months ago

That's a really great idea! Thanks @tausbn. However, there are still some issues. I am trying to run an analysis on the libraries themselves as the targets. I believe that API graphs don't capture the calls inside the source itself?

Basically, I want to run a query on libraries as the source, where I want to identify the types of each of the parameters and then group them based on which ones use the same types for arguments. I agree that using a repo which uses the library might make sense, but then it might only use a limited set of the library's APIs.

Is there some possibility of using some sort of TypeTracking here? But I don't know what to use as the source. It seems hard to find generators for each type and then propagate them across the code.

tausbn commented 2 months ago

> That's a really great idea! Thanks @tausbn. However, there are still some issues. I am trying to run an analysis on the libraries themselves as the targets. I believe that API graphs don't capture the calls inside the source itself?

You are right that API graphs are not well suited for tracking things that are defined in the given codebase itself. For instance, if you have something like the following code

class A:
    ...

def foo(x):
    ...

foo(A())

then API graphs will not be able to figure out that x could be an instance of A, because both x and A are in the same file. If A was imported, however, then it should find this instance. (Basically, as far as the API graphs are concerned, anything outside of the present file is an "external API", even if the implementation of said API is elsewhere in the same codebase.)

> Basically, I want to run a query on libraries as the source, where I want to identify the types of each of the parameters and then group them based on which ones use the same types for arguments. I agree that using a repo which uses the library might make sense, but then it might only use a limited set of the library's APIs.

I'm not entirely sure what you're trying to do based on your description. Maybe a small example would help?

> Is there some possibility of using some sort of TypeTracking here? But I don't know what to use as the source. It seems hard to find generators for each type and then propagate them across the code.

Sure, you could do it using type trackers. In fact, we already have type trackers for classes and class instances. With that in mind, perhaps the following code more accurately captures what you want:

import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.internal.DataFlowDispatch
import semmle.python.ApiGraphs
import python

predicate has_type(DataFlow::ParameterNode n, string type) {
  exists(API::Node a |
    a.getAValueReachableFromSource() = n and
    a = API::moduleImport(_).getAMember*().getReturn*() and
    type = a.toString()
  )
  or
  exists(Class c |
    n = classTracker(c) and type = c.getName()
    or
    n = classInstanceTracker(c) and type = c.getName() + " instance"
  )
}

Here, the classTracker and classInstanceTracker (which may be the one you're most interested in) will track "types" that come from the codebase itself, whereas the API graphs will track built-in types and ones defined in external dependencies.

Bear in mind that both of these are part of an internal API, and as such may change without warning.
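To show the difference between the two trackers, here is a small Python sketch with made-up names (A, takes_class, takes_instance). Going by how the predicate builds its strings, I would expect cls to be reported as "A" and obj as "A instance", but that is an assumption about this snippet rather than something I have verified.

```python
class A:
    pass

def takes_class(cls):
    # `cls` receives the class object itself: the classTracker case.
    return cls()

def takes_instance(obj):
    # `obj` receives an instance: the classInstanceTracker case.
    return type(obj).__name__

made = takes_class(A)        # passes the class A
name = takes_instance(A())   # passes an instance of A
print(type(made).__name__, name)
```

Everything here lives in one file, which is exactly the situation where the API graph half of the predicate is not expected to help and the trackers have to do the work.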

R3x commented 2 months ago

Here is a better explanation of what I want to achieve. I am using CodeQL to extract information about Python libraries as part of a pipeline. The information I want is basically as follows:

I am running these on the libraries themselves; the assumption is that there are enough test cases that these exported APIs are called at some point with specific types passed to them. This information is later parsed to group the APIs together (based on types), perform other analysis, and generate statistics.

I have CodeQL queries to extract all the information about functions, methods, call graphs, etc., but I am not able to get type information for each of the parameters, attributes, and return values. Annotated type hints are a bit rare.
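The grouping step might look roughly like the sketch below. The rows are hypothetical and only stand in for (function, parameter, type) tuples exported from a query; the names in them and the helper group_by_type are invented for illustration and are not real results.

```python
from collections import defaultdict

# Hypothetical query output: (function, parameter, inferred type).
rows = [
    ("lib.save", "conn", "Connection instance"),
    ("lib.load", "conn", "Connection instance"),
    ("lib.register", "handler_cls", "Handler"),
    ("lib.load", "cache", "Cache instance"),
]

def group_by_type(rows):
    """Group function names by the type seen on any of their parameters."""
    groups = defaultdict(set)
    for function, _param, type_name in rows:
        groups[type_name].add(function)
    # Sort for stable, readable output.
    return {t: sorted(funcs) for t, funcs in groups.items()}

grouped = group_by_type(rows)
for type_name, functions in grouped.items():
    print(type_name, "->", functions)
```

A set is used per type so that a function appearing with the same type on several parameters is only counted once.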


That aside, the type tracking query does seem to give me a lot of class object types, and gives me some leads on improving it, so that's really helpful. Thank you!

It seems like there's a limitation in identifying basic datatypes such as List, str, etc. But this is a good start (I have managed to create some queries to do 1, 2 and 3, for the class types at least), and I will play around with the API to see what else I can do!