python taint tracking doesn't work with namespace packages properly

am0o0 commented 1 year ago

add these files under same main directory a.py

import cgi
import b
req = cgi.FieldStorage()
ssh = b.cmd_exec(req.getvalue('cmd'))

b.py

def cmd_exec(cmd):
    from mymodule import exec_mymodule
    cmd_output = exec_mymodule("bach"," -c",cmd)
    return cmd_output

Run the following simple taint tracking path query

/**
 * @name myTaint
 * @description myTaint
 * @kind path-problem
 * @problem.severity error
 * @security-severity 9.3
 * @sub-severity high
 * @precision high
 * @id py/myTaint
 * @tags security
 *       external/cwe/cwe-094
 *       external/cwe/cwe-095
 *       external/cwe/cwe-116
 */
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.dataflow.new.RemoteFlowSources
import DataFlow::PathGraph
import semmle.python.ApiGraphs
class Configuration extends TaintTracking::Configuration {
  Configuration() { this = "Configuration"}

  override predicate isSource(DataFlow::Node source) {
    source = API::moduleImport("cgi").getMember("FieldStorage").getReturn().asSource()
  }

  override predicate isSink(DataFlow::Node sink) {
      any()
  }
  }

  from Configuration config, DataFlow::PathNode source, DataFlow::PathNode sink
  where config.hasFlowPath(source, sink)
  select sink.getNode(), source, sink, "This code execution depends on a $@.", source.getNode(),
    "user-provided value"

You can see that flow can't go into b.cmd_exec from a.py @jketema on the GHSL slack said that if I change the directory structure as follows:

code
 |- a.py
 |- pack
     |- __init__.py
     |- b.py

It works, and I tested it, and it has worked for me, and I could track cmd to exec_mymodule as the final Sink.

jketema commented 1 year ago

Hi @amammad. Thanks for opening this issue. I'll relay this to the responsible team.

tausbn commented 1 year ago

While there may be an issue with namespace packages, I don't think the behaviour you're seeing is because of this.

The problem here is the way Python handles imports when invoked from the same directory as the file being executed.

Let's say the directory layout is as follows:

main
├── a1.py
├── a2.py
├── b.py
└── logging.py

With the following file contents

a1.py:

print("entering a1.py")
import b
import logging
print("leaving a1.py")

a2.py:

print("entering a2.py")
import main.b
import logging
print("leaving a2.py")

b.py:

print("entering b.py")
print("leaving b.py")

logging.py:

print("entering logging.py")
print("leaving logging.py")

(Basically, a1.py is the situation you describe above, and a2.py is a variation where the import of b is qualified with the package name.)

Now, if we run python a1.py from within the main directory, we get the following output:

entering a1.py
entering b.py
leaving b.py
entering logging.py
leaving logging.py
leaving a1.py

We get the same input if we invoke it as a module: python -m a1.

On the other hand, both python a2.py and python -m a2 result in a ModuleNotFoundError for main.

From outside the main directory, we can also try to invoke a1 and a2 as scripts and modules respectively. Doing python main/a1.py results in the same output as before, but this time around python -m main.a1 gives us a ModuleNotFoundError for b!

Likewise, python main/a2.py fails as before, but python -m main.a2 now gives the following output:

entering a2.py
entering b.py
leaving b.py
leaving a2.py

Note in particular that the local copy of logging is not imported. What is imported instead is the logging module from the standard library!

Okay, so what's the point of all this? Well, because the semantics of imports changes drastically depending on how the code is run, our analysis has to make some assumptions about how the code is run.

In particular, since the vast majority of Python code is not run as scripts, we assume that imports should be resolved as if the code was run as a module.

This is why we get the behaviour you describe above.

There are some ways to convince the analysis to treat the code as a script, however. You could

add a #! line referencing the python interpreter as the first line of the file, or
add an if __name__ == "__main__": ... block to the file, for instance.

In that case, our analysis should be able to resolve the imports correctly.

Ultimately all of this gets resolved in a rather gnarly piece of code.

am0o0 commented 1 year ago

It was interesting information. So why are you making these assumptions? Is it possible to take some additional parameters from users when they are running the query on their code bases? I Found some CVEs that CodeQL could easily find the vulnerability and it couldn't because of this problem. I think if it is possible for maintainers to specify the execution entry point path, it would be great?

tausbn commented 1 year ago

It was interesting information. So why are you making these assumptions?

Primarily to avoid false positives due to modules being resolved incorrectly. In the case with the local version of logging.py, for instance, we don't want to resolve this to two different modules, so we have to decide on one of them.

Is it possible to take some additional parameters from users when they are running the query on their code bases?

I Found some CVEs that CodeQL could easily find the vulnerability and it couldn't because of this problem. I think if it is possible for maintainers to specify the execution entry point path, it would be great?

This is a possibility, but not one we're actively pursuing.

What we are considering is whether or not we should resolve import b to b.py in the case where it otherwise wouldn't resolve to anything. Doing so would likely make things work in the above case. The import of logging would still resolve to the standard library module, however (since we want modules to resolve uniquely).

Alternatively, we could extend the heuristics for what constitutes a potential entry point, e.g. looking for the presence of a sufficient amount of stateful code at the top level.

am0o0 commented 1 year ago

Thanks for good explanations. Should we close this issue ?

tausbn commented 1 year ago

I'm fine with the issue staying open. It might be helpful to others encountering the same issue (until we merge a fix).

github / codeql

python taint tracking doesn't work with namespace packages properly #11780