Open am0o0 opened 1 year ago
Hi @amammad. Thanks for opening this issue. I'll relay this to the responsible team.
While there may be an issue with namespace packages, I don't think the behaviour you're seeing is because of this.
The problem here is the way Python handles imports when invoked from the same directory as the file being executed.
Let's say the directory layout is as follows:
main
├── a1.py
├── a2.py
├── b.py
└── logging.py
With the following file contents
a1.py
:
print("entering a1.py")
import b
import logging
print("leaving a1.py")
a2.py
:
print("entering a2.py")
import main.b
import logging
print("leaving a2.py")
b.py
:
print("entering b.py")
print("leaving b.py")
logging.py
:
print("entering logging.py")
print("leaving logging.py")
(Basically, a1.py
is the situation you describe above, and a2.py
is a variation where the import of b
is qualified with the package name.)
Now, if we run python a1.py
from within the main
directory, we get the following output:
entering a1.py
entering b.py
leaving b.py
entering logging.py
leaving logging.py
leaving a1.py
We get the same input if we invoke it as a module: python -m a1
.
On the other hand, both python a2.py
and python -m a2
result in a ModuleNotFoundError
for main
.
From outside the main
directory, we can also try to invoke a1
and a2
as scripts and modules respectively.
Doing python main/a1.py
results in the same output as before, but this time around python -m main.a1
gives us a ModuleNotFoundError
for b
!
Likewise, python main/a2.py
fails as before, but python -m main.a2
now gives the following output:
entering a2.py
entering b.py
leaving b.py
leaving a2.py
Note in particular that the local copy of logging
is not imported. What is imported instead is the logging
module from the standard library!
Okay, so what's the point of all this? Well, because the semantics of imports changes drastically depending on how the code is run, our analysis has to make some assumptions about how the code is run.
In particular, since the vast majority of Python code is not run as scripts, we assume that imports should be resolved as if the code was run as a module.
This is why we get the behaviour you describe above.
There are some ways to convince the analysis to treat the code as a script, however. You could
#!
line referencing the python
interpreter as the first line of the file, orif __name__ == "__main__": ...
block to the file, for instance.In that case, our analysis should be able to resolve the imports correctly.
Ultimately all of this gets resolved in a rather gnarly piece of code.
It was interesting information. So why are you making these assumptions? Is it possible to take some additional parameters from users when they are running the query on their code bases? I Found some CVEs that CodeQL could easily find the vulnerability and it couldn't because of this problem. I think if it is possible for maintainers to specify the execution entry point path, it would be great?
It was interesting information. So why are you making these assumptions?
Primarily to avoid false positives due to modules being resolved incorrectly. In the case with the local version of logging.py
, for instance, we don't want to resolve this to two different modules, so we have to decide on one of them.
Is it possible to take some additional parameters from users when they are running the query on their code bases?
I Found some CVEs that CodeQL could easily find the vulnerability and it couldn't because of this problem. I think if it is possible for maintainers to specify the execution entry point path, it would be great?
This is a possibility, but not one we're actively pursuing.
What we are considering is whether or not we should resolve import b
to b.py
in the case where it otherwise wouldn't resolve to anything. Doing so would likely make things work in the above case. The import of logging
would still resolve to the standard library module, however (since we want modules to resolve uniquely).
Alternatively, we could extend the heuristics for what constitutes a potential entry point, e.g. looking for the presence of a sufficient amount of stateful code at the top level.
Thanks for good explanations. Should we close this issue ?
I'm fine with the issue staying open. It might be helpful to others encountering the same issue (until we merge a fix).
add these files under same main directory
a.py
b.py
Run the following simple taint tracking path query
You can see that flow can't go into
b.cmd_exec
froma.py
@jketema on the GHSL slack said that if I change the directory structure as follows:It works, and I tested it, and it has worked for me, and I could track
cmd
toexec_mymodule
as the final Sink.