wala / ML

Eclipse Public License 2.0

Module initialization code not reflected in importing script #202

Closed khatchad closed 2 months ago

khatchad commented 3 months ago

When a module is imported, the code inside __init__.py for that module is executed. The file is meant for module initialization code, but it can contain arbitrary code. In Java, the closest analogue is a static initializer, which is executed when a class is first initialized.
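As a minimal sketch of this behavior, the following builds a throwaway package and imports it twice. The package name demo_init_pkg, the init.log file, and the VALUE attribute are hypothetical names chosen for this illustration only; they do not come from the project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, used only for this illustration.
PKG = "demo_init_pkg"

root = Path(tempfile.mkdtemp())
(root / PKG).mkdir()
(root / PKG / "__init__.py").write_text(
    "from pathlib import Path\n"
    "# Arbitrary code: record every execution of this file.\n"
    "with open(Path(__file__).with_name('init.log'), 'a') as log:\n"
    "    log.write('ran\\n')\n"
    "VALUE = 42\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

first = importlib.import_module(PKG)   # executes __init__.py
second = importlib.import_module(PKG)  # served from sys.modules, no re-execution

# One line per execution of __init__.py.
runs = (root / PKG / "init.log").read_text().splitlines()
```

The second import returns the cached module object, so the initialization code runs once, and whatever it bound (here VALUE) is visible to the importer as an attribute of the module.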

The results of the initialization code are supposed to be made available to the importing script. Currently, as a result of #163, we add artificial code that loads the subpackages so that they are available to importing scripts. However, we do nothing with the real code in __init__.py, which may itself be importing the subpackages explicitly.

Suppose we have the following code in a script:

# tests/GNN/nodes_graph_classfication/train_gcn.py
from nlpgnn.models import GCNLayer

Currently, we are expecting that:

  1. nlpgnn/models/__init__.py is empty.
  2. GCNLayer.py exists in nlpgnn/models.

Each of these is false. There is no such file GCNLayer.py in nlpgnn/models. Furthermore, nlpgnn/models/__init__.py is non-empty; in fact, it has the following code:

# nlpgnn/models/__init__.py
from .GCN import *

In nlpgnn/models/GCN.py we then have class GCNLayer:

# nlpgnn/models/GCN.py
class GCNLayer(tf.keras.Model):
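The layout above can be reproduced at runtime with a small stand-in. This is a sketch under stated assumptions: the package is named demo_nlpgnn (hypothetical) so it cannot clash with a real installation, the top-level __init__.py is assumed to be empty, and GCNLayer is a plain class rather than a tf.keras.Model so that the example has no TensorFlow dependency.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the directory layout with a hypothetical package name.
root = Path(tempfile.mkdtemp())
models = root / "demo_nlpgnn" / "models"
models.mkdir(parents=True)
(root / "demo_nlpgnn" / "__init__.py").write_text("")  # assumed empty
(models / "__init__.py").write_text("from .GCN import *\n")
(models / "GCN.py").write_text("class GCNLayer:\n    pass\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Same shape as the import in train_gcn.py.
from demo_nlpgnn.models import GCNLayer
import demo_nlpgnn.models as models_pkg
import demo_nlpgnn.models.GCN as gcn_module

has_layer_file = (models / "GCNLayer.py").exists()  # no such file exists
same_class = GCNLayer is gcn_module.GCNLayer        # re-exported by __init__.py
submodule_bound = models_pkg.GCN is gcn_module      # GCN itself is also bound
```

The import succeeds without any GCNLayer.py because the star import in __init__.py binds GCNLayer on the package. The submodule GCN is bound on the package as well, which is the one name the IR below does make available.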

In the IR for nlpgnn/models/__init__.py, we make the name GCN available:

callees of node __init__.py : []

IR of node 42, context CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@82 ]
<Code body of function Lscript nlpgnn/models/__init__.py>
...
0   global:global script nlpgnn/models/__init__.py = v1 <no information>
...
21   v13 = global:global script nlpgnn/models/GCN.py   __init__.py [46->660] (line 5)
22   putfield v1.< PythonLoader, LRoot, GCN, <PythonLoader,LRoot> > = v13   __init__.py [46->660] (line 5)

But, we don't do that for GCNLayer. Instead, what happens is that the explicit import drops off the face of the earth:

130   v272 = global:global script nlpgnn/models/GCN.py   __init__.py [46->660] (line 5) [272=[*]]
131   v274 = fieldref v272.v259:#*           __init__.py [46->660] (line 5) [274=[*]272=[*]]

In other words, v1 is stored in a global and is available to any importing script, while v274 isn't. Consequently, the import code in this file is never reflected in the importer. So, in the IR of tests/GNN/nodes_graph_classfication/train_gcn.py:

callees of node train_gcn.py : [import, range, zip, GradientTape, EarlyStopping, MaskAccuracy, MaskCategoricalCrossentropy, trampoline3, trampoline4, trampoline4, trampoline4, trampoline4, trampoline4, gradient]

IR of node 31, context CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@60 ]
<Code body of function Lscript tests/GNN/nodes_graph_classfication/train_gcn.py>
...
102   v268 = global:global script nlpgnn/models/__init__.py   train_gcn.py [2:0] -> [56:64] [268=[GCNLayer]]
103   v270 = fieldref v268.v254:#GCNLayer    train_gcn.py [2:0] -> [56:64] [270=[GCNLayer]268=[GCNLayer]]

We correctly assign v268, but the points-to set of v270 is empty:

[Node: <Code body of function Lscript tests/GNN/nodes_graph_classfication/train_gcn.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@60 ], v270] --> []
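For reference, the runtime effect of from .GCN import * that the analysis does not currently reflect can be modeled in plain Python. This is only a sketch of the language semantics, not WALA code and not a proposed patch; the helpers star_import_names and apply_star_import are hypothetical names. The rule it encodes is Python's own: a star import binds the names listed in __all__ if the source module defines it, and otherwise every name that does not start with an underscore.

```python
import types


def star_import_names(module):
    """Return the names that `from module import *` would bind."""
    explicit = getattr(module, "__all__", None)
    if explicit is not None:
        return list(explicit)
    # Without __all__, every name not starting with an underscore is public.
    return [name for name in vars(module) if not name.startswith("_")]


def apply_star_import(source, target):
    """Bind each public name of `source` as an attribute of `target`."""
    for name in star_import_names(source):
        setattr(target, name, getattr(source, name))


# Stand-ins for nlpgnn/models/GCN.py and nlpgnn/models/__init__.py.
gcn = types.ModuleType("GCN")
exec("class GCNLayer:\n    pass\n_private = 1\n", gcn.__dict__)
models = types.ModuleType("models")

apply_star_import(gcn, models)
```

After the call, models.GCNLayer refers to the class defined in gcn, which is the binding that the field read of GCNLayer in train_gcn.py would need to see for v270 to be non-empty.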