wala / ML

Eclipse Public License 2.0

Adding duplicate tensor data sources during dataset enumeration #141

Open khatchad opened 5 months ago

khatchad commented 5 months ago

I am seeing in the logs multiple SSA variables corresponding to a single Python variable being added as tensor dataset sources. This might be okay; dataset sources aren't really the same as tensor sources not stemming from datasets. In other words, the dataset holds the tensors, whereas in the non-dataset case, the tensors are generated from some API (e.g., tf.ones()).

But I'm unsure. It makes the test values look weird (why are there so many tensor variables in a function?). It may also cause confusion when we start tracking shapes/d-types for TF2 APIs. I don't think it would hurt to mark just the final SSA variable as a tensor dataset source (though one could question how we are representing dataset sources, particularly when it comes to tracking shapes; should they be separate?).

One could also ask why, in the SSA, multiple variables represent the same Python variable in the first place.

Example

This is what I am seeing:

# Test enumerate. The first element of the tuple returned isn't a tensor.

import tensorflow as tf

def f(a):
    pass

def g(a):
    pass

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

for step, element in enumerate(dataset, 1):
    f(step)
    g(element)
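For context on why the front end may emit more than one read of `element`, the tuple-unpacking loop above is roughly equivalent to an explicit use of the iterator protocol: each iteration first fetches the tuple from the iterator and then unpacks its fields, which can surface as separate SSA reads. The `desugared_enumerate` helper below is a hypothetical illustration of that desugaring, not the actual CAst translation:

```python
def desugared_enumerate(iterable, start=1):
    """Rough desugaring of: for step, element in enumerate(iterable, start).

    Hypothetical sketch for illustration; the real front end emits IR,
    not Python.
    """
    results = []
    it = iter(enumerate(iterable, start))
    while True:
        try:
            pair = next(it)       # the "a property name of"-style fetch
        except StopIteration:
            break
        step, element = pair      # field reads #0 and #1 of the tuple
        results.append((step, element))
    return results
```

The fetch and the unpack touch the same underlying tuple, which lines up with the IR above reading `element` once via `v274` and again via `v279`.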

That's the input. In the logs, there are two SSA variables representing element, namely, v278 and v282:

106   v265 = invokeFunction < PythonLoader, LCodeBody, do()LRoot; > v3,v249,v258:#1 @106 exception:v266tf2_test_dataset11.py [16:21] -> [16:42] [265=[temp 3]3=[enumerate]249=[dataset]]
BB3
107   v269 = new <PythonLoader,Ltuple>@107   tf2_test_dataset11.py [3:0] -> [18:14]
108   v271 = global:global step              tf2_test_dataset11.py [16:4] -> [16:8]
109   fieldref v269.v257:#0 = v271 = v271    tf2_test_dataset11.py [3:0] -> [18:14]
110   v273 = global:global element           tf2_test_dataset11.py [16:10] -> [16:17]
111   fieldref v269.v259:#1 = v273 = v273    tf2_test_dataset11.py [3:0] -> [18:14]
112   v274 = a property name of v265         <no information> [265=[temp 3]]
113   v276 = fieldref v274.v257:#0           tf2_test_dataset11.py [16:4] -> [16:8] [276=[step]]
115   v278 = fieldref v274.v259:#1           tf2_test_dataset11.py [16:10] -> [16:17] [278=[element]]
117   v267 = binaryop(ne) v268:#null , v274  tf2_test_dataset11.py [3:0] -> [18:14]
118   conditional branch(eq, to iindex=-1) v267,v257:#0tf2_test_dataset11.py [3:0] -> [18:14]
BB4
119   v280 = new <PythonLoader,Ltuple>@119   tf2_test_dataset11.py [3:0] -> [18:14]
120   fieldref v280.v257:#0 = v276 = v276    tf2_test_dataset11.py [3:0] -> [18:14] [276=[step]]
121   fieldref v280.v259:#1 = v278 = v278    tf2_test_dataset11.py [3:0] -> [18:14] [278=[element]]
122   v279 = fieldref v265.v280              tf2_test_dataset11.py [3:0] -> [18:14] [265=[temp 3]]
123   v281 = fieldref v279.v257:#0           tf2_test_dataset11.py [16:4] -> [16:8] [281=[step]]
125   v282 = fieldref v279.v259:#1           tf2_test_dataset11.py [16:10] -> [16:17] [282=[element]]

Producing the following corresponding logs:

[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v278]:[Empty].
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v282]:[Empty].

We really only need the second one, I think, unless there is some particular reason for having multiple SSA variables represent the same Python variable. But since v278 is never referenced again, I would say not.
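If we do end up keeping only the final SSA variable per source location, the deduplication could be as simple as keying the added sources by their source position and retaining the highest value number. A hypothetical sketch in Python (the names and the `(value_number, position)` representation are illustrative only; the actual logic would live in the Java driver):

```python
def dedupe_sources(sources):
    """Keep only the last SSA value number added for each source position.

    `sources` is a list of (value_number, source_position) pairs in the
    order they were added, e.g. [(278, "16:10-16:17"), (282, "16:10-16:17")].
    Hypothetical helper; not part of the actual analysis code.
    """
    latest = {}
    for value_number, position in sources:
        prev = latest.get(position)
        if prev is None or value_number > prev:
            latest[position] = value_number
    # Return the surviving (value_number, position) pairs.
    return sorted((v, p) for p, v in latest.items())
```

On the example above, this would collapse the v278/v282 pair down to just v282, since both map to the `element` position `[16:10] -> [16:17]`.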

khatchad commented 5 months ago

I think this is happening because we are (interprocedurally) processing two different kinds of SSA instructions: those corresponding to for-each statements and those corresponding to field reads:

[FINE] Processing instruction: 274 = a property name of 265.
[INFO] Using interprocedural analysis to find potential tensor iterable definition for use: 265 of instruction: 274 = a property name of 265.
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v278]:[Empty].
...
[FINE] Processing instruction: 279 = fieldref 265.280.
[INFO] Using interprocedural analysis to find potential tensor iterable definition for use: 265 of instruction: 279 = fieldref 265.280.
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v282]:[Empty].

I believe we did this to make the analysis more robust to different situations, i.e., there is some other situation where we read tensors from datasets but only one of these instructions pops up. But since dataset reads aren't considered "tensor generators," this might be OK. We do, however, add tensor data sources the same way in both cases. I'm unsure whether they need to be distinguished, but at least for now they don't seem to be causing a problem.