**Closed** · eatpk closed this 7 months ago
Note: this was only applied to the loader part. Before flattening, the structure was `Dict[str, Dict[str, tensor.Tensor]]`, and after flattening it is `List[tensor.Tensor]`. This simplifies the data structure, but requires the metadata `paths: List[Tuple[str, str]]`, which maps each index of the list to its specific `(module_name, log_type)` pair.
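A minimal sketch of the mapping described above (plain Python containers stand in for `tensor.Tensor`, and the helper names are hypothetical, not the actual loader API):

```python
from typing import Any, Dict, List, Tuple


def flatten_logs(
    nested: Dict[str, Dict[str, Any]]
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Collapse Dict[module_name -> Dict[log_type -> tensor]] into a flat
    list, recording `paths` so index i maps to (module_name, log_type)."""
    tensors: List[Any] = []
    paths: List[Tuple[str, str]] = []
    for module_name, log_types in nested.items():
        for log_type, tensor in log_types.items():
            paths.append((module_name, log_type))
            tensors.append(tensor)
    return tensors, paths


def unflatten_logs(
    tensors: List[Any], paths: List[Tuple[str, str]]
) -> Dict[str, Dict[str, Any]]:
    """Rebuild the nested dict from the flat list using the paths metadata."""
    nested: Dict[str, Dict[str, Any]] = {}
    for (module_name, log_type), tensor in zip(paths, tensors):
        nested.setdefault(module_name, {})[log_type] = tensor
    return nested
```

Because `paths` preserves insertion order, the round trip is lossless.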
Performance Note:
```python
analog = AnaLog(project="test")
analog.watch(model)
analog.setup({"log": "grad", "save": "grad", "statistic": "kfac"})

times = []
start_log = time.time()
# Influence Analysis
log_loader = analog.build_log_dataloader()
for log in log_loader:
    e1, e2 = log
times.append(time.time() - start_log)
```
I ran code like the above for a 3-layer MNIST model (784×512 + 512×256 + 256×10). The average time over 100 runs of the non-flattened version was 2.3 seconds, whereas the average over 100 runs of the flattened version was 2.1 seconds. Not much of an improvement compared to before.. 😭
This is a small change for the data loader, but the difference may grow as the model size gets bigger. Also, combined with https://github.com/sangkeun00/analog/issues/77, we can align the other parts of the code, such as `buffer_write` etc., to be flattened into a list instead of dicts.
Thanks for your effort! This is a great starting point. Here are a few comments.
1. Where `FlattenContext` is initiated: currently, the ingredients for `FlattenContext` are first defined in one place, and are then used to initialize `FlattenContext` in `analog.py`. However, I think `FlattenContext` can be defined during `watch`, even when `flatten` is not enabled (since this is not a big overhead). The rationale behind this is that `FlattenContext` is conceptually a global thing that needs to be defined only once, just like `watch`, whereas `update` is called multiple times during the logging process. I believe grouping the global operations together is generally a good idea?
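A toy sketch of that ordering, assuming a hypothetical `AnaLogSketch` class (the class bodies are illustrative, not the actual `analog` implementation): the one-time `watch` builds the context, and the repeated `update` only consumes it:

```python
from typing import List, Tuple


class FlattenContext:
    """Holds the fixed (module_name, log_type) ordering for flattened logs."""

    def __init__(self, paths: List[Tuple[str, str]]):
        self.paths = list(paths)

    def index_of(self, module_name: str, log_type: str) -> int:
        return self.paths.index((module_name, log_type))


class AnaLogSketch:
    def __init__(self):
        self.flatten_context = None

    def watch(self, module_names, log_types=("grad",)):
        # One-time, global setup: define the context here regardless of
        # whether flatten is enabled -- the overhead is negligible.
        paths = [(m, t) for m in module_names for t in log_types]
        self.flatten_context = FlattenContext(paths)

    def update(self):
        # Called many times during logging; it only *uses* the context,
        # never (re)creates it.
        assert self.flatten_context is not None
```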
2. `paths_shape`: if my review is correct, the only place where `paths_shape` is used in your refactored code is in `log_loader.py`. I think it's fine to load the flattened tensor as it is, instead of recovering its original shape. Actually, this is probably more desirable (at the cost of modifying `InfluenceFunction`), as it can improve GPU parallelization/utilization during influence computations. In detail:
   1. `test_log` will be extracted without flattening.
   2. `precondition` will be applied to the unflattened `test_log`.
   3. `test_log` will be flattened using `FlattenContext`.
   4. Influence will be computed between the flattened `test_log` and the loaded `train_log`.

This is also related to your last comment about the relatively modest improvement in data IO. By flattening logs, we improve not only data IO efficiency but also influence computation efficiency. In conclusion, we can completely remove `paths_shape`; we only need to remember the order of `paths`, which can be done during `watch`.
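This pipeline could be sketched roughly as follows (NumPy stands in for torch; `flatten_row`, `influence_scores`, and the `precondition` callable are illustrative assumptions, not the actual `InfluenceFunction` API):

```python
import numpy as np


def flatten_row(per_module_logs):
    """Concatenate each module's log into one long row vector, following
    a fixed (module_name, log_type) order."""
    return np.concatenate([x.reshape(-1) for x in per_module_logs])


def influence_scores(test_logs, precondition, train_flat):
    # 1. test_log is extracted per module, without flattening.
    # 2. precondition is applied to each unflattened per-module log.
    preconditioned = [precondition(x) for x in test_logs]
    # 3. the preconditioned test_log is flattened into one vector.
    test_flat = flatten_row(preconditioned)   # shape: (n*m,)
    # 4. a single matmul against the flattened train_log batch gives
    #    all scores at once, with no per-module loop.
    return train_flat @ test_flat             # shape: (batch,)
```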
Let me know what you think!
Acknowledged. I can update this to be initialized during or before the watch process.
I found that the dot-product performance of (n, 1)-dimensional vectors may not be better than that of (p, q)-dimensional vectors (when p*q = n). I ran those multiple times, and it seems the time difference is within the margin of error.
So I tried to preserve the shape, as I wanted to minimize the logic change and just build the database as an interface. However, if you think we can save only the flattened numbers, that would make the code much cleaner, I presume.
Hence, I think what I can do is flatten `test_log` using `FlattenContext`, making its shape `[1, n*m]`, while `train_log` would be loaded as `[16, n*m]` (batch size = 16).
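With those shapes, a single matmul produces all batch scores at once. A toy NumPy check (the shapes follow the comment above; the values and `n*m = 8` are made up for illustration):

```python
import numpy as np

n_times_m = 8                         # placeholder for the flattened log width
test_log = np.ones((1, n_times_m))    # flattened test log: [1, n*m]
train_log = np.ones((16, n_times_m))  # loaded train batch: [16, n*m]

# One matmul yields the scores for the whole batch: shape (16, 1).
scores = train_log @ test_log.T
```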
I am not too worried about converting into a single vector since, as I mentioned in point 2, I am trying to make the "data" an interface. So the data can be in whichever shape, and we can just build an interface between the data input and the output from the vector DB. And as you mentioned, since the data is already stored as a flattened array, we only need to keep track of which part corresponds to which module (with the `paths` variable) to regenerate what we have saved. So I acknowledge this comment.
> I found that the (n,1) dimension vectors' dot product performance may not be better than (p,q) dimension vectors' dot products (when p*q = n). I ran those multiple times and seems like the time range is within the margin of error.
This is generally true. What I meant is that since we don't compute the dot product separately for each module (iterating through a for loop), but do it only once with the global flattened vector, this may be more efficient in utilizing GPU FLOPs. In the context of deep neural networks, maximally utilizing GPU FLOPs is crucial for achieving high throughput.
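A small NumPy check of the equivalence (the per-module arrays are made up for illustration): summing per-module dot products gives the same score as one dot product over the globally flattened vectors, but the latter runs as a single fused operation instead of a Python-level loop:

```python
import numpy as np

# Hypothetical per-module logs for one test sample and one train sample.
per_module_test = [np.array([1.0, 2.0]), np.array([3.0])]
per_module_train = [np.array([4.0, 5.0]), np.array([6.0])]

# Per-module loop: one small dot product per module, summed.
looped = sum(float(t @ s) for t, s in zip(per_module_test, per_module_train))

# Global flattened vectors: a single dot product.
flat_test = np.concatenate(per_module_test)
flat_train = np.concatenate(per_module_train)
fused = float(flat_test @ flat_train)

assert looped == fused  # 1*4 + 2*5 + 3*6 = 32 either way
```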
@sangkeun00 ready for review
Overall, I believe this PR is ready to be merged! Thanks a ton for your effort.
First task: add `flatten`, plus some refactoring on config.