**Closed** · eatpk closed this 7 months ago
Note: this was only applied to the loader part. Before flattening, the structure was `Dict[str, Dict[str, tensor.Tensor]]`, and after flattening it is `List[tensor.Tensor]`. This simplifies the data structure, but requires the metadata `paths: List[Tuple[str, str]]`, which maps each index of the list to its specific `(module_name, log_type)` pair.
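A minimal sketch of the mapping described above (plain Python containers stand in for `tensor.Tensor`, and the helper names are hypothetical, not the actual loader API):

```python
from typing import Any, Dict, List, Tuple


def flatten_logs(
    nested: Dict[str, Dict[str, Any]]
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Collapse Dict[module_name -> Dict[log_type -> tensor]] into a flat
    list, recording `paths` so index i maps to (module_name, log_type)."""
    tensors: List[Any] = []
    paths: List[Tuple[str, str]] = []
    for module_name, log_types in nested.items():
        for log_type, tensor in log_types.items():
            paths.append((module_name, log_type))
            tensors.append(tensor)
    return tensors, paths


def unflatten_logs(
    tensors: List[Any], paths: List[Tuple[str, str]]
) -> Dict[str, Dict[str, Any]]:
    """Rebuild the nested dict from the flat list using the paths metadata."""
    nested: Dict[str, Dict[str, Any]] = {}
    for (module_name, log_type), tensor in zip(paths, tensors):
        nested.setdefault(module_name, {})[log_type] = tensor
    return nested
```

Because `paths` preserves insertion order, the round trip is lossless.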
Performance Note:
```python
analog = AnaLog(project="test")
analog.watch(model)
analog.setup({"log": "grad", "save": "grad", "statistic": "kfac"})

times = []
start_log = time.time()
# Influence Analysis
log_loader = analog.build_log_dataloader()
for log in log_loader:
    e1, e2 = log
times.append(time.time() - start_log)
```
I ran code like the above for a 3-layer MNIST model (784×512 + 512×256 + 256×10). The average time over 100 runs of the non-flattened version was 2.3 seconds, whereas the average over 100 runs of the flattened version was 2.1 seconds. Not much of an improvement compared to before.. 😭
This is a small change for the data loader, but the difference may grow as the model size gets bigger. Also, combined with https://github.com/sangkeun00/analog/issues/77, we can align the other parts of the code, such as `buffer_write` etc., to be flattened into a list instead of dicts.
Thanks for your effort! This is a great starting point. Here are a few comments.
1. Where `FlattenContext` is initiated: currently, the ingredients for `FlattenContext` are first defined in one place, and are then used to initialize `FlattenContext` in `analog.py`. However, I think `FlattenContext` can be defined during `watch`, even when `flatten` is not enabled (since this is not a big overhead). The rationale behind this is that `FlattenContext` is conceptually a global thing that needs to be defined only once, just like `watch`, whereas `update` is called multiple times during the logging process. I believe grouping the global operations together is generally a good idea?
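A toy sketch of that ordering, assuming a hypothetical `AnaLogSketch` class (the class bodies are illustrative, not the actual `analog` implementation): the one-time `watch` builds the context, and the repeated `update` only consumes it:

```python
from typing import List, Tuple


class FlattenContext:
    """Holds the fixed (module_name, log_type) ordering for flattened logs."""

    def __init__(self, paths: List[Tuple[str, str]]):
        self.paths = list(paths)

    def index_of(self, module_name: str, log_type: str) -> int:
        return self.paths.index((module_name, log_type))


class AnaLogSketch:
    def __init__(self):
        self.flatten_context = None

    def watch(self, module_names, log_types=("grad",)):
        # One-time, global setup: define the context here regardless of
        # whether flatten is enabled -- the overhead is negligible.
        paths = [(m, t) for m in module_names for t in log_types]
        self.flatten_context = FlattenContext(paths)

    def update(self):
        # Called many times during logging; it only *uses* the context,
        # never (re)creates it.
        assert self.flatten_context is not None
```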
2. `paths_shape`: if my review is correct, the only place where `paths_shape` is used in your refactored code is in `log_loader.py`. I think it's fine to load the flattened tensor as it is, instead of recovering its original shape. Actually, this is probably more desirable (at the cost of modifying `InfluenceFunction`), as it can improve GPU parallelization/utilization during influence computations. In detail:
   1. `test_log` will be extracted without flattening.
   2. `precondition` will be applied to the unflattened `test_log`.
   3. `test_log` will be flattened using `FlattenContext`.
   4. Influence will be computed between the flattened `test_log` and the loaded `train_log`.

This is also related to your last comment about the relatively modest improvement in data IO. By flattening logs, we improve not only data IO efficiency but also influence computation efficiency. In conclusion, we can completely remove `paths_shape`; we only need to remember the order of `paths`, which can be done during `watch`.
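This pipeline could be sketched roughly as follows (NumPy stands in for torch; `flatten_row`, `influence_scores`, and the `precondition` callable are illustrative assumptions, not the actual `InfluenceFunction` API):

```python
import numpy as np


def flatten_row(per_module_logs):
    """Concatenate each module's log into one long row vector, following
    a fixed (module_name, log_type) order."""
    return np.concatenate([x.reshape(-1) for x in per_module_logs])


def influence_scores(test_logs, precondition, train_flat):
    # 1. test_log is extracted per module, without flattening.
    # 2. precondition is applied to each unflattened per-module log.
    preconditioned = [precondition(x) for x in test_logs]
    # 3. the preconditioned test_log is flattened into one vector.
    test_flat = flatten_row(preconditioned)   # shape: (n*m,)
    # 4. a single matmul against the flattened train_log batch gives
    #    all scores at once, with no per-module loop.
    return train_flat @ test_flat             # shape: (batch,)
```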
Let me know what you think!
Acknowledged. I can update this to be initialized during or before the watch process.
I found that the dot-product performance of (n, 1)-dimensional vectors may not be better than that of (p, q)-dimensional vectors (when p*q = n). I ran those multiple times, and it seems the time difference is within the margin of error.
So I tried to preserve the shape, as I wanted to minimize the logic change and just build the database as an interface. However, if you think we can save only the flattened numbers, that would make the code much cleaner, I presume.
Hence, I think what I can do is flatten `test_log` using `FlattenContext`, making its shape `[1, n*m]`, while `train_log` would be loaded as `[16, n*m]` (batch size = 16).
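With those shapes, a single matmul produces all batch scores at once. A toy NumPy check (the shapes follow the comment above; the values and `n*m = 8` are made up for illustration):

```python
import numpy as np

n_times_m = 8                         # placeholder for the flattened log width
test_log = np.ones((1, n_times_m))    # flattened test log: [1, n*m]
train_log = np.ones((16, n_times_m))  # loaded train batch: [16, n*m]

# One matmul yields the scores for the whole batch: shape (16, 1).
scores = train_log @ test_log.T
```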
I am not too worried about converting into a single vector since, as I mentioned in point 2, I am trying to make the "data" an interface. So the data can be in whichever shape, and we can just build an interface between the data input and the output from the vector DB. And as you mentioned, since the data is already stored as a flattened array, we only need to keep track of which part corresponds to which module (with the `paths` variable) to regenerate what we have saved. So I acknowledge this comment.
> I found that the (n,1) dimension vectors' dot product performance may not be better than (p,q) dimension vectors' dot products (when p*q = n). I ran those multiple times and seems like the time range is within the margin of error.
This is generally true. What I meant is that since we don't compute the dot product separately for each module (iterating through a for loop), but do it only once with the global flattened vector, this may be more efficient in utilizing GPU FLOPs. In the context of deep neural networks, maximally utilizing GPU FLOPs is crucial for achieving high throughput.
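A small NumPy check of the equivalence (the per-module arrays are made up for illustration): summing per-module dot products gives the same score as one dot product over the globally flattened vectors, but the latter runs as a single fused operation instead of a Python-level loop:

```python
import numpy as np

# Hypothetical per-module logs for one test sample and one train sample.
per_module_test = [np.array([1.0, 2.0]), np.array([3.0])]
per_module_train = [np.array([4.0, 5.0]), np.array([6.0])]

# Per-module loop: one small dot product per module, summed.
looped = sum(float(t @ s) for t, s in zip(per_module_test, per_module_train))

# Global flattened vectors: a single dot product.
flat_test = np.concatenate(per_module_test)
flat_train = np.concatenate(per_module_train)
fused = float(flat_test @ flat_train)

assert looped == fused  # 1*4 + 2*5 + 3*6 = 32 either way
```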
@sangkeun00 ready for review
Overall, I believe this PR is ready to be merged! Thanks a ton for your effort.
First task: add `flatten`, plus some refactoring on config.