logix-project / logix

AI Logging for Interpretability and Explainability🔬
Apache License 2.0

Cannot run the CIFAR example. #108

Closed yaolu-zjut closed 2 months ago

yaolu-zjut commented 2 months ago

I hit a "torch.cuda.OutOfMemoryError" when running "CUDA_VISIBLE_DEVICES=3 python examples/cifar/compute_influences.py" (see the attached screenshot). How can I fix it? I already reduced the batch size from 512 to 128, but it did not help.

sangkeun00 commented 2 months ago

Hello @yaolu-zjut

This OOM error most likely occurred because args.lora is set to None by default (https://github.com/logix-project/logix/blob/main/examples/cifar/compute_influences.py#L14). If you don't turn this on, you are essentially computing per-sample gradients for all modules in your model (about 2.2M based on your screenshot). You can avoid the OOM by reducing the batch size further, but it will still incur significant storage costs. Overall, I recommend setting lora to random, for example as shown below. Let me know if you run into any other issues.
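Since the script reads args.lora, I assume it exposes a --lora command-line flag via its argument parser (please double-check the parser in the script); if so, the run becomes:

CUDA_VISIBLE_DEVICES=3 python examples/cifar/compute_influences.py --lora random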

Sang

yaolu-zjut commented 2 months ago

Hi Sang, it really works when I set lora to random. Thank you very much!

yaolu-zjut commented 2 months ago

Hi Sang, I ran into another problem when reproducing the BERT example with "python examples/bert/extract_log.py". It fails as shown in the attached screenshot. Is config.yaml missing? I cannot find it in the logix folder.

sangkeun00 commented 2 months ago

Hello @yaolu-zjut

Sorry about the missing config.yaml. Here is the config I used for the BERT example; I also use this config for most of my experiments.

root_dir: ./logix
logging:
  flush_threshold: 1000000000
  cpu_offload: false
lora:
  init: random  # or: pca
  rank: 64
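Save this as config.yaml and pass its path when initializing LogIX. A minimal sketch, following the logix.init(project, config=...) pattern from the examples (the project name and path here are placeholders):

import logix

# Initialize LogIX with the YAML config above; 'bert' is an
# arbitrary project name and the path is a placeholder.
run = logix.init('bert', config='./config.yaml')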

Let me know if you have any other questions!

yaolu-zjut commented 2 months ago

Thanks! That really helps.

yaolu-zjut commented 2 months ago

Hi Sang, I ran into another strange thing. When I adapt LogIX to my own dataset and run examples/bert/extract_log.py, it generates a large and continuously growing file under 'infl' (run = logix.init('infl', config=args.config_path); 'infl' is the project name), as shown in the attached screenshots. If I want to apply LogIX to a large dataset, what should I do?

sangkeun00 commented 2 months ago

Hello @yaolu-zjut,

The main philosophy behind LogIX and LoGra is to convert the influence function problem into a vector similarity search problem. Therefore, we propose saving all training gradients to disk once and simply reading them back at test time to compute influence scores. That said, as you increase the dataset size, the storage required for the saved training gradients also increases linearly. The most practical solution I can suggest here is simply upgrading your storage.
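To make that concrete, here is an illustrative sketch (not LogIX's actual API; file names and shapes are hypothetical) of what the test-time step reduces to once per-sample training gradients are on disk:

import numpy as np

# Hypothetical files: (N, d) projected per-sample training gradients
# and a (d,) gradient for one test query.
train_grads = np.load('train_grads.npy')
query_grad = np.load('query_grad.npy')

# Influence as similarity: one dot product per training example.
scores = train_grads @ query_grad

# Indices of the most influential training examples for this query.
top_k = np.argsort(scores)[::-1][:10]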

If you don't want to save your gradients to disk, then you need to recompute the training gradients every time you have a new query. You can also do this with LogIX (my collaborator did it at some point), but the code gets a bit longer and uglier.