izmailovpavel / understandingbdl


Killed process when using SWAG #25

Closed airton-neto closed 1 year ago

airton-neto commented 1 year ago

Hi there,

I wonder if you could help me with SWAG usage.

Right now I'm using the implementation in this repo, but I've also tried the one in https://github.com/wjmaddox/swa_gaussian. It was giving me predictive distributions that were too narrow, so I switched to this one.

Here is my current network, which I pre-trained using an RMSE loss (the training phase ran fine).

self.mlp = nn.Sequential(
    nn.Linear(40544, 1024),  # input features -> hidden
    nn.Tanh(),
    nn.Dropout(p=0.3),
    nn.Linear(1024, 1024),
    nn.Tanh(),
    nn.Dropout(p=0.3),
    nn.Linear(1024, 112),    # hidden -> 112 outputs
)

I am having an issue running SWAG on the trained model. My Python process keeps getting killed, possibly because it is running out of memory.

I wonder if you could help me get around this. One idea I had was to change the code slightly to collect only the last layer's parameters; I haven't tried this yet.
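
To make that idea concrete, here is a rough sketch of what I had in mind (my own code, not from this repo, and the names are just placeholders): keep diagonal-SWAG running moments only for the final nn.Linear layer instead of for the whole network.

import torch
import torch.nn as nn

def flatten_last_layer(model: nn.Sequential) -> torch.Tensor:
    """Flatten the parameters of the final layer (here nn.Linear(1024, 112))."""
    last = model[-1]
    return torch.cat([p.detach().reshape(-1) for p in last.parameters()])

class LastLayerSWAGMoments:
    """Running first/second moments for a diagonal SWAG posterior over the last layer only."""

    def __init__(self, model: nn.Sequential):
        flat = flatten_last_layer(model)
        self.n = 0
        self.mean = torch.zeros_like(flat)
        self.sq_mean = torch.zeros_like(flat)

    def collect(self, model: nn.Sequential) -> None:
        """Update the running moments; call once per SWAG collection step (e.g. each epoch)."""
        flat = flatten_last_layer(model)
        self.n += 1
        self.mean += (flat - self.mean) / self.n
        self.sq_mean += (flat ** 2 - self.sq_mean) / self.n

    def variance(self) -> torch.Tensor:
        """Diagonal variance estimate, clamped to stay non-negative."""
        return torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)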

I could also try SWAG as implemented here, in case that code is newer.

Could you give me some ideas?

Thanks very much for the help, and for the work on the papers, it was great!

izmailovpavel commented 1 year ago

Hey @airton-neto, sorry for a delayed response!

> It was giving me predictive distributions that were too narrow, so I switched to this one.

The implementations should actually be quite similar, so I am not quite sure why one works for you and the other doesn't.

In any case, it sounds like you are running out of RAM. Are you running training on the CPU rather than the GPU?

I would recommend running htop or something like that while you run the code, and looking at the memory usage. You should see that the memory increases until it runs out, and the process gets killed.

SWAG basically stores several copies of your model, which increases your memory usage. One way to decrease the memory footprint is to decrease the --max_num_models parameter in our training script here. By default it's 20, which adds 20 * model_size to your memory usage; by my estimate that is about 3.5 GB of extra memory compared to your standard training. You can try setting this parameter to 10 instead.
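
For reference, here is the back-of-the-envelope arithmetic behind that figure for the MLP you posted (a rough estimate of my own, assuming float32 parameters):

# Rough memory estimate for the three nn.Linear layers above, float32 parameters.
layer_shapes = [(40544, 1024), (1024, 1024), (1024, 112)]
n_params = sum(m * n + n for m, n in layer_shapes)       # weights + biases, ~42.7M parameters
bytes_per_copy = n_params * 4                            # float32 -> roughly 170 MB per snapshot
extra_gb = 20 * bytes_per_copy / 1024 ** 3               # max_num_models = 20 snapshots
print(f"{n_params:,} params, ~{extra_gb:.1f} GB extra")  # prints roughly 3.2 GB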

You can also try to decrease the batch size.

airton-neto commented 1 year ago

Hi there, thanks for getting back to me!

I wound up reducing the model complexity a little and it started working; it stopped running into memory problems.

Thanks for your help, I appreciate it.

Regards,