mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Assorted Issues #1331

Closed: vedantroy closed this issue 2 years ago

vedantroy commented 2 years ago

Environment

  1. loss should specify micro_batch instead of batch
  2. Cannot log multiple losses; allow me to return a dictionary of losses
  3. Cannot log things at batch level (not micro-batch level) inside of loss method
  4. (Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works
  5. (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set
hanlint commented 2 years ago

Thanks @vedantroy for reporting these, we'll take a look at each carefully.

For #5 above, you should see something in the logs (e.g.):

INFO:composer.trainer.trainer:CUDA out of memory detected.
Gradient Accumulation increased from 1 -> 2, and the batch
will be retrained.

Let us know if that's not the case!
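
If it still doesn't show up, one thing to check is that INFO-level logging is enabled for composer's loggers; a minimal snippet using standard Python logging (just a sketch, with the logger name taken from the message above):

    import logging

    # Make INFO-level messages from composer's internal loggers
    # (e.g. composer.trainer.trainer) visible on the console.
    logging.basicConfig()
    logging.getLogger("composer").setLevel(logging.INFO)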

vedantroy commented 2 years ago

Thanks @vedantroy for reporting these, we'll take a look at each carefully.

For #5 above, you should see something in the logs (e.g.):

INFO:composer.trainer.trainer:CUDA out of memory detected.
Gradient Accumulation increased from 1 -> 2, and the batch
will be retrained.

Let us know if that's not the case!

I do not see that 😔, which is why I added it to the list of issues.

hanlint commented 2 years ago

Sounds good, tagging @mvpatel2000 for auto grad accum.

Curious what the use case is for logging the loss within the microbatch. Was this to debug an error? Ideally, microbatching is just an implementation detail, and for convergence purposes you can just look at the batch loss.

vedantroy commented 2 years ago

@hanlint My trainer has this:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return mse_loss + vb_loss

I want to log both mse_loss and vb_loss; right now composer only logs the final sum. Ideally, I could return a dictionary like this:

return {"mse_loss": mse_loss, "vb_loss": vb_loss}

and composer would log both losses. This feature would be very useful for debugging, since I need to see how both of my losses change.

Right now, I can't do this. So to work around the issue, I do the following:

        self.logger.data_batch({"mse_loss": mse_loss, "vb_loss": vb_loss})
        return mse_loss + vb_loss

However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only keeps the data point from the last microbatch instead of doing some sort of reduction.

Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

See: [screenshot of the expected cosine decay schedule]

However, my learning rate stays flat: [screenshot]
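
For reference, a minimal sketch of how the scheduler is wired into the Trainer (the model, optimizer, and dataloader names below are placeholders, not my actual setup):

    from composer import Trainer
    from composer.optim import CosineAnnealingScheduler

    # Placeholder model / optimizer / dataloader -- only the scheduler wiring matters here.
    trainer = Trainer(
        model=my_composer_model,
        train_dataloader=my_train_dataloader,
        optimizers=my_optimizer,
        schedulers=[CosineAnnealingScheduler()],
        max_duration="10ep",
    )
    trainer.fit()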

ravi-mosaicml commented 2 years ago

Hi there! Thanks for raising these issues -- appreciate it!

loss should specify micro_batch instead of batch

Cannot log things at batch level (not micro-batch level) inside of loss method

Can you elaborate? If I understand correctly, you would like to be able to access the loss from each microbatch?

Cannot log multiple losses; allow me to return a dictionary of losses

Noted, we'll see if we can allow custom types (e.g. dictionaries) for the loss output. We do support a tuple for the loss function output. In the meantime, you can do something like:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return (mse_loss, vb_loss)

However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only keeps the data point from the last microbatch instead of doing some sort of reduction.

This is likely because the optimization step (rather than the microbatch step) is given to wandb.log as the step argument (https://docs.wandb.ai/guides/track/log#stepwise-and-incremental-logging). We do not support metric logging at the microbatch level, as the number of microbatches should not affect convergence. As a workaround, you could log each microbatch to a separate plot.
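
For example, a rough sketch of that workaround (the running microbatch counter and key names below are only illustrative, and assume a fixed grad_accum):

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        grad_accum = 2  # placeholder: assumes a fixed number of microbatches per batch
        # Cycle an index over the microbatches of each batch so every microbatch logs
        # under its own key (its own wandb plot) instead of overwriting one key per batch.
        self._mb_count = getattr(self, "_mb_count", -1) + 1
        idx = self._mb_count % grad_accum
        self.logger.data_batch({
            f"mse_loss/microbatch_{idx}": mse_loss,
            f"vb_loss/microbatch_{idx}": vb_loss,
        })
        return mse_loss + vb_loss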

(Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works

Can you share the stack trace? CC @mvpatel2000

(Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

Noted, we'll make sure the print statement is always visible (currently it requires the log level to be configured correctly). CC @mvpatel2000.

Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

Noted, thanks! We'll take a look.

mvpatel2000 commented 2 years ago

Thanks for bringing this up! Feedback is super helpful.

(Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!

(Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works

If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

abhi-mosaic commented 2 years ago

Hey @vedantroy , I'm having a bit of trouble reproducing the LR issue. Could you try running it again and printing the logs to console via ProgressBarLogger(progress_bar=False, log_to_console=True, log_level='BATCH') and report back what the raw float values for lr are?
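
i.e. something like this (a sketch; only the logger arguments matter, the rest is whatever your Trainer already uses):

    from composer.loggers import ProgressBarLogger

    # Logs raw metric values (including the learning rate) to the console every batch.
    console_logger = ProgressBarLogger(progress_bar=False, log_to_console=True, log_level="BATCH")

    # ...then pass loggers=[console_logger] to your existing Trainer.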

I'm trying to figure out if there is something missing in my testing, or if it's just WandB's display of logged float values being truncated... e.g., in the early stage of a cosine decay, the true value might be 0.9999912, but it's getting displayed as 1.0.

vedantroy commented 2 years ago

@abhi-mosaic, I turned on ProgressBarLogger, but I did not find any difference in the console output.

vedantroy commented 2 years ago

Thanks for bringing this up! Feedback is super helpful.

(Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!

(Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works

If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

Ahhh, I can't reproduce the bug anymore, as the codebase has evolved and I don't know which version caused that particular bug. :/

vedantroy commented 2 years ago

I don't care about logging things at the micro-batch level. I was saying that the parameter name batch in the loss method should be changed to micro_batch to be more descriptive of what it actually is.

@ravi-mosaicml

hanlint commented 2 years ago

@mvpatel2000 for the auto grad accum issue, it may be that we need to do a CUDA cache clear after the grad accum adjustment; otherwise the repeated restarts can cause memory fragmentation on the GPU. This can be triggered by increasing the image resolution so that a known small batch size just fits into memory, then greatly increasing the batch size to force multiple grad accum adjustments.
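
Roughly the idea (a sketch, not the actual Trainer code; the `state` object and handler name here are hypothetical):

    import torch

    def handle_cuda_oom(state):
        # Hypothetical handler: after a CUDA OOM, double the number of microbatches
        # and release cached allocator blocks so the retried forward pass starts
        # from unfragmented memory.
        state.grad_accum *= 2
        torch.cuda.empty_cache()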

mvpatel2000 commented 2 years ago

Noting here that we merged in cache clearing and are seeing far lower rates of memory fragmentation. It looks like this has basically resolved the auto grad accum issues, but please let us know if you are still seeing problems.

mvpatel2000 commented 2 years ago

Closing this because it seems resolved -- please feel free to reopen if you feel any point was not addressed.