mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Make MosaicGPT MFU calculator more robust #272

Closed abhi-mosaic closed 1 year ago

abhi-mosaic commented 1 year ago

Rather than caching self.num_fwd_flops, we now cache only self.n_params, which is necessary to avoid FSDP sharding issues. The FLOPs are then calculated dynamically on every forward pass, for every batch, using that batch's max_seq_len.
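
A minimal sketch of the idea, not the actual code from this PR: the parameter count is stored once, and the FLOPs are recomputed from each batch's shape. The class name FlopsEstimator, its fields, and the formula itself (2 FLOPs per parameter per token, an attention term quadratic in sequence length, and a 3x factor for forward plus backward) are assumptions for illustration, based on the commonly used transformer approximation.

```python
from dataclasses import dataclass


@dataclass
class FlopsEstimator:
    """Hypothetical helper: caches only the parameter count."""

    n_params: int  # counted once up front (assumed: before FSDP shards the model)
    n_layers: int
    d_model: int

    def flops_per_batch(self, batch_size: int, seq_len: int) -> int:
        # Dense layers: ~2 FLOPs (multiply + add) per parameter per token.
        params_flops_per_seq = 2 * self.n_params * seq_len
        # Attention: two matmuls per layer, each ~2 * d_model * seq_len^2 FLOPs.
        attn_flops_per_seq = self.n_layers * 2 * 2 * self.d_model * seq_len**2
        # Factor of 3: forward pass plus a backward pass costing ~2x forward.
        return 3 * batch_size * (params_flops_per_seq + attn_flops_per_seq)


# Illustrative model size; these numbers are not taken from the PR.
est = FlopsEstimator(n_params=125_000_000, n_layers=12, d_model=768)

# The estimate follows the batch's actual sequence length instead of a cached value.
short = est.flops_per_batch(batch_size=8, seq_len=512)
long = est.flops_per_batch(batch_size=8, seq_len=2048)
print(short, long)
```

Because the attention term grows quadratically with sequence length, a FLOPs value cached for one sequence length would misreport MFU for batches of a different length, which is why the estimate is recomputed per batch here.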