More of a question than an issue, really. I was curious: if I'm understanding correctly, the network predicts offsets for each anchor box, which in turn describe a bounding box. This requires lots of conversions (cxcy to xy, encoding, decoding), so would it not be possible to simply train the network to output [xmin, ymin, xmax, ymax] directly instead of [offset-x, offset-y, width, height]? If not, what are the issues with this?
In the same vein, are the encoding and decoding of bounding boxes only necessary because we need to convert between the offsets and the bounding boxes they describe?
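For concreteness, here is a minimal sketch of the conversions I'm referring to (assuming the standard SSD-style center-offset parameterization; the function names are mine, and the variance/scaling factors some implementations apply to the offsets are omitted):

```python
import torch

def xy_to_cxcy(xy):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h)."""
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,    # box centers
                      xy[:, 2:] - xy[:, :2]], dim=1)  # widths and heights

def cxcy_to_xy(cxcy):
    """(cx, cy, w, h) -> (xmin, ymin, xmax, ymax)."""
    return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,   # top-left corners
                      cxcy[:, :2] + cxcy[:, 2:] / 2], dim=1)  # bottom-right corners

def encode(boxes_cxcy, anchors_cxcy):
    """Ground-truth boxes -> regression targets (offsets) relative to anchors."""
    return torch.cat([(boxes_cxcy[:, :2] - anchors_cxcy[:, :2]) / anchors_cxcy[:, 2:],
                      torch.log(boxes_cxcy[:, 2:] / anchors_cxcy[:, 2:])], dim=1)

def decode(offsets, anchors_cxcy):
    """Predicted offsets -> boxes in (cx, cy, w, h)."""
    return torch.cat([offsets[:, :2] * anchors_cxcy[:, 2:] + anchors_cxcy[:, :2],
                      torch.exp(offsets[:, 2:]) * anchors_cxcy[:, 2:]], dim=1)
```

So the pipeline, as I understand it, is: ground truth goes through `xy_to_cxcy` then `encode` to produce training targets, and at inference the predictions go through `decode` then `cxcy_to_xy` to recover corner coordinates. My question is whether the network could skip this and regress the corner coordinates directly.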