More of a question than an issue, really. I was curious: if I'm understanding correctly, the network predicts offsets for each anchor box, which in turn describe a bounding box. This requires lots of conversions (cxcy to xy, encoding, decoding), so would it not be possible to simply train the network to output [xmin, ymin, xmax, ymax] directly instead of [offset-x, offset-y, width, height]? If not, what are the issues with this?
In the same vein, are the encoding and decoding of bounding boxes only necessary because we need to convert between the offsets and the bounding boxes they describe?
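For concreteness, here is a minimal sketch of the conversions I'm referring to (assuming the standard SSD-style center-offset parameterization; the function names are mine, and the variance/scaling factors some implementations apply to the offsets are omitted):

```python
import torch

def xy_to_cxcy(xy):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h)."""
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,    # box centers
                      xy[:, 2:] - xy[:, :2]], dim=1)  # widths and heights

def cxcy_to_xy(cxcy):
    """(cx, cy, w, h) -> (xmin, ymin, xmax, ymax)."""
    return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,   # top-left corners
                      cxcy[:, :2] + cxcy[:, 2:] / 2], dim=1)  # bottom-right corners

def encode(boxes_cxcy, anchors_cxcy):
    """Ground-truth boxes -> regression targets (offsets) relative to anchors."""
    return torch.cat([(boxes_cxcy[:, :2] - anchors_cxcy[:, :2]) / anchors_cxcy[:, 2:],
                      torch.log(boxes_cxcy[:, 2:] / anchors_cxcy[:, 2:])], dim=1)

def decode(offsets, anchors_cxcy):
    """Predicted offsets -> boxes in (cx, cy, w, h)."""
    return torch.cat([offsets[:, :2] * anchors_cxcy[:, 2:] + anchors_cxcy[:, :2],
                      torch.exp(offsets[:, 2:]) * anchors_cxcy[:, 2:]], dim=1)
```

So the pipeline, as I understand it, is: ground truth goes through `xy_to_cxcy` then `encode` to produce training targets, and at inference the predictions go through `decode` then `cxcy_to_xy` to recover corner coordinates. My question is whether the network could skip this and regress the corner coordinates directly.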