onvif / specs

ONVIF Network Interface Specifications

Analytics Service Spec does not provide enough information to interpret point-coordinate data #410

Closed: zachvictor closed this issue 5 months ago

zachvictor commented 8 months ago

Discussed in https://github.com/onvif/specs/discussions/409

Originally posted by **zachvictor** April 1, 2024

# Source aspect ratio and/or resolution needed for object coordinate data in video analytics metadata

- Issue 1. Many ONVIF-conformant devices do not observe the Analytics Service Spec with regard to coordinate data.
- Issue 2. The ONVIF Analytics Service Spec does not provide enough information to interpret point-coordinate data correctly. Coordinate data needs source resolution or aspect ratio.

## Context

Object detections in ONVIF Profile video analytics metadata follow the Analytics Service Specification.[^1] An Object may have a ShapeDescriptor containing a BoundingBox, CenterOfGravity, and/or Polygon containing point-coordinate data.[^2] Per the spec, spatial relations involving coordinates use a [-1, 1]-normalized coordinate system, which places (0, 0) at the origin with values increasing toward the top right (like the Cartesian plane).[^3] The spec further defines a Transformation type for transformations between coordinate systems.[^4] In the spec, the examples give coordinate data in pixels (presumably) and provide a Transformation to yield normalized coordinates. For example (the listing below is illustrative of the spec's pattern, not quoted verbatim; values are hypothetical for a 640x360 source, assuming normalized = pixel * scale + translate):[^5]

```xml
<tt:Frame UtcTime="2024-04-01T12:00:00.000Z">
  <!-- Hypothetical transform: maps 640x360 pixel coordinates into [-1, 1];
       the y scale is negative because pixel rows grow downward while the
       normalized y axis grows upward -->
  <tt:Transformation>
    <tt:Translate x="-1.0" y="1.0"/>
    <tt:Scale x="0.003125" y="-0.005556"/>
  </tt:Transformation>
  <tt:Object ObjectId="12">
    <tt:Appearance>
      <tt:Shape>
        <tt:BoundingBox left="20.0" top="30.0" right="100.0" bottom="80.0"/>
        <tt:CenterOfGravity x="60.0" y="55.0"/>
      </tt:Shape>
    </tt:Appearance>
  </tt:Object>
</tt:Frame>
```

## Issues

1. **Many ONVIF-conformant devices do not observe the Analytics Service Spec with regard to coordinate data.**

   1. **Coordinate data in pixels, no transformation.** Some devices listed as [ONVIF Conformant Products](https://www.onvif.org/conformant-products/) produce object analytics with coordinate data in pixel units, yet they do not provide a Transformation.[^6] Since the XML document does not contain a reference to the source resolution, it is impossible to know how to interpret the coordinate data. The ONVIF Profile M spec has exceedingly minimal requirements for Object classification "if supported".[^7] In consequence, such devices can be "conformant", yet their outputs fall short of the Analytics Service Spec, to the extent that they are unusable.
   Such devices need either (A) to include a Transformation or (B) to provide the resolution of the image used for object detection, a piece of information which, although desirable in this and other contexts, is not part of the ONVIF standard.

   2. **Coordinate data normalized, aspect ratio unknown.** Some devices listed as [ONVIF Conformant Products](https://www.onvif.org/conformant-products/) produce object analytics with coordinate data already [-1, 1]-normalized and no Transformation, yet they do not provide the aspect ratio of the source image, so it is impossible to interpret the coordinate data from the XML document alone. It may be that the image used to produce the object detection had a 4:3 aspect ratio while the profile used for recording or streaming is 16:9, but the client has no indication that the normalized coordinates in the object metadata refer to a different aspect ratio. This produces corrupt coordinates when one attempts to render, say, bounding boxes on frames of the streaming video, because the transformation from 4:3 to 16:9 crops portions of the top and bottom ("letterbox"), requiring an attendant transformation of the coordinate data to omit or clamp coordinates from the cropped area and to scale coordinates in the remaining area. As with the previous issue, such devices can be "conformant", yet their outputs fall short of the Analytics Service Spec, to the extent that they are unusable without "outside knowledge."[^8] A Transformation will not help here. One needs the aspect ratio (or resolution) of the image used for object detection, which again is not part of the ONVIF standard.

2. **The ONVIF Analytics Service Spec does not provide enough information to interpret point-coordinate data correctly. Coordinate data needs source resolution or aspect ratio.** Even if the metadata provides a Transformation, without the source image's resolution or aspect ratio there is not enough information to interpret the coordinates correctly.
   Pixel units refer to a source resolution; normalized units refer to a source aspect ratio. So the standard is lacking: given the metadata and video streams of a device that observes the Analytics Service Spec fully, beyond the minimal requirements for conformance, the ONVIF standard still does not provide enough information to interpret point-coordinate data correctly.

   Example: ![Axis-aspect-ratio-coordinates-issue drawio](https://github.com/onvif/specs/assets/299594/62a9869b-22e6-4c0d-ac8c-091484f6d50b)

## References

[^1]: ONVIF™ Analytics Service Specification. Version 23.12. December 2023. [PDF](https://www.onvif.org/specs/srv/analytics/ONVIF-Analytics-Service-Spec.pdf)
[^2]: Ibid., 5.3.1 Objects (pp. 12–15) and 5.3.3 Shape descriptor (pp. 16–17).
[^3]: Ibid., 5.2.2 Spatial Relation (pp. 10–12).
[^4]: Ibid. I have studied ONVIF metadata produced by a number of devices from manufacturers including Axis, Hanwha, and i-PRO. I have not observed the use of the Transformation type in any of these ONVIF implementations. If you know of any implementation that uses the Transformation type, I would be very grateful for any information you could provide, especially the manufacturer, model, and an example XML document.
[^5]: Ibid., 5.3.1 Objects, p. 13, "Example".
[^6]: Search [ONVIF Conformant Products](https://www.onvif.org/conformant-products/) for, e.g., ~~Hanwha PND-A6081RV and Hanwha XND-8083RV~~. _Edited: strike; these are not valid examples._
[^7]: See ONVIF® Profile M Specification, Version 1.0, June 2021. 8.7 Object classification, 8.7.1 Device requirements (if supported). [PDF](https://www.onvif.org/wp-content/uploads/2021/06/onvif-profile-m-specification-v1-0.pdf)
[^8]: Search [ONVIF Conformant Products](https://www.onvif.org/conformant-products/) for, e.g., Axis Q1656-LE and Axis P3267-LVE. Axis ARTPEC-8 devices give coordinates already normalized, yet the frames used for object inference are (evidently) those of the stream characterized by their "Capture Mode": i.e., the lowest-level, highest-resolution stream, which, incidentally, their proprietary APIs describe with an aspect ratio, not a resolution. Even if the default Stream Profile is 16:9, if the Capture Mode is 4:3, then the normalized coordinates always refer to a 4:3 frame.
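The 4:3-versus-16:9 mismatch described in issue 1.2 can be sketched in a few lines of Python. This is only an illustration of the geometry, not anything from the ONVIF standard; the function name is mine, and it assumes the 16:9 view keeps the full 4:3 width and crops a centered band of its height.

```python
def map_4x3_to_16x9(x, y):
    """Map a [-1, 1]-normalized point from a 4:3 analytics frame onto a
    16:9 stream showing the same field of view cropped top and bottom.

    Assumption: the 16:9 view keeps the full 4:3 width, so only the
    central 0.5625/0.75 = 0.75 of the 4:3 height remains visible.
    Returns None for points that fall in the cropped (letterbox) area.
    """
    visible = 0.5625 / 0.75  # visible fraction of the 4:3 height (= 0.75)
    y_scaled = y / visible   # stretch the visible band back out to [-1, 1]
    if abs(y_scaled) > 1.0:
        return None          # point lies in the cropped region
    return (x, y_scaled)
```

A point at normalized y = 0.9 in the 4:3 frame, for instance, is dropped entirely, while a point at y = 0.75 lands exactly on the top edge of the 16:9 frame. Without knowing the analytics frame's aspect ratio, a client cannot apply (or even know it needs) this correction.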
HansBusch commented 6 months ago

Hi Victor, thanks for the detailed problem description.

The analytics normalized coordinate system is based on the encoded video image as defined by the VideoSource Bounds property.

Devices may stream shapes and bounding boxes in pixel coordinates as long as they provide a normalizing transform. If that is missing, the device does not adhere to the specification, even if it passes the test tool, which only checks the content of GetSupportedMetadata and not the actual streamed shape information.

Hope this addresses your question.
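A normalizing transform of the kind discussed above might be applied as follows. This is a sketch under assumptions: it takes the composition to be normalized = pixel * scale + translate (consult the Analytics Service Spec, 5.2.2, for the normative definition), and the 640x360 source is hypothetical.

```python
def apply_transformation(px, py, scale, translate):
    """Map a pixel coordinate into the [-1, 1] normalized system.

    Assumes the composition normalized = pixel * scale + translate;
    see the Analytics Service Spec (5.2.2) for the normative order.
    """
    return (px * scale[0] + translate[0], py * scale[1] + translate[1])

# Hypothetical 640x360 source; the y scale is negative because pixel rows
# grow downward while the normalized y axis grows upward.
scale = (2 / 640, -2 / 360)
translate = (-1.0, 1.0)

print(apply_transformation(0, 0, scale, translate))  # top-left -> (-1.0, 1.0)
```

Note that even with this transform in hand, the issue's second point stands: the scale factors encode the source dimensions only implicitly, and nothing ties them to the aspect ratio of the stream a client is actually rendering.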