Consistency of input for various algorithms contained in OpenDR

passalis commented 2 years ago

Different implementations might receive input in different formats in OpenDR toolkit. For example, the shape of the image input can be either NHWC or NCHW, where N is the number of images, H/W is the height/width of an image, and C denotes its channels. PyTorch defaults to NCHW, while TensorFlow defaults to NHWC. OpenDR should obviously select and stick with one choice to ensure the required homogeneity and transparency between different algorithms. Currently, (as far as I understand) there is no check performed to ensure that the image is supplied in the correct format. However, switching between different formats is easy, so existing implementations can be easily adjusted to account for this difference.

You can find a discussion regarding the benefits of the one or the other here. However, I think this is not relevant to our choice, since we are discussing the interface of the toolkit and not the actual implementation. Given that the two biggest toolkits (PyTorch and TensorFlow) have selected different formats, I think it is enough to select one and stick to our choice.

Apart from the issue regarding the NCHW/NHWC, we should also decide on the channel order (i.e., BGR vs RGB). Unfortunately, this is a much harder issue to detect and sometimes it can go unnoticed since models will correctly perform the feedfoward pass regardless of the used channel order, while in some cases they might even produce the quite good results (channel flip is actually a data augmentation technique!). So, a second choice that we should make is whether the default format will be RGB or BGR. Again, given that you know how the models have been trained, switching between the different formats is easy and we can easily ensure that all implementations receive the data as expected.

Therefore, we should: [ ] Decide between NHWC or NCHW for images [ ] Decide between RGB or BGR for images [ ] Researchers working on Videos/Timeseries and/or other kinds of input provide their opinion on relevant issues for the input.

My recommendation would be to have (N)CHW / BGR format for images, (N)TC format (T denotes time) for timeseries and (N)TCHW (?) / BGR format for videos (if this makes sense, otherwise we should stick to the most common native format to avoid conversions and the corresponding impact on the performance).

LukasHedegaard commented 2 years ago

I believe it would make sense to adopt one of the existing common standards. I do think that most perception algorithms in OpenDR use PyTorch. As far as I can see, the torchvision project uses channel order RGB (see line 251), NCHW for images and NCTHW for videos (see here).

In conclusion, I would vote use this standard, e.g. RGB, NCHW, and NCTHW.

vniclas commented 2 years ago

I agree with @LukasHedegaard to follow the torchvision conventions. I would further propose to implement from_file() functions for the various Data classes, e.g., to ensure that images are loaded as RGB.

passalis commented 2 years ago

I agree, given the discussion, I would also vote to fully follow PyTorch's conventions, given that most algorithms use PyTorch and this seems to accepted standard.

vniclas commented 2 years ago

How do we proceed to ensure that all of the implemented algorithms respect these conventions? It would probably make sense to discuss (or at least announce it) in one of the next meetings.

ad-daniel commented 2 years ago

Please check if you have ensured your tool adheres to the convention in Nikos' email. Namely:

Image: RGB ordering, NCHW Video: RGB ordering, NCTHW

N stands for batch size (if used) C stands for Channels T stands for Time dimension H stands for Height W stands for Width

Perception:

[x] Face Detection
- [x] RetinaFace
[ ] Face Recognition
- [ ] Face recognition
[x] Landmark based facial expression recognition
- [x] Landmark based facial expression recognition
[ ] Human Pose Estimation #162
- [ ] Lightweight open pose #162
[x] Video Activity Recognition
- [x] x3d
[x] Online Video Activity Recognition
- [x] cox3d
[x] Skeleton based Action Recognition
- [x] spatio temporal GCN
[x] Progressive Skeleton based Action Recognition
- [x] progressive spation temporal GCN
[x] Speech recognition
- [x] quadraticselfonn
- [x] matchboxnet
- [x] edgesspeechnets
[x] Heart Anomaly Detection
- [x] Neural bag-of-feature
- [x] Gated Recurrent Unit
[ ] Semantic Scene segmentation
- [ ] BiseNet
[x] Object detection 2d
- [x] Centernet
- [x] SSD
- [x] Yolov3
[ ] Efficient Panoptic Segmentation
- [ ] Panoptic segmentation
[ ] 3d Object Localization
- [ ] Voxel object detection 3d
[ ] Object Tracking 2d/3d
- [ ] Deep sort
- [ ] Fair mot
- [ ] ab3dmot
[ ] Multimodal Human Centric Perception
- [ ] RGBD hand gesture
[ ] Multimodal 2d object detection
- [ ] GEM
- [ ] Detr

Control:

[x] Mobile manipulation
[ ] Single Demostration Grasping

Planning:

[ ] End-to-End planner

Simulation:

[ ] Human Model Generation

passalis commented 2 years ago

I have just realized that Image.open(image_path) does not follow NCHW conventions. Therefore, most probably all existing tools already work in NHWC instead of NCHW (they already change the channel order internally).

LukasHedegaard commented 2 years ago

Frankly, I did not test this functionality in the tools, I integrated. The tested images were loaded in another manner without the need for dimension transposition. So changing Image.open(image_path) to use the agreed upon NCHW format, would actually remove a potential bug in my case 😅

vniclas commented 2 years ago

This might actually be a bit more confusing than we initially thought. The to_tensor() (torchvision), converting PIL images or NumPy arrays to a tensor, expects an HWC input and outputs CHW as expected by PyTorch. Thus, if we load images as CHW, we have to make people aware that they have to cast back to HWC before using the built-in to_tensor() function in their dataloaders.

vniclas commented 2 years ago

Just to repeat what I also said during the TMT Meeting, we'll have to adjust the OpenDR bridge as well since OpenCV assumes HWC.

passalis commented 2 years ago

I have opened #164 according to the discussion we had during the last meeting. Please note that I also added an opencv() method to easily obtaining the images into a format that can be used for visualization (check the already updated examples to see its usage). I will also update the documentation and update the rosbridge asap.

vniclas commented 2 years ago

Then we can probably close this issue now.

passalis commented 2 years ago

Thank you all for the hard work! I think the interface is much more consistent and easier to use now!

opendr-eu / opendr

Consistency of input for various algorithms contained in OpenDR #131