pntt3011 / mediapipe_face_iris_cpp

Real-time Face and Iris Landmarks Detection using C++
GNU General Public License v3.0

Output vector not same as mediapipe #17

Closed vinittjain closed 1 year ago

vinittjain commented 2 years ago

Firstly, very nice implementation. However, I am facing some issues understanding the output of your work. I ran through your face detection code, and the output of your model is (896, 4), whereas the original output from mediapipe is (896, 16). Can you please tell me how to get the additional keypoints associated with the face? Also, where should I make changes in your code if I want my output to be exactly the same as mediapipe's? Thank you!

pntt3011 commented 2 years ago

Hi @vinittech,

In the decodeBox function of DetectionPostProcess.cpp, rawBoxes is a vector with length NUM_BOXES * NUM_COORD (896 * 16).

The first four positions are the center x, center y, width and height of the box. The remaining 12 positions are the x, y coordinates of 6 keypoints. You can get them with rawBoxes[boxOffset + 4 .. 15], rescale them (just like cx and cy) and adjust the struct Detection to hold the results.
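A rough sketch of what that extraction could look like (Keypoint, anchorCx/anchorCy and inputSize below are just placeholders for illustration, not the actual identifiers in the repo):

#include <vector>

// Placeholder type for an (x, y) landmark.
struct Keypoint { float x, y; };

// Decode the 6 facial keypoints stored at positions 4..15 of one box record.
std::vector<Keypoint> decodeKeypoints(const std::vector<float>& rawBoxes,
                                      int boxOffset,
                                      float anchorCx, float anchorCy,
                                      float inputSize) {
    std::vector<Keypoint> keypoints;
    keypoints.reserve(6);
    for (int k = 0; k < 6; ++k) {
        // Rescale by the model input size and shift by the anchor center,
        // the same way cx and cy are handled.
        float x = rawBoxes[boxOffset + 4 + 2 * k] / inputSize + anchorCx;
        float y = rawBoxes[boxOffset + 4 + 2 * k + 1] / inputSize + anchorCy;
        keypoints.push_back({x, y});
    }
    return keypoints;
}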

vinittjain commented 2 years ago

I see, well noted. Also, I can only get about 8-10 FPS on CPU without XNNPack. Is your performance measured on GPU, or on CPU with XNNPACK enabled?

pntt3011 commented 2 years ago

My performance is measured on CPU with XNNPACK enabled. The tensorflowlite lib in release is also compiled with XNNPACK delegate.

I'm sorry if the description is misleading.

vinittjain commented 2 years ago

I see, but shouldn't you include the XNNPACK delegate when you initialize your interpreter in the ModelLoader.cpp? Link
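Something along these lines, using the standard TFLite XNNPACK delegate API (the applyXnnpack helper and where exactly it would go in ModelLoader.cpp are just an illustration, not code from this repo):

#include <iostream>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Attach the XNNPACK delegate to an already-built interpreter.
// Call this before AllocateTensors(); the returned delegate must outlive
// the interpreter and be released with TfLiteXNNPackDelegateDelete().
TfLiteDelegate* applyXnnpack(tflite::Interpreter& interpreter, int numThreads) {
    TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
    options.num_threads = numThreads;

    TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&options);
    if (interpreter.ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
        std::cerr << "Failed to apply XNNPACK delegate" << std::endl;
    }
    return delegate;
}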

pntt3011 commented 2 years ago

Have you tried it and got any performance improvement? When I run my code, the log prints something like "Running XNNPACK delegate", so I think it is enabled by default.

vinittjain commented 2 years ago

Ah, I don't get the "Running XNNPACK delegate" log. Maybe I have to compile it from source. But as far as I can tell, the performance is very slow on the CPU (about 8 FPS).

pntt3011 commented 2 years ago

@vinittech where did you get the tensorflowlite lib?

vinittjain commented 2 years ago

I am running your code on an Ubuntu machine, so I got the tensorflowlite.so from here

pntt3011 commented 2 years ago

You can try downloading the lib from here. I downloaded mine from there as well; it is the same repo you mentioned above.

vinittjain commented 2 years ago

Let me try this one out! Thank you for all your help!

vinittjain commented 1 year ago

I ran your code and everything works fine on CPU, but I still cannot get it to work on CPU with XNNPack support. However, I have come across another issue. In your model_loader.cpp, the original code is


std::vector<float> my::ModelLoader::loadOutput(int index) const {
    if (isIndexValid(index, 'o')) {

        // Note: bytes is the buffer size in bytes, but it is used here
        // as the element count of the float vector.
        int n = m_outputs[index].bytes;
        std::vector<float> inference(n);

        memcpy(&(inference[0]), m_outputs[index].data, n);
        return inference;
    }
    return std::vector<float>();
}

When I execute your code, my output vector is 4 times the size of the original model output. For instance, if the model output is (896, 1), I get a vector of (3584, 1), which is 4 times the original size. I am not sure why this happens. To work around it, I modified your code to divide the total byte size of the buffer by the size of a float. Please note that I am running your models on plain CPU, not on CPU with XNNPack enabled.


std::vector<float> my::ModelLoader::loadOutput(int index) const {
    if (isIndexValid(index, 'o')) {

        int n = m_outputs[index].bytes;          // buffer size in bytes
        int size = sizeof(float);
        std::vector<float> inference(n / size);  // number of float elements

        memcpy(&(inference[0]), m_outputs[index].data, n);
        return inference;
    }
    return std::vector<float>();
}
pntt3011 commented 1 year ago

Hi @vinittech, could you tell me which model you are running? If you mean the FaceDetection model, I don't think its original output size is (896, 1). That is the number of anchor boxes, and each anchor box has more than 1 value (x, y, w, h and 6 (x, y) main facial landmarks), so the output should be (896 x 16, 1). Correct me if I am wrong because I do not have my laptop with me right now.
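To make the layout concrete, the regressor output is one flat buffer indexed per anchor; NUM_BOXES and NUM_COORD are the constants mentioned earlier for DetectionPostProcess.cpp, and the boxOffset helper is only for illustration:

// 896 anchors, each with 16 floats: cx, cy, w, h + 6 keypoints * (x, y).
constexpr int NUM_BOXES = 896;
constexpr int NUM_COORD = 16;

// The flat regressor buffer therefore holds NUM_BOXES * NUM_COORD = 14336 floats,
// and the record for anchor i starts at this offset.
inline int boxOffset(int i) { return i * NUM_COORD; }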

vinittjain commented 1 year ago

Hi @pntt3011, you are right. The face detector model has 2 outputs: a regressor of shape (896 x 16, 1) and a classifier of shape (896, 1). However, when I run your code, I get a regressor vector of (896 x 16 x 4, 1) and a classifier vector of (896 x 4, 1), i.e. the output vectors are 4 times larger than they should be. Taking the classifier as an example, only the first 896 elements hold values; the remaining (896 x 3) are zero. So the model is giving the correct output, it's just that the output vector is 4x the required size, and I have to divide the vector size by sizeof(float) to retrieve only the first 896 elements. Similarly, for the regressor, the model gives correct values for the first (896 x 16, 1) elements and the remaining (896 x 16 x 3, 1) are zero.

All the models work fine; it's just that the output vector is 4 times the required size. Dividing the vector size by sizeof(float) solves the issue, but it doesn't make sense to me why it works this way.

pntt3011 commented 1 year ago

@vinittech, oh, I completely forgot about the classifier score. Thank you for pointing that out. I will re-check the model on my computer later when I get home. When I coded that part it ran just fine, so I assumed the size was correct.

vinittjain commented 1 year ago

Sure, thanks for the quick replies. Since the remaining values are zeros anyway, you probably just missed it, but please check the size of your output vector and let me know. It doesn't matter much in practice, but I want to know if it happens for you as well. If it does, we can just update the code; otherwise I will have to figure out why the issue exists only for me.

pntt3011 commented 1 year ago

Hello @vinittech, I have re-checked the model on my computer. Just like yours, my output vector is also 4 times the expected size. However, thanks to the sizeof(float) you mentioned, I noticed the cause in the loadOutput function: bytes is the buffer size in bytes, not the number of float elements, so the vector should be sized with bytes / sizeof(float), exactly as in your fix.
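A quick way to see the relationship with the raw TfLiteTensor from the TFLite C API (the numElements helper below is only an illustration, not code from the repo):

#include <cstddef>
#include "tensorflow/lite/c/common.h"   // TfLiteTensor

// Number of elements in a tensor, computed from its dims.
// For the classifier output this gives 896, while tensor->bytes is 3584,
// i.e. bytes == numElements * sizeof(float) for a float32 tensor.
size_t numElements(const TfLiteTensor* tensor) {
    size_t count = 1;
    for (int i = 0; i < tensor->dims->size; ++i) {
        count *= static_cast<size_t>(tensor->dims->data[i]);
    }
    return count;
}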

vinittjain commented 1 year ago

@pntt3011 thanks for looking into it. I have no more concerns, so I'll close this issue.