melax / dsamples

depth camera samples

Classification stage with Hand tracking samples #2

Open miguelml99 opened 3 years ago

miguelml99 commented 3 years ago

I am using your hand tracking samples project (actually a forked version of it for the Intel RealSense D400) for my Bachelor's thesis, but I have encountered a problem. My idea was to build a hand gesture recognition system that could tell which gesture the system is receiving as input. I was hoping the output could provide a label with the gesture's name (or at least the name of the dataset it belongs to), just as in this "dsamples" project, but using the hand tracking system instead.

However, from what I have experienced with your hand-tracking project, the output resulting from the classification layer is a series of values describing mainly finger angles and hand orientation.

Is there a way the system could be trained so that the output of the classification stage provides the label of the gesture dataset (as if we wanted to find out which gesture category each input belongs to)? Maybe there is something I am missing and there is actually a way of doing it.

I have already generated several datasets for different hand poses using realtime-annotator.cpp, and I have also tried training the CNN with those datasets simultaneously (with train-cnn.cpp); however, I haven't yet found a way to extract those dataset labels from the depth image input of a hand gesture.

I would really appreciate any help on this topic. I'm kind of stuck at this step and have been working on it for several months now.

Thanks in advance

melax commented 3 years ago

Hi, if the set of gestures (hand poses) is visually distinct, then training directly from the images (the classic CNN example) is one appropriate way to go. As you are probably well aware already, this approach is probably fine for obvious gestures such as 1 vs 2 fingers held up, but it doesn't scale easily when more subtlety is required, such as determining which fingers are being held up or how they are held up, or if you want to be able to classify from different viewing directions or arm orientations.

So to classify such non-trivial gestures, it's a good idea to leverage hand pose information. If you try to integrate this as additional output of the classifier within a hand-tracking system, then it wouldn't benefit from the 3D model fitting stage of the hand-tracking pipeline. Instead, you could build a separate classifier that you apply after the hand tracking system outputs its information. For example, your neural network model could be a couple of fully connected layers that directly take the joint angles (output by the hand tracking system) as input and output the gesture category.

Data collection/labelling is often the hard work, which you've already done, so you can easily use the data you've already collected. For example, in the file datasets/abrir_y_cerr_dedos/hand_data_1.pose, each line is one potential training (or testing) sample. You can probably just use these raw numbers as-is for input. The expected output (label) for each of the inputs from this file is the one you want to be "abrir_y_cerr_dedos". To build this, have one output node for each gesture, and set the expected value of each output node to 1 if the input is from that gesture's dataset, and 0 otherwise.
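In code, that expected-output vector is just a one-hot encoding. A minimal sketch, assuming six gesture datasets indexed 0..5 and the std::vector<float> label format discussed in this thread (the helper name one_hot_label is hypothetical, not part of cnn.h):

#include <vector>

// Hypothetical helper: build the expected output (label) vector for one training sample.
// gesture_index is which dataset the sample came from, e.g. 0 for abrir_y_cerr_dedos.
std::vector<float> one_hot_label(int gesture_index, int num_gestures)
{
    std::vector<float> label(num_gestures, 0.0f);  // every output node expects 0 ...
    label[gesture_index] = 1.0f;                   // ... except the node for this gesture
    return label;
}

// e.g. one_hot_label(0, 6) -> {1,0,0,0,0,0} for a sample from the first dataset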

Does this help?

miguelml99 commented 3 years ago

Hi @melax, first of all, thank you very much for your response; it helped a lot. My thesis advisor and I think it's a really good idea to take the .pose information to generate a new CNN model. I have already tried to build such a system using a couple of fully connected layers as you suggested (I pushed it into the "Classifier" folder of my project).

Unfortunately, I am having some issues implementing the model. I used the raw values from each frame in the ".pose" files as the CNN input; however, when I printed them to check what the CNN was receiving, for some reason they didn't match the .pose file they were coming from. I suspect it has something to do with the "compress" method used while loading the file.

Regarding the "train" method in cnn.h, the first argument is just the vector with the raw values of each input, but what about the second? I thought it was the vector with the corresponding label of the current input (for example, for an input belonging to the first category, it will be [1,0,0,0,0,0], if we have 6 categories/nodes), however, the project won't compile under these conditions.

Additionally, I have seen that you used an OpenGL approach (glwin.h) for both of your CNN projects, but is it necessary to use one for training my model? And is it necessary to implement this kind of gesture recognition model in the same .cpp as the realtime-tracker if I want both things to run simultaneously in the future?

Lastly, regarding only your hand-tracking samples project, is it possible to train the CNN at different times with different datasets using train-cnn.cpp? I don't know if you have seen it, but I have tried to train several CNNs using the different datasets that I recorded. From what I have experienced, it is only possible to train a CNN in a single run, loading all desired datasets simultaneously. Also, in the readme file you suggest letting the CNN train overnight to see results, but how do I know when it's done? I am not sure what results to expect in this case; I let the training run for 20-30 minutes and barely noticed any difference between my CNNs and the one that was in the project by default.

Again, thank you very very much for taking the time to answer my comments, I really appreciate it. Sorry in advance if this comment is too long. Miguel.

miguelml99 commented 3 years ago

Hi @melax,

I hope I'm not bothering you. I have been able to solve most of the issues I wrote to you about last month. However, I'm having a bit of trouble with the last steps of my project development and could use some help.

I built a classifier based on a CNN that takes the pose values from the hand model fitting in the tracking pipeline and outputs a vector with the probability of each of the gesture categories. Once a CNN was trained this way, I loaded the generated CNN file into your "realtime_hand_tracker" project to test whether my model was working. When the CNN was trained with datasets of 2 hand gestures, the system worked perfectly. But when I increased the number of datasets during training (e.g. 3, 5, ...), the system stopped working and output illogical results when tested with the real-time tracker.

I have tried different pairs of gesture datasets, and when trained in pairs the system performs just fine, so the error is not in the datasets themselves, only in the number of datasets being processed. I suspect it has something to do with how I build the CNN, or maybe with the way I am adding the datasets to the system in the training process.

Please, if you have any idea what could be causing this error, do not hesitate to reach out to me. It is frustrating to get this far into the development of my system and then get stuck on such an error. I have cleaned up the code and made it more understandable before pushing it to my GitHub repository. In case you can take a look, this is the link to the main code of my system:
https://github.com/miguelml99/hand_tracking_samples_D400-master-Miguel/blob/main/Train-Classifier/Classifier.cpp

Thank you very much in advance, Miguel. 

melax commented 3 years ago

Hi, I think I see what might be causing your problem...

First, when we did this work there weren't any commonly available ML packages that worked on PC. There was Lua-based Torch, which was Unix and Mac only, and various open source offerings didn't have the layer types we needed, nor even the minimal CPU optimizations to make things fast enough. Hence the (perhaps overly compact) single-file cnn.h that you see here. For your ML effort, you could use this (as you did) or you could import your data into a different ML system; there are pros and cons to either approach. While small and easy to integrate, a disadvantage of using the cnn.h here is that there isn't a lot of Stack Overflow Q&A if you get stuck. Really sorry about that.

referring to the code from your .cpp file:

CNN baby_gestures_cnn() {
    CNN cnn({});
    cnn.layers.push_back(new CNN::LConv({ 12, 10, 1 }, { 3, 3, 1, 16 }, { 8, 6, 16 }));
    cnn.layers.push_back(new CNN::LActivation(8 * 6 * 16));
    cnn.layers.push_back(new CNN::LMaxPool(int3(8, 6, 16)));
    ...

A couple of things here... The convolution layer (the first layer added) has an input size of 12x10, a sliding window of 3x3, and an output size of 8x6. These numbers need to match up according to the following formula:

output size = input size - (window size -1)

The cnn.h should probably have had some assert statements that could have caught this issue. So either use 5x5 instead of 3x3 for the window, or use 10x8 instead of 8x6 for the output.
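Working it through with the formula above: a 3x3 window on a 12x10 input gives 12 - (3 - 1) = 10 and 10 - (3 - 1) = 8, i.e. a 10x8 output rather than 8x6; a 5x5 window gives 12 - 4 = 8 and 10 - 4 = 6, which matches the 8x6 already in the code.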

However, I'd suggest something completely different. You're not dealing with a large image. Your input is a vector of 120 floating point numbers. These are not spatially arranged in an image, and certainly not a 12x10 one, and you are not concerned with trying to generalize over 2D translations, so it doesn't make sense to use a sliding-window convolution layer. Your first layer should just be a fully connected layer (same as the later ones). Consequently, you won't need pooling layers either. Perhaps test with something really small and simple like:

cnn.layers.push_back(new CNN::LFull(120, 32));
cnn.layers.push_back(new CNN::LActivation(32));
cnn.layers.push_back(new CNN::LFull(32, 6));
cnn.layers.push_back(new CNN::LSoftMax(6));

See how well that works, or maybe add another layer and widen the number of internal nodes to see if the system learns better (hopefully without overfitting):

cnn.layers.push_back(new CNN::LFull(120, 128));
cnn.layers.push_back(new CNN::LActivation(128));
cnn.layers.push_back(new CNN::LFull(128, 32));
cnn.layers.push_back(new CNN::LActivation(32));
cnn.layers.push_back(new CNN::LFull(32, 6));
cnn.layers.push_back(new CNN::LSoftMax(6));

miguelml99 commented 3 years ago

Hi Stan @melax, thank you very much for your answer. I have been trying and testing what you suggested all day, since it seems like the best approach.

I tested it in two ways, and I came up with the following:

I am very happy to see that the "offline tester" works this well. However, it makes no sense that the realtime tracker does this badly in comparison with the offline tester.

It has to be something related to the way I am extracting the angles vector during the tracking simulation. This is how I was doing it:

for (int i = 0; i < htk.handmodel.GetPose().size(); i++) {
    for (int j = 0; j < 3; j++) {
        cnn_input.push_back(htk.handmodel.GetPose()[i].position[j]);
    }
    for (int j = 0; j < 4; j++) {
        cnn_input.push_back(htk.handmodel.GetPose()[i].orientation[j]);
    }
    cnn_input.push_back(0);
}
auto cnn_out = cnn2.Eval(cnn_input);

The complete code files are:
- realtime-tracker
- realtime-classifier (online tester)
- Train-classifier

I will keep trying to see what's causing the classifier to fail in the realtime tracker. If you have any idea what's going wrong, please tell me.

Thanks again, Miguel.

melax commented 3 years ago

For the offline tester, if you're not already doing so, try training with half the data and testing with the other half. Perhaps use even frames for training and odd frames (or just all frames) for testing.
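A minimal sketch of that split, assuming the samples and their labels are already loaded into parallel std::vectors (the helper name split_even_odd is hypothetical):

#include <utility>
#include <vector>

// Split a sequence of frames by parity: even-indexed frames for training,
// odd-indexed frames for testing.
template <typename T>
std::pair<std::vector<T>, std::vector<T>> split_even_odd(const std::vector<T>& frames)
{
    std::vector<T> train, test;
    for (size_t i = 0; i < frames.size(); ++i)
        (i % 2 == 0 ? train : test).push_back(frames[i]);
    return { train, test };
}

// usage: apply the same split to inputs and labels so they stay aligned
// auto [train_x, test_x] = split_even_odd(inputs);
// auto [train_y, test_y] = split_even_odd(labels);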

For the performance of the realtime tracker there are many possibilities. A couple of the most likely issues:

BTW, I don't have the camera you are using. Also, it looks like you're using a makefile setup and a non-PC system for compiling/running your implementation. I did look over the datasets you collected; unfortunately, I'm not set up with a system that can build and run what you have.

miguelml99 commented 3 years ago

@melax Yes, I was already using half the data for training and the rest in the offline tester, and the results were very accurate. 👍

It could be that my camera is not the best for short ranges. However, the recognition performs well when the hand is moved slowly (at least the simulation model seems to match the depth input). Also, my datasets and tracking CNN were recaptured with my camera, so any error it was making should have been reflected in these files as well, right?

However, I think the problem is not only the camera, because it does recognize fist and closed palm well enough, and when doing the rock hand gesture the predicted result is the open palm category, which does not make sense, but it does so fairly consistently every time.

About your second recommendation: why would the CNN get stuck on the first pose for the classification? I mean, shouldn't it be reading the whole vector? Anyway, I will try ruling out the wrist pose in the input vector; should I do the same with the palm?

Is it wrong that I am using a makefile setup? Don't worry about not having my camera. Your answers are being far beyond helpful; I really appreciate it. 💯

melax commented 3 years ago

why would the CNN get stuck on the first pose for the classification? I mean, shouldn't it be reading the whole vector?

The issue is the lack of coverage over the full range of possible inputs in the training set. Say all the closed-fist training samples had the back of the hand toward the camera, but all the open flat-hand samples had the palm toward the camera. The neural network may simply learn that wrist roll distinguishes fist from open hand and never bother to incorporate finger angles into its decision making. If it gets all the training samples correct, then it has learned what you asked it to learn.

Note that it is hard for the frame-to-frame physical-skeletal-hand-model tracking to let you capture and correctly track every bone in the hand skeleton from any arbitrary camera point of view. So if you limit yourself to a set of gestures that don't rely on (and don't use) wrist/hand orientation relative to the camera, then gathering a sufficiently complete dataset should be much easier.

should I do the same with the palm?

If none of your gestures rely on this, then yes, you could try removing (or zeroing) that part of the input. However, if the wrist-to-palm angle is required to distinguish between two gestures, then you'd probably need to leave that in. For example, I did notice one of your gestures looked like a flat open hand with the palm tilted back from the wrist. So, if you also had another gesture with the hand flat but tilted forward, and wanted to tell these apart, then the palm pose relative to the wrist would be essential, and in that case you wouldn't want to remove the second (palm-to-wrist) pose, pose[1], from the input vector.

By the same reasoning, if you plan to distinguish any two gestures strictly by their orientation with respect to the camera, such as thumbs up vs thumbs down, then you would need to include pose[0] (the wrist pose) in the input vector.
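As a rough illustration of the zeroing idea (the constant FLOATS_PER_POSE and the assumption that the wrist pose comes first in the flattened input are mine, not from the project; whatever you do here must also be applied to the training data so the layouts match):

// Zero out pose[0] (the wrist pose) in the flattened input vector so the
// classifier cannot key on camera-relative hand orientation.
const int FLOATS_PER_POSE = 7;   // assumed layout: 3 position + 4 quaternion values per pose
for (int k = 0; k < FLOATS_PER_POSE; ++k)
    cnn_input[k] = 0.0f;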

Is it wrong that I am using a makefile setup?

BTW, nothing wrong with you using the makefile setup; it's the right answer since that works best for you and your environment. I just mentioned that I can't compile that project out-of-the-box on my system since I'm not set up with make, the clang compiler, etc. No worries.

miguelml99 commented 3 years ago

@melax The issue was not exactly a lack of coverage; the classification algorithm does take into account all hand angles and does not get stuck on the first pose. The problem was that I made a silly mistake when loading the angles vector input in the realtime tracker: I was inserting a 0 value between each joint pose, so the classifier could not properly recognize that input, since it was not trained on it. I have already fixed that and the system now performs pretty well 👍. Thank you very much for the support. I am sorry that it was such a simple error in the end. :(
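For reference, a sketch of the corrected extraction under that description: the same loop as before, just without the stray push_back(0) between joint poses, so the realtime input layout matches what the classifier was trained on:

std::vector<float> cnn_input;
for (int i = 0; i < htk.handmodel.GetPose().size(); i++) {
    for (int j = 0; j < 3; j++)
        cnn_input.push_back(htk.handmodel.GetPose()[i].position[j]);     // position x, y, z
    for (int j = 0; j < 4; j++)
        cnn_input.push_back(htk.handmodel.GetPose()[i].orientation[j]);  // orientation quaternion
}
auto cnn_out = cnn2.Eval(cnn_input);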

A couple of last things:

melax commented 3 years ago

There's a lot of software needed to build an entire hand tracking system, including machine learning, physical model simulation, data collection, depth camera processing, graphical visualization, and user interface. For any of these (or sub-components of these), there are various libraries or software packages out there. There was a conscious effort not to impose any particular library or convention onto anyone, nor to make people install a number of bloatware things they may not want on their system. By minimizing dependencies, the goal was to make it easier to compile and run. PC was the primary target, but things were kept portable to enable usage on other operating systems or to swap out a subcomponent such as the depth camera. A consequence of such minimalism is that the code is quite dense, lacks extra features, and, unlike battle-tested off-the-shelf libs, doesn't have any Stack Overflow discussions. At the time of development, it wasn't clear which set of libraries would become de facto standards. Had this software been developed/released today, it probably would have made sense to use imgui, nlohmann's json, PyTorch (if it's available on Windows), and GLFW (for the windows as well).

So to answer your questions:

Is there any support or library reference for the glwin.h class on which the OpenGL windows of the repository are based?

So, unfortunately, no. It's meant to be as small as possible, just enough to open up an OpenGL window and let the 3D graphics programmer do the rest from scratch.

I wanted to show in the realtime-tracker glwin window a picture of the predicted gesture below the tracking simulation; I've tried different ways but haven't successfully accomplished it.

Not a lot of support for that. Using OpenGL, one would load the images, create textures for them, and draw them on a quad after drawing everything else. If set up with a better GUI toolkit, such as imgui, it would be clear how to do it and would likely only take one line of code.
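A rough sketch of that OpenGL route, assuming the immediate-mode style used elsewhere in the samples; gesture_rgb (a 64x64 RGB byte buffer you load yourself) and the overlay coordinates are placeholders, not anything provided by glwin.h:

// assumes the OpenGL header the project already includes (e.g. <GL/gl.h>) is available

// one-time setup: upload the gesture preview image as a texture
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 64, 64, 0, GL_RGB, GL_UNSIGNED_BYTE, gesture_rgb);

// each frame, after drawing the 3D scene: switch to a 2D overlay and draw a textured quad
glMatrixMode(GL_PROJECTION); glPushMatrix(); glLoadIdentity();
glOrtho(0, 1, 0, 1, -1, 1);                       // normalized window coordinates
glMatrixMode(GL_MODELVIEW);  glPushMatrix(); glLoadIdentity();
glDisable(GL_DEPTH_TEST);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, tex);
glBegin(GL_QUADS);                                // small quad in the lower-left corner
glTexCoord2f(0, 0); glVertex2f(0.02f, 0.02f);
glTexCoord2f(1, 0); glVertex2f(0.25f, 0.02f);
glTexCoord2f(1, 1); glVertex2f(0.25f, 0.25f);
glTexCoord2f(0, 1); glVertex2f(0.02f, 0.25f);
glEnd();
glDisable(GL_TEXTURE_2D);
glEnable(GL_DEPTH_TEST);
glMatrixMode(GL_PROJECTION); glPopMatrix();
glMatrixMode(GL_MODELVIEW);  glPopMatrix();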

Regarding the MSE error obtained from the "train" method in cnn.h: why does it always start around 0.4-0.35, and what units is it working in? Is it the result of subtracting the predicted output value for the corresponding gesture from 1?

Interesting question. The units are just whatever the output of the network is, so expect the MSE to be at a comparable scale. Each output node would be expected to produce a value (e.g. 0 or 1 for softmax outputs), but a randomly initialized network would have (somewhat) random output (between 0 and 1 for sigmoid or softmax node outputs). I'm not sure what the real distribution would be in practice (it might skew toward or away from the edges given multiple layers), but a quick experiment using uniformly distributed output between 0 and 1 confirms this gives an MSE of 1/3 (which isn't that far off your observed 0.35-0.4).

#include <iostream>
#include <cmath>
float p2(float x){return x*x;}
int main() {
    float sse=0;
    for(int i=0;i<100;i++)
        sse += p2(((float)i+0.5f)/100.0f);  // assume uniform dist
    float mse = sse/100.0f;  // mean square error n=100
    float rmse = std::sqrt(mse);  // root mean square error
    std::cout << "mse  " << mse << std::endl;
    std::cout << "rmse " << rmse << std::endl;
    return 0;
}

output:

mse  0.333325
rmse 0.577343

Some additional things worth mentioning... The intent of the hand tracking samples release was to help/encourage software developers doing development with depth cameras, while the HW company would focus on the camera development. There wasn't one specific target audience for this repo. Perhaps a machine learning team might learn from the physical model components and integrate those ideas into their own ML library. 3D graphics developers likely have their own graphics setup and rigid body simulation, but might want to see how to integrate depth sensing and/or machine learning into their pipeline. Given how many new topics are introduced at once, it's quite ambitious for you, a single individual at the undergraduate thesis level, to take on such an endeavor and tackle multiple learning curves at once. Furthermore, the change in depth camera, and in library API, must have created an extra obstacle to navigate. Well done. If it helps, feel free to tell your advisor I mentioned that. :)

miguelml99 commented 3 years ago

Hi @melax

First of all, thank you very much for your extended response. You covered all my questions.

Regarding my OpenGL issue, for the moment I am just writing the resulting gesture prediction in the GUI next to the 3D hand graphics. If I have enough time before the deadline, I will try to implement the imgui library as you suggest.

I really appreciate that someone as experienced as you is helping me, an undergraduate, with his bachelor's thesis. I hope one day I get to develop my engineering career in Silicon Valley, as you have. I will make sure to include you in the acknowledgements section of my project. 😊

I feel flattered and honoured by your last comments. I will definitely tell my advisor.

Good luck, and thanks for everything 👐 👏