Why is the output very chaotic when I use the model weights provided with the paper to estimate the BVP of a video I recorded myself? The predicted waveform is very messy and spiky, just like this. What should I do in this situation, author?
Hi @moon-ligh,
There could be a number of reasons, though honestly I don't think your detrended result is that bad. Why do you think it's bad - have you done further evaluation against some other device that can provide a source of ground truth? Without seeing your video itself, or at least knowing the general conditions in it (e.g., motion, lighting), it's very hard to tell what might be going on. You're welcome to blur out the face a bit before sharing here if that helps.
Hi @yahskapar,
Thank you for your reply. I use a camera to record videos at a frame rate of 25 fps, and during recording I try to keep my head as still as possible. After cropping the face, the frames are scaled to 72 x 72, like this. I then use the model weights provided with the paper to predict BVP and obtain one-dimensional signals, which I want to feed into a downstream emotion recognition (classification) task. However, the one-dimensional signal I predicted is not as "beautiful" and "standard" as in the paper. Do you think this is feasible?
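In case the details help, my preprocessing is roughly the sketch below. The Haar cascade face detector is just an example here (not necessarily what the toolbox's own preprocessing uses), and the file name is a placeholder:

```python
import cv2
import numpy as np

# Sketch of my preprocessing: read a 25 fps video, crop the face, resize to 72x72.
# The Haar cascade detector below is only an example face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("my_video.mp4")  # placeholder path, recorded at 25 fps
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        continue  # skip frames where no face is detected
    x, y, w, h = faces[0]
    crop = frame[y:y + h, x:x + w]
    frames.append(cv2.resize(crop, (72, 72)))
cap.release()

video = np.stack(frames)  # shape: (num_frames, 72, 72, 3)
```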
Hi @moon-ligh,
That exemplary image from one of your preprocessed videos seems fine to me.
I also think your model output from before is fine - ideally you should try multiple models (e.g., maybe TS-CAN and PhysFormer as well) to see how the predicted signal may differ. Also note there are different EfficientPhys models trained on different datasets as a part of this toolbox's public release of pre-trained models - have you tried all four of them for EfficientPhys (e.g., the ones trained on UBFC-rPPG, PURE, SCAMPS, or MA-UBFC)? It's totally possible one of those models may perform better than the others if the data domain ends up being similar enough to your data. You will really need a source of ground truth to compare to in order to evaluate your self-collected data rigorously - is this something you have?
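If you save the predicted 1-D signal from each of those pre-trained models (e.g., as numpy arrays), one quick sanity check is to see how well the models agree with each other on the same video. A rough sketch, where the file names are just placeholders for wherever you store your predictions:

```python
import numpy as np

# Placeholder file names -- one 1-D prediction per pre-trained EfficientPhys model,
# all produced from the same video so they are time-aligned.
preds = {
    "UBFC-rPPG": np.load("pred_efficientphys_ubfc.npy"),
    "PURE": np.load("pred_efficientphys_pure.npy"),
    "SCAMPS": np.load("pred_efficientphys_scamps.npy"),
    "MA-UBFC": np.load("pred_efficientphys_ma_ubfc.npy"),
}

# Normalize each signal, then check pairwise agreement between models.
norm = {k: (v - v.mean()) / (v.std() + 1e-8) for k, v in preds.items()}
names = list(norm)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        n = min(len(norm[a]), len(norm[b]))
        r = np.corrcoef(norm[a][:n], norm[b][:n])[0, 1]
        print(f"Pearson r ({a} vs {b}): {r:.3f}")
```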
Aside from this section of the README, I don't think we provided qualitative examples of model outputs in the paper, so I'm curious what you might be referring to as "beautiful" or "standard". Depending on the approach to emotion recognition, I don't see why such a signal as the one you have couldn't be used - though I should note I'm not familiar with specific details of those approaches.
@yahskapar Thank you for your suggestion. I mentioned earlier that the waveform I predicted may not be "standard" enough, i.e., it differs from what is shown in the paper. What I mean is that the waveform looks a bit messy, for example: the image above is the prediction result on my own video, spanning about 7 seconds. Compared with the result from the paper shown below, the peaks in my prediction have dips, for example around the first and fourth seconds; in the paper's result this basically does not occur. Is this normal?
This should be fine - again, my recommendation is you consider comparing it to a source of ground truth rather than analyzing specific signal features from just a model prediction. It's possible those notches are just a form of noise, or maybe the dicrotic notch. The latter is a bit more difficult to distinguish in my opinion, but you could try taking the derivative of the signal that you have (which should be the first-order derivative of the rPPG signal) to further analyze the signal. A somewhat decent visual example of first-order versus second-order, as well as the systolic versus dicrotic peaks, can be found here.
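If you want to try that, something as simple as the sketch below should be enough. It assumes your prediction is a 1-D numpy array saved to disk (the file name is a placeholder) and that it is sampled at your 25 fps video frame rate:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

fs = 25  # sampling rate = your video frame rate
rppg = np.load("predicted_bvp.npy")  # placeholder path to your 1-D prediction

# Lightly smooth before differentiating, otherwise the derivative is dominated by noise.
smoothed = savgol_filter(rppg, window_length=11, polyorder=3)
first_deriv = np.gradient(smoothed, 1.0 / fs)
second_deriv = np.gradient(first_deriv, 1.0 / fs)

t = np.arange(len(rppg)) / fs
fig, axes = plt.subplots(3, 1, sharex=True)
for ax, (sig, label) in zip(axes, [(smoothed, "predicted signal (smoothed)"),
                                   (first_deriv, "1st derivative"),
                                   (second_deriv, "2nd derivative")]):
    ax.plot(t, sig)
    ax.set_ylabel(label)
axes[-1].set_xlabel("Time (s)")
plt.show()
```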
Is your ultimate goal to use a pre-trained model as a part of a larger emotion recognition pipeline? If so, consider just going ahead and using whatever ground truth you may have for the emotion recognition aspect if you cannot get ground truth for these self-collected videos that you are focusing on. If you really think the rPPG portion of your pipeline is causing issues down the line, you can always try things like training from scratch, fine-tuning with weak supervision, or collecting ground truth (non-trivial) for supervised training.
EDIT: By the way, I also recommend visualizing the frequency domain of your signal. That should give you a good idea of where the most prominent peak is corresponding to the average heart rate, and may also give you an idea for where more significant noise may be.
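For example, something like this sketch (again assuming a 1-D numpy array saved to disk and a 25 fps frame rate; the 0.7-3 Hz heart-rate band is just a common choice, adjust as needed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import welch

fs = 25  # video frame rate
rppg = np.load("predicted_bvp.npy")  # placeholder path to your 1-D prediction

# Estimate the power spectral density of the (zero-mean) predicted signal.
f, pxx = welch(rppg - rppg.mean(), fs=fs, nperseg=min(len(rppg), 256))

# The dominant peak in roughly 0.7-3 Hz (42-180 bpm) should correspond to the
# average heart rate; substantial power outside that band hints at noise.
band = (f >= 0.7) & (f <= 3.0)
hr_bpm = f[band][np.argmax(pxx[band])] * 60.0
print(f"Estimated average HR: {hr_bpm:.1f} bpm")

plt.semilogy(f, pxx)
plt.axvspan(0.7, 3.0, alpha=0.2)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Power spectral density")
plt.title("Frequency content of the predicted rPPG signal")
plt.show()
```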
@yahskapar Yes, I want to try using the predicted BVP signal as part of the emotion recognition pipeline. I will carefully consider your suggestions. I really appreciate you taking the time to explain all of this to me, thanks!
No problem - I'm assuming you have no further questions for the time being, so I'll go ahead and close this issue. If you happen to run into any more problems or have new questions, feel free to create a new issue.