Regarding the resolution of heart rate

408550969 commented 5 months ago

When I set fs to 30 and continuously sample 180 frames of data, then according to the function:

    def _calculate_fft_hr(ppg_signal, fs=60, low_pass=0.75, high_pass=2.5):
        """Calculate heart rate based on PPG using Fast Fourier transform (FFT)."""
        ppg_signal = np.expand_dims(ppg_signal, 0)
        N = _next_power_of_2(ppg_signal.shape[1])
        f_ppg, pxx_ppg = scipy.signal.periodogram(ppg_signal, fs=fs, nfft=N, detrend=False)
        fmask_ppg = np.argwhere((f_ppg >= low_pass) & (f_ppg <= high_pass))
        mask_ppg = np.take(f_ppg, fmask_ppg)
        mask_pxx = np.take(pxx_ppg, fmask_ppg)
        fft_hr = np.take(mask_ppg, np.argmax(mask_pxx, 0))[0] * 60
        return fft_hr

the resolution of the heart rate is about 7 bpm. Now I want to increase the resolution of the heart rate. My approach is to directly set N to 2048 when fs is 30. According to the formula 30 / 2048 * 60, the resolution of the heart rate is then less than 1 bpm, and my measurements agree with this. May I ask if I am doing this correctly?

girishvn commented 5 months ago

Hi @408550969, the HR frequency resolution with fs = 30 Hz and 180 frames (samples) is 30 / 180 Hz, giving a frequency resolution of 0.1667 Hz (10 bpm) per frequency bin.
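As a quick arithmetic check of the numbers in this thread, here is a minimal sketch. The helper names `bin_spacing_bpm` and `next_power_of_2` are hypothetical and only meant to mirror the padding step in the function quoted above; they are not part of any library API.

```python
def next_power_of_2(x: int) -> int:
    """Smallest power of two that is >= x (assumed to mirror the padding helper)."""
    return 1 if x == 0 else 2 ** (x - 1).bit_length()


def bin_spacing_bpm(fs: float, nfft: int) -> float:
    """Spacing between adjacent FFT bins, converted from Hz to beats per minute."""
    return fs * 60.0 / nfft


fs = 30.0
n_samples = 180

# Spacing implied by the 6 s of real data (no padding).
print(bin_spacing_bpm(fs, n_samples))                   # 10.0
# Spacing after padding 180 samples up to the next power of two (256).
print(bin_spacing_bpm(fs, next_power_of_2(n_samples)))  # 7.03125
# Spacing of the bin grid when nfft is forced to 2048.
print(bin_spacing_bpm(fs, 2048))                        # 0.87890625
```

The last two values describe how finely the spectrum is sampled, not how much the 6 seconds of data can actually distinguish.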

I would avoid upsampling to improve frequency resolution. This is a fallacy: the interpolated values are fully determined by the interpolation method you use, so upsampling WILL NOT introduce additional real/useful frequency content to your signal.

Also note that a longer observation window (e.g. 2048 samples, ~70 seconds) will provide a single measure of HR for that roughly minute-long span (albeit with higher frequency resolution). This is the average HR for that observation period, and it begins to lose its utility when averaged over too long a time period.
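To illustrate the point, here is a rough sketch using NumPy only (not the toolbox code). It assumes a clean synthetic 72 bpm sinusoid, and `half_power_width_bpm` is a hypothetical helper that measures how wide the spectral peak is at half its maximum power. Padding a 6-second window to a larger `nfft` samples the same broad peak on a finer grid, while a genuinely longer recording narrows the peak.

```python
import numpy as np


def half_power_width_bpm(signal, fs, nfft):
    """Approximate width (in bpm) of the spectral peak at half its maximum power."""
    # rfft zero-pads the signal when nfft is larger than len(signal).
    power = np.abs(np.fft.rfft(signal, n=nfft)) ** 2
    n_bins_above_half = np.count_nonzero(power >= 0.5 * power.max())
    return n_bins_above_half * fs * 60.0 / nfft


fs = 30.0
hr_hz = 1.2  # synthetic 72 bpm pulse, an assumption for illustration

short = np.sin(2 * np.pi * hr_hz * np.arange(180) / fs)    # 6 s of data
long_ = np.sin(2 * np.pi * hr_hz * np.arange(2048) / fs)   # ~68 s of data

w_short_2048 = half_power_width_bpm(short, fs, 2048)
w_short_16384 = half_power_width_bpm(short, fs, 16384)
w_long_16384 = half_power_width_bpm(long_, fs, 16384)

print(f"6 s window, nfft=2048:   {w_short_2048:.2f} bpm wide")
print(f"6 s window, nfft=16384:  {w_short_16384:.2f} bpm wide")
print(f"68 s window, nfft=16384: {w_long_16384:.2f} bpm wide")
```

For a single clean tone the finer grid still helps locate the top of the peak, but noise or a second component within that broad peak cannot be separated by padding alone.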

408550969 commented 5 months ago

Assuming I develop an application for heart rate detection, it is unacceptable for users to stare at the camera for up to one minute when using it. Is there any other way besides increasing the length of N (the FPS of the camera cannot be modified)?

girishvn commented 5 months ago

I would not say that it is unacceptable. Rather, I would say that a more real-time estimate is preferred. It is a trade-off between the length of the capture required and the frequency resolution / SNR.

I think a minute span is reasonable, though it may be possible to reduce the interaction to ~30 seconds. It is also worth noting that higher frequency resolution does not always equate to a more useful heart rate (e.g. clinically 71 and 71.89 are essentially the same).

girishvn commented 5 months ago

To capture more sophisticated features like pulse wave morphology, a higher sampling rate may be required, but to capture the heart rate 30 FPS is fine. I would argue that sub-beat frequency resolution is more than enough in a practical sense.

408550969 commented 5 months ago

Thanks, I have three more questions:

1. The default CHUNK-LENGTH is 180, which means the model predicts every 180 frames, and the final 1-minute MAE is obtained by averaging. Assuming the camera runs at 30 frames per second, the result is the average of 10 MAEs, each over 180 frames. Is this similar to averaging 10 heart rate results, each over 180 frames, to calculate a heart rate for 1 minute?
2. When predicting on 180 frames (i.e. predicting every 6 seconds with N set to 2048), I found that accurate prediction is only possible when the face is stationary, and speaking can cause drastic fluctuations in the results. Is this normal? (My training set includes speech scenes.)
3. Sometimes, when predicting on 180 frames, the predicted result is twice or one-half of the true value. According to ChatGPT this is caused by harmonics, and I have tried the Welch method as well. So, what post-processing can we do to avoid this, or to detect that the value is a harmonic?

girishvn commented 5 months ago

1. The current chunk-length is 180 (6 seconds if fs = 30). This is similar, but note that HR != MAE. MAE is the error on the HR estimation. If you are interested in deriving a single MAE value and a single HR estimate for a minute-long video, you can change the chunk-length to 1800 (60 seconds).
2. Yes, speaking and motion artifacts can cause significant loss of performance in results. I would refer to the section Motion Augmented Training in the README to learn about how motion augmentations can be used to improve in these scenarios.
3. Do you band-pass filter your predicted signal around the HR frequency range? If not, this may help. Otherwise it is possible that there are harmonics present. You can try to implement a harmonic comb or similar functionality, though it is not always straightforward to do.
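As one possible starting point for the harmonic question, here is a minimal sketch of a much simpler check than a full harmonic comb. It is an illustrative heuristic, not something the toolbox provides: the function name, the `ratio` threshold and the `tol_hz` tolerance are all assumptions. It picks the strongest in-band peak and, if there is comparable power near half that frequency, prefers the lower candidate.

```python
import numpy as np


def fft_hr_with_half_check(signal, fs, low=0.75, high=2.5, nfft=2048,
                           ratio=0.5, tol_hz=0.1):
    """Return (naive_hr_bpm, checked_hr_bpm) for a 1-D pulse signal.

    Hypothetical heuristic: if the power near half the peak frequency is at
    least `ratio` times the peak power, treat the peak as a second harmonic.
    """
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal, n=nfft)) ** 2

    # Restrict the search to the plausible heart-rate band.
    band = (freqs >= low) & (freqs <= high)
    band_freqs, band_power = freqs[band], power[band]
    peak = np.argmax(band_power)
    peak_freq, peak_power = band_freqs[peak], band_power[peak]

    # Look for a strong in-band component near half the peak frequency.
    near_half = np.abs(band_freqs - peak_freq / 2.0) <= tol_hz
    checked_freq = peak_freq
    if near_half.any():
        cand = np.argmax(np.where(near_half, band_power, -np.inf))
        if band_power[cand] >= ratio * peak_power:
            checked_freq = band_freqs[cand]
    return peak_freq * 60.0, checked_freq * 60.0


fs = 30.0
t = np.arange(180) / fs
# Synthetic example: 60 bpm fundamental with a stronger second harmonic.
harmonic_sig = np.sin(2 * np.pi * 1.0 * t) + 1.2 * np.sin(2 * np.pi * 2.0 * t)
naive_hr, checked_hr = fft_hr_with_half_check(harmonic_sig, fs)

# A clean 90 bpm tone should be left unchanged by the check.
clean_sig = np.sin(2 * np.pi * 1.5 * t)
clean_naive, clean_checked = fft_hr_with_half_check(clean_sig, fs)
print(naive_hr, checked_hr, clean_naive, clean_checked)
```

A check like this can misfire when the true heart rate is high and motion or speech adds energy near half of it, so the threshold would need tuning on real data.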

408550969 commented 5 months ago

1. Why is the default chunk length set to 180 during training, instead of 1800 or 90? Does this have any impact on accuracy?

2. What is the appropriate ratio for controlling static and moving scenes? My current ratio of static to speech scene duration is 3:1.

3. I have already added a band-pass filter, and I will consider adding some post-processing to alleviate these issues in the future.

4. Another question is, which model is more robust to speaking and motion artifacts? I am currently using EfficientPhys.

girishvn commented 5 months ago

1. The chunk length depends on A) the amount of VRAM on the GPU you are running on, which decides the size of a batch and thus the level of batch variance, and B) an understanding of how difficult the task is. The longer the clip, the easier the task, as you are averaging heart rate across more time and are thus able to average out some noise. 180 was chosen to match existing literature. I would consider 90 to also be a valid reporting metric.
2. This depends on the dataset - I do not have these numbers in mind.
4. As I mentioned, motion augmentations during training are important in helping alleviate motion noise. I would use any of the MA models we have released, e.g. MA-UBFC_efficientphys.pth. These days most models are comparable, so it is important to experiment to figure out what works best for your use case.
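To make the chunk-length discussion concrete, here is a small sketch of per-chunk HR estimation and the resulting MAE, using NumPy only. It assumes a clean synthetic 72 bpm signal, and `fft_hr` and `chunk_hrs` are hypothetical helpers written for this example rather than the toolbox's evaluation code. It also shows the distinction made earlier in the thread: averaging per-chunk estimates gives an HR, while MAE is the error of those estimates against a reference.

```python
import numpy as np


def next_power_of_2(x):
    """Smallest power of two that is >= x."""
    return 1 if x == 0 else 2 ** (x - 1).bit_length()


def fft_hr(signal, fs, low=0.75, high=2.5):
    """FFT-peak heart rate (bpm) of one chunk, padded to the next power of two."""
    nfft = next_power_of_2(len(signal))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal, n=nfft)) ** 2
    band = (freqs >= low) & (freqs <= high)
    return freqs[band][np.argmax(power[band])] * 60.0


def chunk_hrs(signal, fs, chunk_length):
    """One HR estimate per non-overlapping chunk of `chunk_length` samples."""
    n_chunks = len(signal) // chunk_length
    return np.array([
        fft_hr(signal[i * chunk_length:(i + 1) * chunk_length], fs)
        for i in range(n_chunks)
    ])


fs = 30.0
true_hr = 72.0  # reference HR of the synthetic signal, in bpm
t = np.arange(1800) / fs  # 60 s at 30 FPS
ppg = np.sin(2 * np.pi * (true_hr / 60.0) * t)

hr_180 = chunk_hrs(ppg, fs, 180)     # ten 6-second estimates
hr_1800 = chunk_hrs(ppg, fs, 1800)   # one 60-second estimate

mean_hr_180 = hr_180.mean()                      # an HR value
mae_180 = np.mean(np.abs(hr_180 - true_hr))      # an error value
mae_1800 = np.mean(np.abs(hr_1800 - true_hr))
print(len(hr_180), mean_hr_180, mae_180, mae_1800)
```

On real, noisy predictions the gap between the two chunk lengths will differ from this idealized case, so the numbers here only show the mechanics.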

408550969 commented 5 months ago

Thank you very much for your answer!

ubicomplab / rPPG-Toolbox

Regarding the resolution of heart rate #284