pulselabteam / PulseDB


What is the detailed order of signal pre-processing? #7

Closed ChengHanChiu closed 1 year ago

ChengHanChiu commented 1 year ago

I would like to know the exact pre-processing order.

In the 2.4.2 section of the paper, it is mentioned that "The PPG signal was filtered with a 4th order Chebyshev-II filter at [0.5,8] Hz before presenting to the Elgendi’s algorithm." In section 2.5, it is stated that "After extracting the characteristic points from the records, we selected high-quality segments from the records to form the cleaned PulseDB dataset. Data selection is conducted by dividing each record into 10-s non-overlapping segments, and determining whether to include or discard each segment."

Based on this information, I understand that the correct order is PPG_Raw -> PPG_F -> 10s segment. However, it seems that the precise timing of normalization is not mentioned. In the Supplementary Material, I found the following description: "The amplitude of ECG Raw and PPG Raw signals were linearly remapped between 0 and 1," and "These raw signals can be filtered with user-defined settings to be used as inputs or outputs that fit best to the desired BP estimation method." Therefore, I assume the correct order should be PPG_Raw -> PPG_Norm -> PPG_F -> 10s segment, which aligns with my observation that each 10s segment of PPG_F falls within the range of 0 to 1.

I am concerned about potential data leakage if Min-Max normalization is applied directly to the entire MIMIC-III original waveform (where all the signals have valid numerical sample values). In particular, for Group A's CalBased_Test_Subset, would it be better to fit the Min-Max normalizer on the training set during the training phase and apply the fitted normalizer to the test data at inference time?

I hope I have clearly described the points that confuse me. Thanks

WeinanWang-RU commented 1 year ago

The procedure for processing ECG and PPG in PulseDB is:

  1. Retrieve the record with raw signals from MIMIC/VitalDB.
  2. Apply filters to the raw record to yield the filtered signal (while still keeping the raw signal). Think of it as adding new channels to each record: when downloaded from MIMIC/VitalDB, each record has only 2 channels, raw ECG and raw PPG. After filtering, there are 4 channels in total: raw ECG, raw PPG, filtered ECG, and filtered PPG.
  3. Extract characteristic points.
  4. Divide the record into 10-s non-overlapping segments. This divides all synced channels into 10-s segments, so each segment has both the raw and the filtered ECG and PPG signals.
  5. Within each segment, remap the raw and filtered ECG and PPG signals to [0, 1].
  6. Select segments based on an evaluation of signal quality.
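The filtering, segmentation, and per-segment remapping steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the PulseDB code: the 125 Hz sampling rate and the 40 dB stop-band attenuation of the Chebyshev-II filter are assumptions (only the 4th order and [0.5, 8] Hz band are stated in the paper), and the synthetic sinusoid stands in for a real MIMIC/VitalDB record.

```python
import numpy as np
from scipy.signal import cheby2, filtfilt

FS = 125            # assumed sampling rate (Hz)
SEG_LEN = 10 * FS   # samples per 10-s segment

def bandpass_ppg(raw, fs=FS):
    """Step 2: 4th-order Chebyshev-II band-pass at [0.5, 8] Hz (Sec. 2.4.2).
    The 40 dB stop-band attenuation is an illustrative choice."""
    b, a = cheby2(4, 40, [0.5, 8], btype="bandpass", fs=fs)
    return filtfilt(b, a, raw)

def segments(record):
    """Step 4: split a record into non-overlapping 10-s segments
    (the trailing remainder is dropped)."""
    n = len(record) // SEG_LEN
    return record[: n * SEG_LEN].reshape(n, SEG_LEN)

def minmax_segment(x):
    """Step 5: remap one segment to [0, 1] using only that segment's extrema."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

# Steps 2 -> 4 -> 5 on a synthetic 60-s PPG-like record (~72 bpm):
raw = np.sin(2 * np.pi * 1.2 * np.arange(60 * FS) / FS) + 0.1
filtered = bandpass_ppg(raw)
segs = np.array([minmax_segment(s) for s in segments(filtered)])
print(segs.shape)               # (6, 1250)
print(segs.min(), segs.max())   # 0.0 1.0 -- each segment spans [0, 1]
```

Note that because the remap in step 5 is applied within each segment, every 10-s segment individually spans [0, 1], which matches the observation about PPG_F in the question above.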

There is no data leakage: given a new 10-s segment for testing, you can always remap the signal within that segment to [0, 1] before using it as input to the model.
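To see why: the per-segment remap uses only the incoming segment's own extrema, so no statistic learned from the training set is needed at inference time, unlike a scaler fitted on training data. A minimal sketch (the segment values are made up):

```python
import numpy as np

def remap01(seg):
    # Uses only this segment's own min/max -- nothing from the training set.
    return (seg - seg.min()) / (seg.max() - seg.min())

# An unseen test segment is normalized in complete isolation:
new_segment = np.array([0.2, 0.8, 1.4, 0.5])
normalized = remap01(new_segment)
print(normalized.min(), normalized.max())  # 0.0 1.0
```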

ChengHanChiu commented 1 year ago

Very clear explanation, thank you very much!