helme / ecg_ptbxl_benchmarking

Public repository associated with "Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL"
GNU General Public License v3.0
195 stars 86 forks

Preprocessing problem #8

Closed kirtov closed 3 years ago

kirtov commented 3 years ago

Hello, thank you for the great work, it is very useful!

I am trying to understand the ECG data preprocessing. Looking at the "raw" PTB-XL dataset, I see that the mean value of the ECGs is near 0.0 and the std is 0.1-0.2. This differs from, e.g., ECGs recorded by an Apple Watch (their amplitude is much greater than in PTB-XL), so I think the PTB-XL ECGs were normalized somehow. Could you please clarify how the PTB-XL data was preprocessed?

helme commented 3 years ago

Hi @kirtov

All wfdb files in data/ptbxl/records100/ and data/ptbxl/records500/ hold the raw signals in millivolts, i.e. without any preprocessing. Please note that wfdb.rdsamp already applies the ADC gain (the analog-to-digital converter has 16-bit resolution at 1 μV/LSB, i.e. 1000 A/D units per mV), so the values it returns are in physical units. In addition, I think a mean of 0.0 mV and a std of around 0.1-0.2 mV is expected. What are the units and statistics for the Apple Watch? Is it also 12-lead? I really don't know. Nevertheless, to circumvent scaling issues, it is recommended to always standardize your data sources to mean 0 and std 1.
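To illustrate the standardization advice, here is a minimal sketch, not the repository's own preprocessing code. It assumes the signals are already loaded as NumPy arrays of shape (records, samples, leads). The helper names `fit_standardizer` and `standardize` and the synthetic arrays are hypothetical stand-ins for real recordings; the "other device" amplitudes are made up purely to show a different scale.

```python
import numpy as np


def fit_standardizer(signals):
    """Return per-lead mean and std for an array of shape (records, samples, leads)."""
    mean = signals.mean(axis=(0, 1))
    std = signals.std(axis=(0, 1))
    return mean, std


def standardize(signals, mean, std, eps=1e-8):
    """Scale each lead to roughly mean 0 and std 1 (eps guards against division by zero)."""
    return (signals - mean) / (std + eps)


rng = np.random.default_rng(0)

# Synthetic stand-in for PTB-XL-like data: 12 leads, amplitudes in mV.
ptbxl_like = rng.normal(0.0, 0.15, size=(8, 1000, 12))
# Synthetic stand-in for a single-lead source stored on a different scale.
other_device = rng.normal(0.0, 150.0, size=(8, 1000, 1))

# Fit the statistics separately per data source, then apply them.
ptbxl_std = standardize(ptbxl_like, *fit_standardizer(ptbxl_like))
other_std = standardize(other_device, *fit_standardizer(other_device))
```

After this step both sources have comparable amplitude statistics regardless of their original units. For model training, the mean and std would normally be fitted on the training split only and reused for validation and test data.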

Please note: our provided methods for loading raw data use wfdb.rdsamp and pickle the numpy arrays as floats. If you want to save space, you could undo the ADC conversion by calling r = wfdb.rdrecord(path) and storing (r.p_signal*r.adc_gain).astype(np.int16) as 16-bit arrays.
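The space-saving idea can be sketched without any files as a round trip between millivolt floats and 16-bit integers. This is only an illustration under stated assumptions: a uniform gain of 1000 A/D units per mV (the figure quoted above), a baseline of 0, and synthetic data in place of `r.p_signal`. The helpers `to_int16` and `to_millivolts` are hypothetical names. Unlike the plain `astype(np.int16)` above, the sketch rounds with `np.rint` first, since `astype` truncates toward zero and a floating-point product can land slightly below the intended integer.

```python
import numpy as np

ADC_GAIN = 1000.0  # A/D units per mV, as quoted in this thread (assumed uniform across leads)


def to_int16(p_signal_mv, adc_gain=ADC_GAIN):
    """Convert physical values in mV to 16-bit A/D units, rounding to the nearest unit."""
    digital = np.rint(p_signal_mv * adc_gain)
    limits = np.iinfo(np.int16)
    if digital.min() < limits.min or digital.max() > limits.max:
        raise ValueError("signal does not fit into int16 at this gain")
    return digital.astype(np.int16)


def to_millivolts(d_signal, adc_gain=ADC_GAIN):
    """Convert 16-bit A/D units back to physical values in mV."""
    return d_signal.astype(np.float64) / adc_gain


rng = np.random.default_rng(0)
# Synthetic 12-lead signal in mV, quantized to 1 μV steps like ADC output.
signal_mv = np.round(rng.normal(0.0, 0.15, size=(1000, 12)), 3)

stored = to_int16(signal_mv)      # compact representation for disk
restored = to_millivolts(stored)  # values to feed into a model
```

Here `stored` takes a quarter of the memory of the float64 array, and `restored` matches `signal_mv` up to floating-point precision. With real records, `wfdb.rdrecord(path, physical=False)` should also expose the digital samples directly as `d_signal`, which avoids the multiplication altogether.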

I hope this answers your question. Since I'm not aware of an existing issue here, I will close this issue for now.