neuropsychology / NeuroKit.py

A Python Toolbox for Statistics and Neurophysiological Signal Processing (EEG, EDA, ECG, EMG...).
http://neurokit.rtfd.io
MIT License
363 stars 102 forks source link

HRV has an algorithmic issue - ecg_hrv #10

Closed bbonik closed 7 years ago

bbonik commented 7 years ago

There is an omission in the heart rate variability calculations in function ecg_hrv(...) that may potentially affect the results, especially for the spectral domain methods.

The calculation of HRV is based on the RR intervals (rri variable in the code). However, rri is indexed at every heartbeat, so it is not uniformly sampled. A common practice is to interpolate the original rri signal and then resample it uniformly. Every subsequent HRV computation should occur in the resampled interpolated rri signal, not the original one that was computed just by subtracting adjacent RRs. If this does not happen, and the HRV computations are based on the original rri, the spectrum will not be a function of frequency but instead it will be a function of cycles per heartbeat.

The following papers decribe this issue in detail:

https://www.researchgate.net/publication/225086966_Role_of_Editing_of_R-R_Intervals_in_the_Analysis_of_Heart_Rate_Variability

http://www.apsipa.org/proceedings_2014/Data/paper/1366.pdf

Inside the ecg_hrv function, which is included in bio_ecg.py, in lines 563-565, there is an attempt to resample the raw rri values. However, what is done is only a scaling of RR values to a sampling rate of 1000Hz. It is not an actual inerpolation and resampling.

So to summarize, I think that the following proceedure should be followed for the HRV measures:

  1. calcualtion of original rri (done already): rri=f(heartbeat)
  2. interpolation on the rri using 3rd order splines: rri_interp
  3. resample rri_interp (perhpas to 1KHz) in order to get rri2=f(t)
  4. all subsequent HRV calculations will be based on rri2
DominiqueMakowski commented 7 years ago

@bbonik Thanks a lot for noticing this issue and suggesting improvements!

So I think you're reffering to these lines within ecg_hrv:

# Resample RRis as if the sampling rate was 1000Hz
rri = rri*1000/sampling_rate
rri = rri.astype(float)

# Preprocessing
outliers = np.array(identify_outliers(rri, treshold=2.58))
rri[outliers] = np.nan
rri = pd.Series(rri).interpolate()

If i understand you correctly, you suggest to change the order of the steps, such as:

# Interpolation using 3rd order splines
rri = pd.Series(rri).interpolate(method="cubic")

# Resample RRis as if the sampling rate was 1000Hz
rri = rri*1000/sampling_rate

# Outliers treatment
outliers = np.array(identify_outliers(rri, treshold=2.58))
rri[outliers] = np.nan
rri = pd.Series(rri).interpolate()

Would that be better? Again, thanks for your comment 😄

DominiqueMakowski commented 7 years ago

Alright, so following your advices I've reworked the RRis preprocessing a bit:

    # Initialize empty dict
    hrv = {}

    # Preprocessing
    # ==================
    # Basic resampling to 1000Hz to standardize the scale
    rri = rri*1000/sampling_rate
    rri = rri.astype(float)

    # Artifact detection - Statistical
    for index, rr in enumerate(rri):
        # Remove RR intervals that differ more than 25% from the previous one
        if rri[index] < rri[index-1]*0.75:
            rri[index] = np.nan
        if rri[index] > rri[index-1]*1.25:
            rri[index] = np.nan

    # Artifact detection - Physiological (http://emedicine.medscape.com/article/2172196-overview)
    rri = pd.Series(rri)
    rri[rri < 600] = np.nan
    rri[rri > 1300] = np.nan

    # Artifact treatment
    hrv["n_Artifacts"] = pd.isnull(rri).sum()/len(rri)
    if artifacts_treatment == "deletion":
        rri = rri.dropna()
    if artifacts_treatment == "interpolation":
        rri = pd.Series(rri).interpolate(method="cubic")  # Interpolate using a 3rd order spline
        rri = rri.dropna()  # Remove first and lasts NaNs that cannot be interpolated

Basically it is a three steps approach, 1) Standardize the scale (to be in milliseconds), 2) finding the artifacts (statistically and physiologically) and 3) artifacts treatment with a new parameter to control if interpolation is used or not. Indeed, some papers (this one) suggest that a simple deletion is a better method... Anyway, I left 3rd order spline interpolation as the default. I've also stored the percentage of detected artifacts.

What do you think?

bbonik commented 7 years ago

I will try to explain what I mean with some code. I specifically use different names for the variables. I substitute rri with RRIbeat to highlight that it is indexed on the heart beats. Later I compute the RRIs as a function of time (not as a function of heartbeat). This is an important step for spectral HRV methods. Same approach is also used in the hrv toolbox.

  % assumed inputs
  % R[ ]: an array of the samples on which an R spike is detected (outputed from the bioSSPy toolbox)
  % frequency: the frequency (in Hz) which was used to sample the original ECG signal

  RRIbeat = np.diff(R) #RR intervals (in samples) indexed per beat
  RRIbeat = RRIbeat.astype(float)
  RRIbeat= RRIbeat/frequency #RR intervals (in sec) indexed per beat

  # your outlier detection code here (statistical and physiological)

  # dealing with outliers, resampling and converting RR intervals to a function of time (not a function of beats)

  outlierIndices=np.argwhere(np.isnan(RRIbeat)) #get the indices of the outliers
  RRIbeat=np.delete(RRIbeat,outlierIndices) #remove the outliers from the RRIbeat signal

  # resampling RRIbeat signal in order to make it a function of time (not a function of beat) 

  beatTime=R[1:] # the time (in samples) at which each beat occured (and thus the RR intervals occured) starting from the 2nd beat
  beatTime=np.delete(beatTime,outlierIndices) #delete also the outlier beat moments
  beatTime=beatTime/frequency # the time (in sec) at which each beat occured, starting from the 2nd beat
  beatTime=beatTime-beatTime[0] #offseting in order to start from 0 sec

  # fit a 3rd degree spline in the RRIbeat data. s=0 guarantees that it will pass through ALL the given points
  spline = interpolate.splrep(x=beatTime, y=RRIbeat, k=3, s=0)

  # RR intervals indexed per time (as oposed to per beat in RRIbeat). Resampled at 1000Hz                                         
  RRI = interpolate.splev(x=np.arange(0,beatTime[-1],1/1000), tck=spline, der=0)

  # plotting

  plt.figure(figsize=(20, 3), dpi=100)
  plt.scatter(np.arange(RRIbeat.shape[0]),RRIbeat,s=1)
  plt.grid(True)
  plt.title('RRI indexed per beat')
  plt.ylabel('RR interval duration (sec)')
  plt.xlabel('#beat')
  plt.tight_layout()
  plt.show()

  plt.figure(figsize=(20, 3), dpi=100)
  plt.plot(RRI,'b-',linewidth=0.5)
  plt.grid(True)
  plt.title('Resampled RRI indexed per time')
  plt.ylabel('RR interval duration (sec)')
  plt.xlabel('Time (sec)')
  plt.tight_layout()
  plt.show()
DominiqueMakowski commented 7 years ago

@bbonik First of all, let me say thank you for taking the time to write some well-documented code, it was a pleasure to read it (even if the most mathematical bits (the spline fitting) remain (and will remain) quite nebulous for me 😆 ).

Now I understand what you mean by "as a function of time", as a sort of continuous signal. I didn't get that the first time.

Well your code works nice, I suggest we keep it. I made a few changes regarding the variable names (to maintain some coherence accross the package).

However, I have noticed something odd. Once I got my HRV "signal" (ie, a continuous signal function of time), I've tried to pass it to the main signal dataframe. However, if I plot the HRV signal with the ECG signal, The HRV seems quite shorter and doesn't go to the last beat (see attached figure). Is that normal?

The code is pasted below and is also, along with the data, available here

import numpy as np
import pandas as pd
import neurokit as nk
import matplotlib.pyplot as plt
import scipy

df = pd.read_csv('normal_ECG.csv')
df = df.loc[10000:20000]  # Select 10s of signal

sampling_rate=1000
df = nk.bio_process(ecg=df["ECG"], sampling_rate=sampling_rate)

# Extract useful things
rri = df["ECG"]['RR_Intervals']
rpeaks = df["ECG"]['R_Peaks']

# Initialize empty dict
hrv = {}

# Preprocessing
# ==================
# Basic resampling to 1Hz to standardize the scale
rri = rri/sampling_rate
rri = rri.astype(float)

# Artifact detection - Statistical
for index, rr in enumerate(rri):
    # Remove RR intervals that differ more than 25% from the previous one
    if rri[index] < rri[index-1]*0.75:
        rri[index] = np.nan
    if rri[index] > rri[index-1]*1.25:
        rri[index] = np.nan

# Artifact detection - Physiological (http://emedicine.medscape.com/article/2172196-overview)
rri = pd.Series(rri)
rri[rri < 0.6] = np.nan
rri[rri > 1.3] = np.nan

# Artifacts treatment
hrv["n_Artifacts"] = pd.isnull(rri).sum()/len(rri)
artifacts_indices = rri.index[rri.isnull()]  # get the indices of the outliers
rri = rri.drop(artifacts_indices)  # remove the artifacts

# resampling rri signal in order to make it a function of time
beat_time = rpeaks[1:]  # the time (in samples) at which each beat occured (and thus the RR intervals occured) starting from the 2nd beat
beat_time = np.delete(beat_time, artifacts_indices)  # delete also the artifact beat moments
beat_time = beat_time/sampling_rate  # the time (in sec) at which each beat occured, starting from the 2nd beat
beat_time = beat_time-beat_time[0]  # offseting in order to start from 0 sec

# fit a 3rd degree spline on the beat_time data.
spline = scipy.interpolate.splrep(x=beat_time, y=rri, k=3, s=0)  # s=0 guarantees that it will pass through ALL the given points

# Get the RR intervals indexed per time
rri_new = scipy.interpolate.splev(x=np.arange(0, beat_time[-1], 1/sampling_rate), tck=spline, der=0)

# CHECK
#==============

# plotting
plt.figure(figsize=(20, 3), dpi=100)
plt.scatter(np.arange(rri_new.shape[0]), rri_new,s=1)
plt.grid(True)
plt.title('RRI indexed per beat')
plt.ylabel('RR interval duration (sec)')
plt.xlabel('#beat')
plt.tight_layout()
plt.show()

plt.figure(figsize=(20, 3), dpi=100)
plt.plot(rri,'b-',linewidth=0.5)
plt.grid(True)
plt.title('Resampled RRI indexed per time')
plt.ylabel('RR interval duration (sec)')
plt.xlabel('Time (sec)')
plt.tight_layout()
plt.show()

data = df["df"][["ECG_Filtered"]].copy()
data["ECG_HRV"] = np.nan
data["ECG_HRV"].ix[rpeaks[0]+1:rpeaks[0]+len(rri_new)] = rri_new

#  nk.z_score(data).plot()
nk.plot_events_in_signal(data, rpeaks)

hrv

bbonik commented 7 years ago

Hi. Some things to take note:

  1. This type HRV signal is used mainly for the frequency domain analysis. Not necessarily for the time domain analysis. This means that you may have 2 HRV signals. One for time domain analysis (rri) one for the frequency domain analysis (rri_new)
  2. The rri signal will always be 1 beat shorter, since it is computed as a difference between heart beats.
  3. After the outlier detection, if it happens that you discard some outliers, the HRV signal (without the outliers) will be shorter compared to the original length of the ECG which contains also the outliers.

Regards

DominiqueMakowski commented 7 years ago

@bbonik I've made the suggested changes. It's way better now, although some HRV indices computation have to be checked by an expert. I'm closing this issue, as the main topic has been addressed. Again, thanks A LOT. Feel free to make any other comments/contributions!