Closed debapratimj closed 4 months ago
From old pAD Detection notes
Point #2
Fringe cases (pAD): TLX221229a, TLX230416b, TLX230511b, TLX230530b
Clear cases: TLX221230a, TLX230411a, TLX230416c, TLX230524b, TLX230524c, TLX230622a
Bad detection (no pAD): TLX230511a
Plan: Mon and Tue, 22.1 and 23.1 (2 days), for Point #1: looking at how helpful the other factors are for classification. If they do not improve classification, we may drop them and instead use the phase portrait method.
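For reference, a phase portrait here means plotting dV/dt against V for each spike; axonal (pAD) and somatic spikes then trace visibly different loops that can be separated geometrically. A minimal sketch of that idea, where the helper name, toy trace, and sampling interval are all my own illustration and not repo code:

```python
import numpy as np

def phase_portrait(v, dt=1e-4):
    """Return (V, dV/dt) pairs for a voltage trace.

    Hypothetical helper for illustration only: the name and the
    default sampling interval `dt` are assumptions, not repo code.
    """
    dvdt = np.gradient(v, dt)  # central-difference derivative, mV/s
    return v, dvdt

# Toy spike-like trace: Gaussian bump riding on a -70 mV baseline
t = np.linspace(0, 0.01, 100)
v = -70 + 100 * np.exp(-((t - 0.005) ** 2) / (2 * 0.0005 ** 2))
v_vals, dvdt = phase_portrait(v, dt=t[1] - t[0])
# The (v_vals, dvdt) loop is what would be plotted and compared per spike
```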
Deliverable:
```python
pAD_detection(V_dataframe):
    '''
    Input :
        V_dataframe : np.array : Voltage dataframe as np array
    Current Output :
        peak_latencies_all
        v_thresholds_all
        peak_slope_all
        peak_heights_all
        pAD_df
    '''
```
Apart from the labels that we assign to spikes, I think we should also return a confidence measure for the classification, to help assess fringe cases and places where the detector fails (or probably might). Will include that 21/22 in the code.
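One way such a confidence measure could look, sketched under the assumption that we keep a scikit-learn KMeans classifier as in pAD_detection(); the margin-based score and every name below are my own illustration, not an agreed design:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_confidence(X, kmeans):
    """Per-spike confidence in [0, 1]: the margin between the distances
    to the two nearest centroids, normalised by their sum.
    Values near 0 mean the spike sits on the cluster boundary (a fringe case)."""
    d = np.sort(kmeans.transform(X), axis=1)  # distances to each centroid
    return (d[:, 1] - d[:, 0]) / (d[:, 1] + d[:, 0])

# Toy example: two well-separated clusters of spike features
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
km = KMeans(n_clusters=2, n_init=10).fit(X)
conf = cluster_confidence(X, km)  # low values flag spikes to inspect by hand
```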
Why is this function still here? What difference is there between this and the new one?
Same with this one: what is it and why is it here? I am having a lot of trouble cleaning your functions in action_potential_functions.py - @debapratimj please assist ASAP.
To https://github.com/jjb-hub/IGOR_phd.git 6cc0b52..5b9a2a6 pAD-detector-refinement -> pAD-detector-refinement
Cleaned up ap_characteristics_subroutine_linear as it is no longer used, fixed peak_slope to the true derivative value around the upshoot, and bounded ap_width (FWHM) to less than 2 ms.
Redundant pAD code kept here for my book-keeping; the following push will have just one pAD_detection() function. (In future we can modularise the function into classifiers and other parts.)
```python
def pAD_detection_old(V_dataframe, pca_plotting=False, kmeans_plotting=False, histogram_plotting=False):
    '''
    Input :
        V_dataframe   : np.array : Voltage dataframe as np array
    Output :
        ap_slopes     : list of lists containing action potential slopes (num of spikes x sweeps analysed)
        ap_thresholds : list of lists containing AP thresholds or turning points of the spike (num of spikes x sweeps analysed)
        ap_width      :
        ap_height     :
    '''
    peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_locs_corr_, upshoot_locs_all_, peak_heights_all_, peak_fw_all_, peak_indices_all_, sweep_indices_all_ = ap_characteristics_extractor_main(V_dataframe, all_sweeps=True)

    # Create dataframe and append values
    pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height', 'AP_latency'])
    non_nan_locs = np.where(~np.isnan(v_thresholds_all_))[0]

    if len(non_nan_locs) == 0:
        pAD_df["AP_loc"] = peak_locs_corr_
        pAD_df["AP_threshold"] = v_thresholds_all_
        pAD_df["AP_slope"] = peak_slope_all_
        pAD_df["AP_latency"] = peak_latencies_all_
        pAD_df["AP_height"] = peak_heights_all_
        pAD_df['AP_sweep_num'] = sweep_indices_all_
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_heights_all_, pAD_df, None

    peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
    v_thresholds_all   = np.array(v_thresholds_all_)[non_nan_locs]
    peak_slope_all     = np.array(peak_slope_all_)[non_nan_locs]
    peak_locs_corr     = np.array(peak_locs_corr_)[non_nan_locs]
    upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
    peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
    peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
    peak_indices_all   = np.array(peak_indices_all_)[non_nan_locs]
    sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

    assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all) == len(sweep_indices_all)

    pAD_df["AP_loc"] = peak_locs_corr
    pAD_df["AP_threshold"] = v_thresholds_all
    pAD_df["AP_slope"] = peak_slope_all
    pAD_df["AP_latency"] = peak_latencies_all
    pAD_df["AP_height"] = peak_heights_all
    pAD_df['AP_sweep_num'] = sweep_indices_all

    pAD_class_labels = np.array(["pAD", "Somatic"])

    if len(peak_locs_corr) == 0:
        print("No APs found in trace")
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df, None
    elif len(peak_locs_corr) == 1:
        if v_thresholds_all[0] < -65.0:
            # pAD
            pAD_df["pAD"] = "pAD"
            pAD_df["pAD_count"] = 1
        else:
            pAD_df["pAD"] = "Somatic"
            pAD_df["pAD_count"] = 0
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df, None
    else:
        pass
```
```python
def beta_pAD_detection(V_dataframe):
    '''
    Input :
        V_dataframe   : np.array : Voltage dataframe as np array
    Output :
        ap_slopes     : list of lists containing action potential slopes (num of spikes x sweeps analysed)
        ap_thresholds : list of lists containing AP thresholds or turning points of the spike (num of spikes x sweeps analysed)
        ap_width      :
        ap_height     :
    '''
    # ESTABLISHED CONVENTION: !!! pAD label == 1 !!!

    ### STEP 1 : Get the action potential (AP) values and create pAD dataframe with columns and values
    peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_locs_corr_, upshoot_locs_all_, peak_heights_all_, peak_fw_all_, peak_indices_all_, sweep_indices_all_ = ap_characteristics_extractor_main(V_dataframe, all_sweeps=True)

    # Create dataframe and append values
    pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height', 'AP_latency'])
    non_nan_locs = np.where(~np.isnan(v_thresholds_all_))[0]

    if len(non_nan_locs) == 0:
        pAD_df["AP_loc"] = peak_locs_corr_
        pAD_df["AP_threshold"] = v_thresholds_all_
        pAD_df["AP_slope"] = peak_slope_all_
        pAD_df["AP_latency"] = peak_latencies_all_
        pAD_df["AP_height"] = peak_heights_all_
        pAD_df['AP_sweep_num'] = sweep_indices_all_
        pAD_df['upshoot_loc'] = upshoot_locs_all_
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_heights_all_, pAD_df

    peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
    v_thresholds_all   = np.array(v_thresholds_all_)[non_nan_locs]
    peak_slope_all     = np.array(peak_slope_all_)[non_nan_locs]
    peak_locs_corr     = np.array(peak_locs_corr_)[non_nan_locs]
    upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
    peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
    peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
    peak_indices_all   = np.array(peak_indices_all_)[non_nan_locs]
    sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

    assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all) == len(sweep_indices_all)

    pAD_df["AP_loc"] = peak_locs_corr
    pAD_df["upshoot_loc"] = upshoot_locs_all
    pAD_df["AP_threshold"] = v_thresholds_all
    pAD_df["AP_slope"] = peak_slope_all
    pAD_df["AP_latency"] = peak_latencies_all
    pAD_df["AP_height"] = peak_heights_all
    pAD_df['AP_sweep_num'] = sweep_indices_all

    pAD_class_labels = np.array(["pAD", "Somatic"])

    if len(peak_locs_corr) == 0:
        print("No APs found in trace")
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df
    elif len(peak_locs_corr) == 1:
        if v_thresholds_all[0] < -65.0:
            # pAD
            pAD_df["pAD"] = "pAD"
            pAD_df["pAD_count"] = 1
        else:
            pAD_df["pAD"] = "Somatic"
            pAD_df["pAD_count"] = 0
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df
    else:
        pass

    # STEP 2 : Make all labels for upshoot potential < -65 mV = pAD
    pAD_df_pAD_via_threshold       = pAD_df[pAD_df["AP_threshold"] < -65.0]
    pAD_df_uncertain_via_threshold = pAD_df[pAD_df["AP_threshold"] > -65.0]
    pAD_df_mod = pd.concat([pAD_df_pAD_via_threshold, pAD_df_uncertain_via_threshold], axis=0)
    # pAD_df["AP_loc"][np.where(pAD_df["AP_threshold"] < -65)[0]]

    # Data for fitting procedure
    X = pAD_df_mod[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

    # Check if 2 clusters are a better fit than 1
    kmeans_1 = KMeans(n_clusters=1, n_init=1).fit(X)
    kmeans_2 = KMeans(n_clusters=2, n_init=1).fit(X)
    wcss_1 = kmeans_1.inertia_
    wcss_2 = kmeans_2.inertia_

    true_labels_combined = None
    # print(type(kmeans_2.labels_))
    # print(wcss_1, wcss_2)

    if wcss_2 < wcss_1:
        # 2 CLUSTERS are a better fit than 1
        true_labels_combined = kmeans_2.labels_
        # Append label column to type
        # print("Mean Voltage Threshold of Cluster 1")
        # print(np.nanmean(v_thresholds_all[kmeans_2.labels_ == 0]))
        # print("Mean Voltage Threshold of Cluster 2")
        # print(np.nanmean(v_thresholds_all[kmeans_2.labels_ == 1]))
        # print(np.nanmean(peak_slope_all[kmeans_2.labels_ == 0]), np.nanmean(peak_slope_all[kmeans_2.labels_ == 1]))
        if np.nanmean(v_thresholds_all[kmeans_2.labels_ == 0]) < np.nanmean(v_thresholds_all[kmeans_2.labels_ == 1]):
            pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
            pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)
        else:
            pAD_df["pAD"] = pAD_class_labels[np.mod(np.array(kmeans_2.labels_ + 1), 2)]
            pAD_df["pAD_count"] = np.sum(np.mod(np.array(kmeans_2.labels_ + 1), 2))
        pAD_df["num_ap_clusters"] = 2
    else:
        # JUST 1 CLUSTER
        true_labels_combined = kmeans_1.labels_
        if np.nanmean(v_thresholds_all) <= -65.0:
            print("check file all APs seems pAD")
            pAD_df["pAD"] = "pAD"
            pAD_df["pAD_count"] = len(v_thresholds_all)
        else:
            pAD_df["pAD"] = "Somatic"
            pAD_df["num_ap_clusters"] = 1
            pAD_df["pAD_count"] = 0

    # Do the same but for the uncertain part
    X = pAD_df_uncertain_via_threshold[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

    # Check if 2 clusters are a better fit than 1
    kmeans_1 = KMeans(n_clusters=1, n_init=1).fit(X)
    kmeans_2 = KMeans(n_clusters=2, n_init=1).fit(X)
    wcss_1 = kmeans_1.inertia_
    wcss_2 = kmeans_2.inertia_

    # print(type(kmeans_2.labels_))
    # print(wcss_1, wcss_2)
    if wcss_2 < wcss_1:
        true_labels_split = kmeans_2.labels_
    else:
        true_labels_split = kmeans_1.labels_

    # Print % overlap
    flip = False
    if pAD_df_pAD_via_threshold.shape[0] > 0:
        if true_labels_combined[0] == 1:
            pass
        else:
            flip = True

    true_labels_ = np.hstack((np.ones(pAD_df_pAD_via_threshold.shape[0]), true_labels_split))
    if flip:
        true_labels_combined = 1 - true_labels_combined

    print(sum(true_labels_ == true_labels_combined) / len(true_labels_combined), pAD_df_pAD_via_threshold.shape[0] / pAD_df.shape[0])
    return true_labels_, true_labels_split, true_labels_combined
```
```python
def pAD_detection(V_dataframe):
    '''
    Input :
        V_dataframe   : np.array : Voltage dataframe as np array
    Output :
        ap_slopes     : list of lists containing action potential slopes (num of spikes x sweeps analysed)
        ap_thresholds : list of lists containing AP thresholds or turning points of the spike (num of spikes x sweeps analysed)
        ap_width      :
        ap_height     :
    '''
    ### STEP 1 : Get the action potential (AP) values and create pAD dataframe with columns and values
    peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_locs_corr_, upshoot_locs_all_, peak_heights_all_, peak_fw_all_, peak_indices_all_, sweep_indices_all_ = ap_characteristics_extractor_main(V_dataframe, all_sweeps=True)

    # Create dataframe and append values
    pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height', 'AP_latency'])
    non_nan_locs = np.where(~np.isnan(v_thresholds_all_))[0]

    if len(non_nan_locs) == 0:
        pAD_df["AP_loc"] = peak_locs_corr_
        pAD_df["AP_threshold"] = v_thresholds_all_
        pAD_df["AP_slope"] = peak_slope_all_
        pAD_df["AP_latency"] = peak_latencies_all_
        pAD_df["AP_height"] = peak_heights_all_
        pAD_df['AP_sweep_num'] = sweep_indices_all_
        pAD_df['upshoot_loc'] = upshoot_locs_all_
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all_, v_thresholds_all_, peak_slope_all_, peak_heights_all_, pAD_df

    peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
    v_thresholds_all   = np.array(v_thresholds_all_)[non_nan_locs]
    peak_slope_all     = np.array(peak_slope_all_)[non_nan_locs]
    peak_locs_corr     = np.array(peak_locs_corr_)[non_nan_locs]
    upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
    peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
    peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
    peak_indices_all   = np.array(peak_indices_all_)[non_nan_locs]
    sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

    assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all) == len(sweep_indices_all)

    pAD_df["AP_loc"] = peak_locs_corr
    pAD_df["upshoot_loc"] = upshoot_locs_all
    pAD_df["AP_threshold"] = v_thresholds_all
    pAD_df["AP_slope"] = peak_slope_all
    pAD_df["AP_latency"] = peak_latencies_all
    pAD_df["AP_height"] = peak_heights_all
    pAD_df['AP_sweep_num'] = sweep_indices_all

    pAD_class_labels = np.array(["pAD", "Somatic"])

    if len(peak_locs_corr) == 0:
        print("No APs found in trace")
        pAD_df["pAD"] = ""
        pAD_df["pAD_count"] = np.nan
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df
    elif len(peak_locs_corr) == 1:
        if v_thresholds_all[0] < -65.0:
            # pAD
            pAD_df["pAD"] = "pAD"
            pAD_df["pAD_count"] = 1
        else:
            pAD_df["pAD"] = "Somatic"
            pAD_df["pAD_count"] = 0
        return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df
    else:
        pass

    # Data for fitting procedure
    X = pAD_df[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

    # Check if 2 clusters are a better fit than 1
    kmeans_1 = KMeans(n_clusters=1, n_init=1).fit(X)
    kmeans_2 = KMeans(n_clusters=2, n_init=1).fit(X)
    wcss_1 = kmeans_1.inertia_
    wcss_2 = kmeans_2.inertia_

    # print(type(kmeans_2.labels_))
    # print(wcss_1, wcss_2)
    if wcss_2 < wcss_1:
        # 2 CLUSTERS are a better fit than 1
        # Append label column to type
        # print("Mean Voltage Threshold of Cluster 1")
        # print(np.nanmean(v_thresholds_all[kmeans_2.labels_ == 0]))
        # print("Mean Voltage Threshold of Cluster 2")
        # print(np.nanmean(v_thresholds_all[kmeans_2.labels_ == 1]))
        # print(np.nanmean(peak_slope_all[kmeans_2.labels_ == 0]), np.nanmean(peak_slope_all[kmeans_2.labels_ == 1]))
        if np.nanmean(v_thresholds_all[kmeans_2.labels_ == 0]) < np.nanmean(v_thresholds_all[kmeans_2.labels_ == 1]):
            pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
            pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)
        else:
            pAD_df["pAD"] = pAD_class_labels[np.mod(np.array(kmeans_2.labels_ + 1), 2)]
            pAD_df["pAD_count"] = np.sum(np.mod(np.array(kmeans_2.labels_ + 1), 2))
        pAD_df["num_ap_clusters"] = 2
    else:
        # JUST 1 CLUSTER
        if np.nanmean(v_thresholds_all) <= -65.0:
            print("check file all APs seems pAD")
            pAD_df["pAD"] = "pAD"
            pAD_df["pAD_count"] = len(v_thresholds_all)
        else:
            pAD_df["pAD"] = "Somatic"
            pAD_df["num_ap_clusters"] = 1
            pAD_df["pAD_count"] = 0

    return peak_latencies_all, v_thresholds_all, peak_slope_all, peak_heights_all, pAD_df
```
Testing the functions on the dataset revealed that the peak_slope calculation is inaccurate: it sometimes returns negative values. This happens due to noise, so we need a smoothed gradient; I am readapting the code accordingly. The peak widths seem good though, max around 2.4 ms or so (although the clamp is set at 2 ms, the latency that the width falls back to was in that case itself higher than 2 ms).
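A minimal sketch of what a smoothed-gradient slope could look like, assuming a Savitzky-Golay filter; the helper name, window length, polynomial order, and sampling rate are my assumptions, not the fix actually pushed:

```python
import numpy as np
from scipy.signal import savgol_filter

def smoothed_slope(v, upshoot_idx, fs=1e4, window=11, polyorder=3):
    """Estimate dV/dt at the upshoot on a Savitzky-Golay-smoothed trace,
    so a single noisy sample cannot flip the sign of the slope."""
    v_smooth = savgol_filter(v, window_length=window, polyorder=polyorder)
    return np.gradient(v_smooth)[upshoot_idx] * fs  # mV per second

# On a noisy rising ramp the raw point-to-point slope can go negative,
# while the smoothed estimate is far more stable on the upstroke.
rng = np.random.default_rng(0)
v_noisy = np.linspace(-70, -40, 200) + rng.normal(0, 0.2, 200)
slope = smoothed_slope(v_noisy, upshoot_idx=100)
```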
Update 1: slope calculation
peak_slope corrected; some values were still negative even after smoothing. You can jump back in and test slope and width at this point.
Update 2: for pAD_detection(), a new error has cropped up; still debugging, not ready yet.
```
IndexError                                Traceback (most recent call last)
/Users/owner/Desktop/IGOR_phd/debug.ipynb Cell 4 line 2
     20 peak_slopes_master += peak_slope_all
     21 peak_fw_master += peak_fw_all
---> 24 peak_latencies_all , v_thresholds_all , peak_slope_all , pAD_df = pAD_detection(V_array)

File ~/Desktop/IGOR_phd/module/action_potential_functions.py:805, in pAD_detection(V_dataframe)
    793 wcss_2 = kmeans_2.inertia_
    796 if wcss_2 < wcss_1 :
    797     # 2 CLUSTERS are a better fit than 1
    798     # Append label column to type
    (...)
    802     #print(np.nanmean( v_thresholds_all [kmeans_2.labels_ == 1] ))
    803     #print(np.nanmean( peak_slope_all [kmeans_2.labels_ == 0] ) , np.nanmean( peak_slope_all [kmeans_2.labels_ == 1] ))
--> 805 if np.nanmean( v_thresholds_all [kmeans_2.labels_ == 0] ) < np.nanmean( v_thresholds_all [kmeans_2.labels_ == 1] ):
    806     pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
    807     pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 531 but corresponding boolean dimension is 530
```
pAD_detector is cleaned now and working with push "93263a2"; ran it through the entire dataset. Will check the voltage threshold distributions as a test: i.e. what is the distribution of Somatic spikes? Are any still below -65 mV? And what about the thresholds of detected (but not ground truth, i.e. not set via threshold) pAD spikes?
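That check could be sketched roughly as below, assuming pAD_df carries the 'pAD' and 'AP_threshold' columns shown above; the helper name and the toy data are my own illustration:

```python
import pandas as pd

def threshold_sanity_check(pAD_df, cutoff=-65.0):
    """Summarise AP_threshold per class and count 'Somatic' spikes whose
    threshold is still below the cutoff (candidate misclassifications)."""
    summary = pAD_df.groupby("pAD")["AP_threshold"].describe()
    somatic_below = int(((pAD_df["pAD"] == "Somatic")
                         & (pAD_df["AP_threshold"] < cutoff)).sum())
    return summary, somatic_below

# Toy frame mimicking the detector's output
df = pd.DataFrame({
    "pAD": ["pAD", "pAD", "Somatic", "Somatic"],
    "AP_threshold": [-72.0, -70.5, -48.0, -66.0],
})
summary, n_suspect = threshold_sanity_check(df)  # here the -66 mV Somatic spike is flagged
```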
As the above comment was unclear:
When I made the last comment, peak_slope was being set to 0 wherever negative slopes appeared in the data at the point of upshoot. That was later corrected to np.nan, as a value of 0 would bias the analysis. This then threw an error in the pAD detection, because k-means clustering cannot handle NaN values, so such occurrences are now dropped in the pAD detection. With the latest push, pAD ran through the entire dataset.
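The NaN-dropping step described above might look like this in isolation (a sketch with toy data, assuming the feature columns used by pAD_detection(); not the exact repo code):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

feature_cols = ["AP_slope", "AP_threshold", "AP_height", "AP_latency"]
df = pd.DataFrame({
    "AP_slope":     [120.0, np.nan, 95.0, 110.0],  # NaN where slope was negative
    "AP_threshold": [-70.0, -50.0, -48.0, -71.0],
    "AP_height":    [80.0, 75.0, 70.0, 82.0],
    "AP_latency":   [5.0, 20.0, 22.0, 4.0],
})

# KMeans cannot handle NaNs, so drop those rows before fitting ...
X = df[feature_cols].dropna()
km = KMeans(n_clusters=2, n_init=10).fit(X)

# ... and keep the labels aligned to the surviving rows only; indexing the
# original-length arrays with these labels is exactly the 531-vs-530
# boolean-dimension IndexError seen earlier.
labels = pd.Series(km.labels_, index=X.index)
```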
On this latest pAD_detection:
Issue opened for looking at the current pAD_detection; the issue has three parts