jjb-hub / IGOR_phd


pAD detector() Refinement #36

Closed debapratimj closed 4 months ago

debapratimj commented 7 months ago

Issue for reviewing the current pAD_detection. The issue has three parts:

  1. What factors are helping us the most with detection? Do other factors help at all?
  2. When and why does the pAD detection fail?
  3. Plan for refinement
debapratimj commented 7 months ago

From old pAD Detection notes

Point #2

Fringe cases (pAD): TLX221229a, TLX230416b, TLX230511b, TLX230530b
Clear cases: TLX221230a, TLX230411a, TLX230416c, TLX230524b, TLX230524c, TLX230622a
BAD DETECTION (no pAD): TLX230511a

debapratimj commented 7 months ago

[Attached image: WhatsApp Image 2024-01-22 at 00:55:26 (plan sketch)]

Plan: Mon, Tue 22.1 - 23.1 (2 days) for Point #1: looking at how helpful the other factors are for classifying. If they do not improve classification, we may drop them and instead use the phase portrait method.

Deliverable:

  1. Updated pAD_detection code which only accounts for the most helpful factors.
  2. Report on performance with and without the main factor(s); if we use phase portraits, we will then move on to that (see attached picture plan)
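For reference, the phase portrait method mentioned in the plan amounts to plotting dV/dt against V, where pAD and somatic spikes trace different loops. A minimal sketch of computing the coordinates; the 0.05 ms sampling interval here is an assumed placeholder, not necessarily the repo's actual dt:

```python
import numpy as np

def phase_portrait_coords(v, dt_ms=0.05):
    # (V, dV/dt) pairs for a voltage trace; dt_ms is an assumed
    # sampling interval, not necessarily the repo's.
    dvdt = np.gradient(v, dt_ms)  # mV per ms
    return np.asarray(v), dvdt

# Toy trace: one spike-like bump riding on a -70 mV baseline
t = np.arange(0.0, 10.0, 0.05)
v = -70.0 + 100.0 * np.exp(-((t - 5.0) ** 2) / 0.1)
v_pts, dvdt_pts = phase_portrait_coords(v)
```

Plotting `dvdt_pts` against `v_pts` per spike would give the portraits referenced in the attached plan.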

pAD_detection(V_dataframe):
'''
Input : 
        V_dataframe : np.array : Voltage dataframe as np array

Current Output : 
        peak_latencies_all
        v_thresholds_all
        peak_slope_all
        peak_heights_all
        pAD_df
'''

Apart from the labels that we assign to spikes, I think the function should also return a confidence measure for the classification, to help assess fringe cases and where the detector fails (or probably will). Will include that in the code on 21/22.
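One simple form such a confidence measure could take, sketched as a hypothetical helper (the cutoff and scale values are placeholders, not values from the repo): map the distance of each AP threshold from the -65 mV cutoff through a sigmoid, so scores near 0.5 flag fringe cases.

```python
import numpy as np

def pAD_confidence(ap_thresholds, cutoff_mV=-65.0, scale_mV=2.0):
    # Hypothetical score in [0, 1]: how confidently an AP threshold sits
    # below the -65 mV cutoff. cutoff_mV / scale_mV are placeholder values.
    x = (np.asarray(ap_thresholds, dtype=float) - cutoff_mV) / scale_mV
    return 1.0 / (1.0 + np.exp(x))

# clearly pAD -> near 1, borderline -> 0.5, clearly somatic -> near 0
conf = pAD_confidence([-80.0, -65.0, -50.0])
```

A column like this could sit alongside the pAD label in pAD_df, making the fringe cases listed above easy to pull out with a simple filter.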
jjb-hub commented 7 months ago

[Screenshot 2024-01-23 at 12:49:04] Why is this function still here? What difference is there between this and the new one?

jjb-hub commented 7 months ago

[Screenshot 2024-01-23 at 12:50:19] Same with this: what is it and why is it here? I am having a lot of trouble cleaning your functions in action_potential_functions.py. @debapratimj please assist ASAP.

debapratimj commented 7 months ago

To https://github.com/jjb-hub/IGOR_phd.git 6cc0b52..5b9a2a6 pAD-detector-refinement -> pAD-detector-refinement

Cleaned up ap_characteristics_subroutine_linear as it is no longer used, fixed peak_slope to the true derivative value around the upshoot, and bounded ap_width (FWHM) to less than 2 ms.
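The FWHM bounding could be sketched stand-alone like this; `bounded_fwhm` is an illustrative helper, not the repo's implementation (the real computation lives inside ap_characteristics_extractor_main), and the sampling step is assumed:

```python
import numpy as np

def bounded_fwhm(v, dt_ms, max_width_ms=2.0):
    # Full width at half maximum of a single spike, clamped to 2 ms.
    half = np.min(v) + (np.max(v) - np.min(v)) / 2.0
    above = np.where(v >= half)[0]
    if len(above) == 0:
        return np.nan
    return min((above[-1] - above[0]) * dt_ms, max_width_ms)

t = np.arange(0.0, 10.0, 0.05)
narrow = -70.0 + 100.0 * np.exp(-((t - 5.0) ** 2) / 0.5)  # FWHM ~1.1 ms
broad  = -70.0 + 100.0 * np.exp(-((t - 5.0) ** 2) / 5.0)  # FWHM ~3.7 ms
```

The narrow spike passes through unchanged, while the broad one is clamped to the 2 ms bound.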

debapratimj commented 7 months ago

pAD: redundant code kept here for my book-keeping; the following push will have just one pAD_detection() function. (In future we can modularise the function into classifiers and other parts.)

DJ: this is here so we do not lose your plotting settings for PCA, k-means and histograms, even though they won't run for me. Each needs to be its own simple func like buildMeanAPFig()

and they should be in plotters

def pAD_detection_old(V_dataframe, pca_plotting = False , kmeans_plotting = False , histogram_plotting=False):

'''
Input : 
        V_dataframe   : np.array : Voltage dataframe as  np array

Output : 
        ap_slopes     : list of lists containing action_potential slopes (num of spikes x sweeps analysed)
        ap_thresholds : list of lists containing ap_thresholds or turning points of the spike (num of spikes x sweeps analysed)
        ap_width      : list of lists containing peak full widths at half maximum (FWHM)
        ap_height     : list of lists containing peak heights
'''

peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  , peak_locs_corr_, upshoot_locs_all_ , peak_heights_all_  , peak_fw_all_ ,  peak_indices_all_, sweep_indices_all_  =  ap_characteristics_extractor_main(V_dataframe, all_sweeps =True)

# Create Dataframe and append values

pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height' , 'AP_latency'])

non_nan_locs =  np.where(np.isnan(v_thresholds_all_) == False)[0] 

if len (non_nan_locs) == 0 : 
    pAD_df["AP_loc"] = peak_locs_corr_ 
    pAD_df["AP_threshold"] =  v_thresholds_all_
    pAD_df["AP_slope"] = peak_slope_all_
    pAD_df["AP_latency"] = peak_latencies_all_
    pAD_df["AP_height"] =  peak_heights_all_ 
    pAD_df['AP_sweep_num'] = sweep_indices_all_
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  ,  peak_heights_all_ , pAD_df, None

peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
v_thresholds_all   = np.array(v_thresholds_all_ )[non_nan_locs]
peak_slope_all     = np.array(peak_slope_all_  )[non_nan_locs]
peak_locs_corr     = np.array(peak_locs_corr_ )[non_nan_locs]
upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
peak_indices_all    = np.array(peak_indices_all_)[non_nan_locs]
sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all)== len(sweep_indices_all) 

pAD_df["AP_loc"] = peak_locs_corr 
pAD_df["AP_threshold"] =  v_thresholds_all 
pAD_df["AP_slope"] = peak_slope_all 
pAD_df["AP_latency"] = peak_latencies_all
pAD_df["AP_height"] =  peak_heights_all 
pAD_df['AP_sweep_num'] = sweep_indices_all
pAD_class_labels = np.array(["pAD" , "Somatic"] )

if len (peak_locs_corr) == 0 :
    print("No APs found in trace")
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return   peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df, None 
elif len (peak_locs_corr) == 1 : 
    if v_thresholds_all[0]  < -65.0 :
        # pAD 
        pAD_df["pAD"] = "pAD"
        pAD_df["pAD_count"] = 1 
    else: 
        pAD_df["pAD"]       = "Somatic"
        pAD_df["pAD_count"] = 0 
    return peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df, None
else : 
    pass

def beta_pAD_detection(V_dataframe):
'''
Input : 
        V_dataframe   : np.array : Voltage dataframe as np array

Output : 
        ap_slopes     : list of lists containing action_potential slopes (num of spikes x sweeps analysed)
        ap_thresholds : list of lists containing ap_thresholds or turning points of the spike (num of spikes x sweeps analysed)
        ap_width      : list of lists containing peak full widths at half maximum (FWHM)
        ap_height     : list of lists containing peak heights
'''

#  ESTABLISHED CONVENTION:  !!! pAD label == 1   !!!    

### STEP 1 :  Get the action potential (AP) values and create pAD dataframe with columns and values 

peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  , peak_locs_corr_, upshoot_locs_all_ , peak_heights_all_  , peak_fw_all_ ,  peak_indices_all_, sweep_indices_all_  =  ap_characteristics_extractor_main(V_dataframe, all_sweeps =True)

# Create Dataframe and append values
pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height' , 'AP_latency'])

non_nan_locs =  np.where(np.isnan(v_thresholds_all_) == False)[0] 

if len (non_nan_locs) == 0 : 
    pAD_df["AP_loc"] = peak_locs_corr_ 
    pAD_df["AP_threshold"] =  v_thresholds_all_
    pAD_df["AP_slope"] = peak_slope_all_
    pAD_df["AP_latency"] = peak_latencies_all_
    pAD_df["AP_height"] =  peak_heights_all_ 
    pAD_df['AP_sweep_num'] = sweep_indices_all_
    pAD_df['upshoot_loc']  = upshoot_locs_all_ 
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  ,  peak_heights_all_ , pAD_df

peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
v_thresholds_all   = np.array(v_thresholds_all_ )[non_nan_locs]
peak_slope_all     = np.array(peak_slope_all_  )[non_nan_locs]
peak_locs_corr     = np.array(peak_locs_corr_ )[non_nan_locs]
upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
peak_indices_all    = np.array(peak_indices_all_)[non_nan_locs]
sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all)== len(sweep_indices_all) 

pAD_df["AP_loc"] = peak_locs_corr 
pAD_df["upshoot_loc"] =  upshoot_locs_all   
pAD_df["AP_threshold"] =  v_thresholds_all 
pAD_df["AP_slope"] = peak_slope_all 
pAD_df["AP_latency"] = peak_latencies_all
pAD_df["AP_height"] =  peak_heights_all 
pAD_df['AP_sweep_num'] = sweep_indices_all
pAD_class_labels = np.array(["pAD" , "Somatic"] )

if len (peak_locs_corr) == 0 :
    print("No APs found in trace")
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return   peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df
elif len (peak_locs_corr) == 1 : 
    if v_thresholds_all[0]  < -65.0 :
        # pAD 
        pAD_df["pAD"] = "pAD"
        pAD_df["pAD_count"] = 1 
    else: 
        pAD_df["pAD"]       = "Somatic"
        pAD_df["pAD_count"] = 0 
    return peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df
else : 
    pass

# STEP 2 : Make all labels for upshoot potential < -65 mV   = pAD  

pAD_df_pAD_via_threshold = pAD_df[pAD_df[ "AP_threshold"]  < -65.0 ]
pAD_df_uncertain_via_threshold = pAD_df[pAD_df[ "AP_threshold"]  > -65.0 ]

pAD_df_mod = pd.concat([ pAD_df_pAD_via_threshold , pAD_df_uncertain_via_threshold] , axis = 0) 

#pAD_df["AP_loc"][np.where(pAD_df["AP_threshold"] < - 65)[0]]

# Data for fitting procedure

X = pAD_df_mod[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

# Check if 2 clusters are better than 1 

kmeans_1 = KMeans(n_clusters=1, n_init = 1).fit(X)
kmeans_2 = KMeans(n_clusters=2, n_init = 1 ).fit(X)

wcss_1 = kmeans_1.inertia_
wcss_2 = kmeans_2.inertia_

true_labels_combined = None 

#print(type(kmeans_2.labels_))
#print(wcss_1, wcss_2)

if wcss_2 < wcss_1 :
    true_labels_combined = kmeans_2.labels_
    # 2 CLUSTERS  are a better fit than 1  
    # Append label column to type 
    #print("Mean Voltage Threshold of Cluster 1")
    #print(np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 0] ))
    #print("Mean Voltage Threshold of Cluster 2")
    #print(np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 1] ))
    #print(np.nanmean( peak_slope_all  [kmeans_2.labels_ == 0] ) , np.nanmean( peak_slope_all  [kmeans_2.labels_ == 1] ))

    if np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 0] )  < np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 1] ):
        pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
        pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)
    else:
        pAD_df["pAD"] = pAD_class_labels[np.mod( np.array( kmeans_2.labels_ + 1 )     , 2  )]
        pAD_df["pAD_count"] = np.sum(np.mod( np.array( kmeans_2.labels_ + 1 )     , 2  ))

    pAD_df["num_ap_clusters"] = 2 

else :
     true_labels_combined = kmeans_1.labels_
     # JUST 1 CLUSTER
     if np.nanmean(v_thresholds_all) <= -65.0 :
         print("check file: all APs seem pAD")
         pAD_df["pAD"] = "pAD"
         pAD_df["pAD_count"] = len(v_thresholds_all)
     else :
         pAD_df["pAD"] = "Somatic"
         pAD_df["pAD_count"] = 0   # zero only in the somatic branch, so the pAD count above is not overwritten
     pAD_df["num_ap_clusters"] = 1

# Do the same but for the uncertain bit 

X = pAD_df_uncertain_via_threshold [["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

# Check if 2 clusters are better than 1 

kmeans_1 = KMeans(n_clusters=1, n_init = 1).fit(X)
kmeans_2 = KMeans(n_clusters=2, n_init = 1 ).fit(X)

wcss_1 = kmeans_1.inertia_
wcss_2 = kmeans_2.inertia_

true_labels_split = None 

#print(type(kmeans_2.labels_))
#print(wcss_1, wcss_2)

if wcss_2 < wcss_1 :
    true_labels_split  = kmeans_2.labels_

else :
     true_labels_split = kmeans_1.labels_

# Print % overlap 
flip = False 

if pAD_df_pAD_via_threshold.shape[0] > 0 : 
    if true_labels_combined[0] == 1 :
        pass
    else: 
        flip = True 

true_labels_ = np.hstack((np.ones(pAD_df_pAD_via_threshold.shape[0])  ,  true_labels_split  ) )

if flip :
    true_labels_combined = 1 - true_labels_combined

print(sum( true_labels_  == true_labels_combined  )  / len(  true_labels_combined   )   , pAD_df_pAD_via_threshold.shape[0] /  pAD_df.shape[0]  )

return true_labels_, true_labels_split, true_labels_combined 

def pAD_detection(V_dataframe):
'''
Input : 
        V_dataframe        : np.array : Voltage dataframe as np array

Output : 
        peak_latencies_all : peak latencies of spikes with a valid (non-NaN) threshold
        v_thresholds_all   : voltage thresholds, i.e. turning points of each spike
        peak_slope_all     : slope at the upshoot of each spike
        peak_heights_all   : height of each spike
        pAD_df             : per-spike dataframe carrying the pAD / Somatic labels
'''

### STEP 1 :  Get the action potential (AP) values and create pAD dataframe with columns and values 

peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  , peak_locs_corr_, upshoot_locs_all_ , peak_heights_all_  , peak_fw_all_ ,  peak_indices_all_, sweep_indices_all_  =  ap_characteristics_extractor_main(V_dataframe, all_sweeps =True)

# Create Dataframe and append values
pAD_df = pd.DataFrame(columns=['pAD', 'AP_loc', 'AP_sweep_num', 'AP_slope', 'AP_threshold', 'AP_height' , 'AP_latency'])

non_nan_locs =  np.where(np.isnan(v_thresholds_all_) == False)[0] 

if len (non_nan_locs) == 0 : 
    pAD_df["AP_loc"] = peak_locs_corr_ 
    pAD_df["AP_threshold"] =  v_thresholds_all_
    pAD_df["AP_slope"] = peak_slope_all_
    pAD_df["AP_latency"] = peak_latencies_all_
    pAD_df["AP_height"] =  peak_heights_all_ 
    pAD_df['AP_sweep_num'] = sweep_indices_all_
    pAD_df['upshoot_loc']  = upshoot_locs_all_ 
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return peak_latencies_all_ , v_thresholds_all_  , peak_slope_all_  ,  peak_heights_all_ , pAD_df

peak_latencies_all = np.array(peak_latencies_all_)[non_nan_locs]
v_thresholds_all   = np.array(v_thresholds_all_ )[non_nan_locs]
peak_slope_all     = np.array(peak_slope_all_  )[non_nan_locs]
peak_locs_corr     = np.array(peak_locs_corr_ )[non_nan_locs]
upshoot_locs_all   = np.array(upshoot_locs_all_)[non_nan_locs]
peak_heights_all   = np.array(peak_heights_all_)[non_nan_locs]
peak_fw_all        = np.array(peak_fw_all_)[non_nan_locs]
peak_indices_all    = np.array(peak_indices_all_)[non_nan_locs]
sweep_indices_all  = np.array(sweep_indices_all_)[non_nan_locs]

assert len(peak_latencies_all) == len(peak_slope_all) == len(v_thresholds_all) == len(peak_locs_corr) == len(upshoot_locs_all) == len(peak_heights_all) == len(peak_fw_all) == len(peak_indices_all)== len(sweep_indices_all) 

pAD_df["AP_loc"] = peak_locs_corr 
pAD_df["upshoot_loc"] =  upshoot_locs_all   
pAD_df["AP_threshold"] =  v_thresholds_all 
pAD_df["AP_slope"] = peak_slope_all 
pAD_df["AP_latency"] = peak_latencies_all
pAD_df["AP_height"] =  peak_heights_all 
pAD_df['AP_sweep_num'] = sweep_indices_all
pAD_class_labels = np.array(["pAD" , "Somatic"] )

if len (peak_locs_corr) == 0 :
    print("No APs found in trace")
    pAD_df["pAD"] = ""
    pAD_df["pAD_count"] = np.nan 

    return   peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df
elif len (peak_locs_corr) == 1 : 
    if v_thresholds_all[0]  < -65.0 :
        # pAD 
        pAD_df["pAD"] = "pAD"
        pAD_df["pAD_count"] = 1 
    else: 
        pAD_df["pAD"]       = "Somatic"
        pAD_df["pAD_count"] = 0 
    return peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df
else : 
    pass

# Data for fitting procedure

X = pAD_df[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]]

# Check if 2 clusters are better than 1 

kmeans_1 = KMeans(n_clusters=1, n_init = 1).fit(X)
kmeans_2 = KMeans(n_clusters=2, n_init = 1 ).fit(X)

wcss_1 = kmeans_1.inertia_
wcss_2 = kmeans_2.inertia_

#print(type(kmeans_2.labels_))
#print(wcss_1, wcss_2)

if wcss_2 < wcss_1 :
    # 2 CLUSTERS  are a better fit than 1  
    # Append label column to type 
    #print("Mean Voltage Threshold of Cluster 1")
    #print(np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 0] ))
    #print("Mean Voltage Threshold of Cluster 2")
    #print(np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 1] ))
    #print(np.nanmean( peak_slope_all  [kmeans_2.labels_ == 0] ) , np.nanmean( peak_slope_all  [kmeans_2.labels_ == 1] ))

    if np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 0] )  < np.nanmean( v_thresholds_all  [kmeans_2.labels_ == 1] ):
        pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
        pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)
    else:
        pAD_df["pAD"] = pAD_class_labels[np.mod( np.array( kmeans_2.labels_ + 1 )     , 2  )]
        pAD_df["pAD_count"] = np.sum(np.mod( np.array( kmeans_2.labels_ + 1 )     , 2  ))

    pAD_df["num_ap_clusters"] = 2 

else :
     # JUST 1 CLUSTER
     if np.nanmean(v_thresholds_all) <= -65.0 :
         print("check file: all APs seem pAD")
         pAD_df["pAD"] = "pAD"
         pAD_df["pAD_count"] = len(v_thresholds_all)
     else :
         pAD_df["pAD"] = "Somatic"
         pAD_df["pAD_count"] = 0   # zero only in the somatic branch; previously this ran unconditionally and overwrote the pAD count
     pAD_df["num_ap_clusters"] = 1

return peak_latencies_all , v_thresholds_all  , peak_slope_all  ,  peak_heights_all , pAD_df  
debapratimj commented 7 months ago

Testing the functions on the dataset revealed that the peak_slope calculation is inaccurate, as it sometimes returns negative values. This happens due to noise, so a smoothed gradient is needed; readapting the code accordingly. The peak widths seem good, though, with a max around 2.4 ms (although the clamp is set at 2 ms, the latency that the width falls back to was in that case itself higher than 2 ms).
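A smoothed gradient along these lines would suppress the spurious negative slopes; the box-car window and sampling step are placeholder choices, not necessarily the filter the repo ended up using:

```python
import numpy as np

def smoothed_upshoot_slope(v, upshoot_idx, dt_ms=0.05, win=11):
    # Box-car smooth the trace, then read dV/dt at the upshoot sample.
    # win and dt_ms are placeholder choices, not the repo's settings.
    v_smooth = np.convolve(v, np.ones(win) / win, mode="same")
    return np.gradient(v_smooth, dt_ms)[upshoot_idx]

# Clean rising ramp of ~3 mV/ms: smoothing leaves the interior slope intact
ramp = np.linspace(-70.0, -40.0, 200)
slope = smoothed_upshoot_slope(ramp, 100)
```

On noisy real traces the same call damps single-sample dips that would otherwise flip the sign of the raw gradient at the upshoot.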


debapratimj commented 7 months ago

Update 1: slope calculation

peak_slope corrected; some values were still negative even after smoothing. You can jump back in and test slope and width at this point.

Update 2, for pAD_detection(): a new error cropped up, currently debugging. Not ready yet.

IndexError                                Traceback (most recent call last)
/Users/owner/Desktop/IGOR_phd/debug.ipynb Cell 4 line 2
     20 peak_slopes_master += peak_slope_all
     21 peak_fw_master += peak_fw_all
---> 24 peak_latencies_all , v_thresholds_all , peak_slope_all , pAD_df = pAD_detection(V_array)

File ~/Desktop/IGOR_phd/module/action_potential_functions.py:805, in pAD_detection(V_dataframe)
    793 wcss_2 = kmeans_2.inertia_
    796 if wcss_2 < wcss_1 :
    797     # 2 CLUSTERS are a better fit than 1
    798     # Append label column to type
    (...)
    802     #print(np.nanmean( v_thresholds_all [kmeans_2.labels_ == 1] ))
    803     #print(np.nanmean( peak_slope_all [kmeans_2.labels_ == 0] ) , np.nanmean( peak_slope_all [kmeans_2.labels_ == 1] ))
--> 805     if np.nanmean( v_thresholds_all [kmeans_2.labels_ == 0] ) < np.nanmean( v_thresholds_all [kmeans_2.labels_ == 1] ):
    806         pAD_df["pAD"] = pAD_class_labels[np.array(kmeans_2.labels_)]
    807         pAD_df["pAD_count"] = np.sum(kmeans_2.labels_)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 531 but corresponding boolean dimension is 530
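The 531-vs-530 mismatch is characteristic of fitting k-means on rows with NaNs dropped while masking the unfiltered array with `labels_`. A minimal reproduction with invented toy values, plus the fix of indexing the same filtered view the labels were computed from:

```python
import numpy as np

# KMeans was fit on rows with NaNs dropped, so labels_ is shorter
# than the unfiltered v_thresholds_all. Toy values below are invented.
v_thresholds_all = np.array([-80.0, -60.0, np.nan, -75.0])
finite = ~np.isnan(v_thresholds_all)
labels = np.array([0, 1, 0])  # labels fit on the 3 finite rows only

# v_thresholds_all[labels == 0] would raise IndexError (4 vs 3).
# Fix: mask the same filtered view the labels were computed from.
v_fit = v_thresholds_all[finite]
mean_cluster0 = np.nanmean(v_fit[labels == 0])
```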

debapratimj commented 7 months ago

pAD_detector cleaned and now working with push 93263a2. Ran through the entire dataset; will check the voltage-threshold distributions as a test: i.e. what is the distribution of Somatic spikes? Are any still below -65 mV? And what about the thresholds of detected pAD spikes (as opposed to ground truth, i.e. those set via the threshold)?

debapratimj commented 7 months ago

As the above comment was unclear:

When I made the last comment, peak_slope was being set to 0 when negative slopes appeared in the data at the point of upshoot. That was later corrected to np.nan, as a value of 0 would bias the analysis. This then threw an error in the pAD detection, because k-means clustering cannot handle NaN values, so such occurrences were dropped in the pAD detection. With the latest push, pAD ran through the entire dataset.
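Dropping the NaN rows while keeping index alignment might look roughly like this; the column names match the pAD_df schema above, but the values and the `clustered` column are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy pAD_df-style frame with one NaN slope (a dropped occurrence)
df = pd.DataFrame({
    "AP_slope":     [120.0, np.nan, 95.0],
    "AP_threshold": [-72.0, -50.0, -48.0],
    "AP_height":    [80.0, 60.0, 62.0],
    "AP_latency":   [5.0, 30.0, 28.0],
})
X = df[["AP_slope", "AP_threshold", "AP_height", "AP_latency"]].dropna()
# Writing labels back via X.index keeps lengths aligned with df,
# avoiding the boolean-mask mismatch seen earlier.
df.loc[X.index, "clustered"] = True
```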

This latest pAD_detection :

  1. Correctly labels all APs with threshold below -65 mV as true pAD (checked and tested)
  2. Uses k-means to classify the remainder
  3. Still needs further refinement of the clustering method, for which I propose density-based clustering.
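A density-based sketch of what point 3 could look like, using scikit-learn's DBSCAN on invented toy features (the means, spreads, `eps` and `min_samples` are all illustrative and would need tuning on real AP data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Invented toy AP features: a tight pAD-like group and a looser
# somatic group in (threshold, slope) space.
rng = np.random.default_rng(0)
pad_like = rng.normal([-75.0, 150.0], [1.0, 5.0], size=(20, 2))
somatic  = rng.normal([-45.0, 90.0], [3.0, 10.0], size=(20, 2))
X = StandardScaler().fit_transform(np.vstack([pad_like, somatic]))

# eps / min_samples are placeholder values
labels = DBSCAN(eps=0.6, min_samples=4).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Unlike k-means, DBSCAN does not force a cluster count, and spikes it cannot place get label -1, which doubles as the confidence flag for fringe cases discussed earlier in this issue.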