Some bugs in `make_mne()` for nlx format

I ran into a few issues trying to load the .ncs files from neuralynx into an mne object. Attaching my notes and potential solutions.

1. `lfp`, `sr`, and `ch_name` are not defined on line 652:

https://github.com/seqasim/LFPAnalysis/blob/bacc1f801f03bf3747746d8421946f80d32bee8c/LFPAnalysis/lfp_preprocess_utils.py#L652-L654

Potential solution: add the following before the nsc_files loop, assuming lfp is a numpy array and the rest are lists (which is what mne.create_info() and mne.io.RawArray() require:

lfp = np.array([])
sr = []
ch_name = []

Then lines 652-654 should be replaced with something like (but see point 3 below for issue with lfp when channels are different sizes):

if lfp.size == 0:
    lfp = fdata['data']
else:
    lfp = np.vstack([lfp, fdata['data']])
sr.append(fdata['sampling_rate']) 
ch_name.append(str(ch_num))

2. Line 642 globs the all `.ncs` files in `load_path` but these might not match the pattern needed for the `ch_num` assignment:

https://github.com/seqasim/LFPAnalysis/blob/bacc1f801f03bf3747746d8421946f80d32bee8c/LFPAnalysis/lfp_preprocess_utils.py#L646

Potential solution 1: If ch_name should be referring to files that contain the pattern _0000.nsc to _9999.nsc (assuming this because this value is being converted to an integer), then maybe add something like this in place of line 642:

pattern = re.compile(r"_\d{4}\.ncs")  # regex pattern to match "_0000.ncs" to "_9999.ncs"
ncs_files = [x for x in glob(f'{load_path}/*.ncs') if re.search(pattern, x)]

If the case, then would also need to change line 646 to ch_num = int(chan_name[-4:])

Potential solution 2: If ch_name should be referring to the base name of each file (without the .nsc extension), then add chan_name.replace(".ncs","") and remove the int() function

3. `mne.create_info()` fails when there are multiple unique values in `sr` and `ch_type` is not defined:

https://github.com/seqasim/LFPAnalysis/blob/bacc1f801f03bf3747746d8421946f80d32bee8c/LFPAnalysis/lfp_preprocess_utils.py#L656-L657

Should ch_type='seeg' here?

Also, when there are multiple unique values in sr, appending to lfp fails in the loop because axis dimensions must be consistent.

Based on the above, here are my potential edits to the function (marked with ### SAR EDIT ###). I think the main issue is point 3 when there are multiple different sampling rates. (I also added a new save_path flag). I could submit a PR, but I'm not very certain about some of these.

def make_mne(load_path=None, elec_data=None, format='edf', save_path=None):
    """
    Make a mne object from the data and electrode files, and save out the photodiode. 
    Following this step, you can indicate bad electrodes manually.

    Parameters
    ----------
    load_path : str
        path to the neural data
    elec_data : pandas df 
        dataframe with all the electrode localization information
    format : str 
        how was this data collected? options: ['edf', 'nlx]

    Returns
    -------
    mne_data : mne object 
        mne object
    """

    # 1) load the data:
    if format=='edf':
        edf_file = glob(f'{load_path}/*.edf')[0]
        mne_data = mne.io.read_raw_edf(edf_file, preload=True)
    elif format =='nlx': 
        ### SAR EDIT ###
        pattern = re.compile(r"_\d{4}\.ncs")  # regex pattern to match "_0000.ncs" to "_9999.ncs"
        ncs_files = [x for x in glob(f'{load_path}/*.ncs') if re.search(pattern, x)]
        lfp = np.array([])
        sr = []
        ch_name = []
        ch_type = 'seeg'
        ### SAR EDIT ###
        for chan_path in ncs_files:
            # This is leftover from Iowa - let's see how channels are named in MSSM data
            chan_name = chan_path.split('/')[-1][:-4]
            ch_num = int(chan_name[-4:]) ### SAR EDIT ### <-- need to confirm with Salman what this is doing
            try:
                fdata = nlx_utils.load_ncs(chan_path)
            except IndexError: 
                print(f'No data in channel {chan_name}')
                continue

            ### SAR EDIT ###
            if lfp.size == 0:
                lfp = fdata['data']
            else:
                lfp = np.vstack([lfp, fdata['data']])
            sr.append(fdata['sampling_rate']) 
            ch_name.append(str(ch_num))
            ### SAR EDIT ###

            unix_time = fdata['time']

        ### SAR NOTE: this fails if sr has more than two values
        info = mne.create_info(ch_name, np.unique(sr), ch_type)
        mne_data = mne.io.RawArray(lfp, info)

    if format=='edf':
        # The electrode names read out of the edf file do not always match those 
        # in the pdf (used for localization). This could be error on the side of the tech who input the labels, 
        # or on the side of MNE reading the labels in. Usually there's a mixup between lowercase 'l' and capital 'I'.

        # Sometimes, there's electrodes on the pdf that are NOT in the MNE data structure... let's identify those as well. 
        new_mne_names, _, _ = match_elec_names(mne_data.ch_names, elec_data.label)
        # Rename the mne data according to the localization data
        new_name_dict = {x:y for (x,y) in zip(mne_data.ch_names, new_mne_names)}
        mne_data.rename_channels(new_name_dict)

    right_seeg_names = [i for i in mne_data.ch_names if i.startswith('r')]
    left_seeg_names = [i for i in mne_data.ch_names if i.startswith('l')]
    sEEG_mapping_dict = {f'{x}':'seeg' for x in left_seeg_names+right_seeg_names}

    mne_data.set_channel_types(sEEG_mapping_dict)

    # 3) Identify line noise
    mne_data.info['line_freq'] = 60

    # Notch out 60 Hz noise and harmonics 
    mne_data.notch_filter(freqs=(60, 120, 180, 240))

    # 4) Save out the photodiode channel separately
    ### SAR EDIT ###
    if save_path is None:
        mne_data.save(f'{load_path}/photodiode.fif', picks='dc1', overwrite=True)
    else:
        mne_data.save(f'{save_path}/photodiode.fif', picks='dc1', overwrite=True)
    ### SAR EDIT ###

    # 5) Clean up the MNE data 

    bads = detect_bad_elecs(mne_data, sEEG_mapping_dict)

    mne_data.info['bads'] = bads

    return mne_data

(Related to point 1): Some more troubleshooting now leads me to believe the ch_name list should append the full file prefix, not the integer at the end, so maybe something more like this is better:

ch_name.append(chan_name[:-5])

Hey,

I finally dug into this. Basically, a lot of filetype ('.edf' vs. '.ncs') and recording site ('MSSM' vs. other) specification had to be added, with the added complication of dealing with cases where sampling rate differs across channels.

I've added a branch to this issue to work on this. The main change is in the make_mne function below. Try this out and let me know if it works - I tried it only with MS022 so far.

def make_mne(load_path=None, elec_data=None, format='edf', site='MSSM', overwrite=True, **kwargs):
    """
    Make a mne object from the data and electrode files, and save out the photodiode. 
    Following this step, you can indicate bad electrodes manually.

    Parameters
    ----------
    load_path : str
        path to the neural data
    elec_data : pandas df 
        dataframe with all the electrode localization information
    format : str 
        how was this data collected? options: ['edf', 'nlx]
    site: str
        where was the data collected? options: ['UI', 'MSSM'].
        TODO: add site specificity for UC Davis
    overwrite: bool 
        whether to overwrite existing data for this person if it exists 
    kwargs: dict
        dictionary containing lists of different types of channel names 

    Returns
    -------
    mne_data : mne object 
        mne object
    """

    # OPTIONAL: Set specific channel names that you might need: 
    eeg_names = kwargs['eeg_names']
    resp_names = kwargs['resp_names']
    ekg_names = kwargs['ekg_names']
    photodiode_name = kwargs['photodiode_name']
    seeg_names = kwargs['seeg_names']

    if site == 'MSSM':
        if not eeg_names: # If no input, assume the standard EEG montage at MSSM
            eeg_names = ['fp1', 'f7', 't3', 't5', 'o1', 'f3', 'c3', 'p3', 'fp2', 'f8', 't4', 't6', 'o2', 'f4', 'c4', 'p4', 'fz', 'cz', 'pz']

    # 1) load the data:
    if format=='edf':
        # MAKE SURE ALL THE EDF CHANNELS HAVE THE SAME SR! See: https://github.com/mne-tools/mne-python/issues/10635

        if not photodiode_name: 
            photodiode_name = 'dc1'
        # EDF data always comes from MSSM AFAIK. Modify this if that changes.

        # This is a big block of data. Have to load first, then split out the sEEG and photodiode downstream. 
        edf_file = glob(f'{load_path}/*.edf')[0]
        mne_data = mne.io.read_raw_edf(edf_file, preload=True)

        # The electrode names read out of the edf file do not always match those 
        # in the pdf (used for localization). This could be error on the side of the tech who input the labels, 
        # or on the side of MNE reading the labels in. Usually there's a mixup between lowercase 'l' and capital 'I'.

        # Sometimes, there's electrodes on the pdf that are NOT in the MNE data structure... let's identify those as well. 
        new_mne_names, _, _ = match_elec_names(mne_data.ch_names, elec_data.label)
        # Rename the mne data according to the localization data
        new_name_dict = {x:y for (x,y) in zip(mne_data.ch_names, new_mne_names)}
        mne_data.rename_channels(new_name_dict)

        if not seeg_names:
            seeg_names = [i for i in mne_data.ch_names if ((i.startswith('l')) | (i.startswith('r')))]
        sEEG_mapping_dict = {f'{x}':'seeg' for x in seeg_names}

        mne_data.set_channel_types(sEEG_mapping_dict)

        mne_data.info['line_freq'] = 60
        # Notch out 60 Hz noise and harmonics 
        mne_data.notch_filter(freqs=(60, 120, 180, 240))

        # Save out the photodiode channel separately
        mne_data.save(f'{load_path}/photodiode.fif', picks=photodiode_name, overwrite=overwrite)

        # drop EEG and EKG channels
        drop_chans = list(set(mne_data.ch_names)^set(seeg_names))
        mne_data.drop_channels(drop_chans)

        bads = detect_bad_elecs(mne_data, sEEG_mapping_dict)
        mne_data.info['bads'] = bads

        mne_data.save(f'{load_path}/lfp_data.fif', picks=seeg_names, overwrite=overwrite)

    elif format =='nlx': 
        # This is a pre-split data. Have to specifically load the sEEG and photodiode individually.
        signals = [] 
        srs = [] 
        ch_name = [] 
        ch_type = []
        if site == 'MSSM': 
            # per Shawn, MSSM data seems to sometime have a "_0000.ncs" to "_9999.ncs" appended to the end of real data
            pattern = re.compile(r"_\d{4}\.ncs")  # regex pattern to match "_0000.ncs" to "_9999.ncs"
            ncs_files = [x for x in glob(f'{load_path}/*.ncs') if re.search(pattern, x)]
            # just in case this changes in the future: 
            if len(ncs_files) == 0: 
                ncs_files = glob(f'{load_path}/*.ncs')
                if not seeg_names:
                    seeg_names = [x.split('/')[-1].replace('.ncs','') for x in glob(f'{load_path}/[R,L]*.ncs')]
            else:
                if not seeg_names:
                    seeg_names = [x.split('/')[-1].replace('.ncs','').split('_')[0] for x in glob(f'{load_path}/[R,L]*.ncs') if re.search(pattern, x)]
        elif site == 'UI':
            # here, the filenames are not informative. We have to get subject-specific information from the experimenter
            ncs_files = glob(f'{load_path}/LFP*.ncs')
        if not seeg_names: 
            raise ValueError('no seeg channels specified')
        else:
            # standardize to lower
            seeg_names = [x.lower() for x in seeg_names]
            sEEG_mapping_dict = {f'{x}':'seeg' for x in seeg_names}

        for chan_path in ncs_files:
            chan_name = chan_path.split('/')[-1].replace('.ncs','')
            # strip the file type off the end if needed 
            if '_' in chan_name:
                chan_name = chan_name.split('_')[0]
            try:
                fdata = nlx_utils.load_ncs(chan_path)
            except IndexError: 
                print(f'No data in channel {chan_name}')
                continue
            #  surface eeg
            if eeg_names:
                eeg_names = [x.lower() for x in eeg_names]
                if chan_name.lower() in eeg_names:
                    ch_type.append('eeg')
            if resp_names:
                resp_names = [x.lower() for x in resp_names]
                if chan_name.lower() in resp_names:
                    ch_type.append('bio')
            if ekg_names:
                ekg_names = [x.lower() for x in ekg_names]
                if chan_name.lower() in ekg_names: 
                    ch_type.append('ecg') 
            if seeg_names: 
                if chan_name.lower() in seeg_names:
                    ch_type.append('seeg')  
                elif chan_name.lower()[0] == 'u':
                    # microwire data
                    ch_type.append('seeg')  
            signals.append(fdata['data'])
            srs.append(fdata['sampling_rate'])
            ch_name.append(chan_name)
            if len(ch_type) < len(ch_name):
                ch_type.append('misc')
                print(f'Unidentified data type in {chan_name}')

        if np.unique(srs).shape[0] == 1:
            # all the sampling rates match:
            info = mne.create_info(ch_name, np.unique(srs), ch_type)
            mne_data = mne.io.RawArray(signals, info)
        else:
            ## Now we have to account for differing sampling rates. This will only really happen in the case of data where ANALOGUE channels 
            ## are recorded at a much higher sampling rate, or with micro channels. Find the lowest sample rate, and downsample everything to that.
            ## I generally don't like this but it should be OK. Make sure that you identify synchronization times AFTER downsampling the analogue channel, and not before:
            ## https://gist.github.com/larsoner/01642cb3789992fbca59

            target_sr = np.min(srs)
            mne_data_resampled = []

            for sr in np.unique(srs):
                ch_ix = np.where(srs==sr)[0].astype(int)
                info = mne.create_info([x for ix, x in enumerate(ch_name) if ix in ch_ix], sr, [x for ix, x in enumerate(ch_type) if ix in ch_ix])
                mne_data_temp = mne.io.RawArray([x for ix, x in enumerate(signals) if ix in ch_ix], info)
                if sr != target_sr:
                    # resample down to one sample rate 
                    mne_data_temp.resample(sfreq=target_sr, npad='auto', n_jobs=-1)
                    mne_data_resampled.append(mne_data_temp)
                else: 
                    mne_data = mne_data_temp

            #Because of the resampling, the end timings might not match perfectly:https://github.com/mne-tools/mne-python/issues/8257
            if mne_data_resampled[0].tmax > mne_data.tmax:
                mne_data_resampled[0].crop(tmin=0, tmax=mne_data.tmax)
            elif mne_data_resampled[0].tmax < mne_data.tmax:
                mne_data.crop(tmin=0, tmax=mne_data_resampled[0].tmax)

            mne_data.add_channels(mne_data_resampled)

            mne_data.info['line_freq'] = 60
            # Notch out 60 Hz noise and harmonics 
            mne_data.notch_filter(freqs=(60, 120, 180, 240))

            new_name_dict = {x:x.replace(" ", "").lower() for x in mne_data.ch_names}
            mne_data.rename_channels(new_name_dict)

            # Save out the photodiode channel separately
            if not photodiode_name:
                raise ValueError('no photodiode channel specified')
            else:
                mne_data.save(f'{load_path}/photodiode.fif', picks=photodiode_name, overwrite=overwrite)

            # Save out the respiration channels separately
            if resp_names:
                mne_data.save(f'{load_path}/respiration_data.fif', picks=resp_names, overwrite=overwrite)

            # Save out the EEG channels separately
            if eeg_names: 
                mne_data.save(f'{load_path}/scalp_eeg_data.fif', picks=eeg_names, overwrite=overwrite)

            # Save out the EEG channels separately
            if ekg_names:
                mne_data.save(f'{load_path}/ekg_data.fif', picks=ekg_names, overwrite=overwrite)

            drop_chans = list(set([x.lower() for x in mne_data.ch_names])^set(seeg_names))
            mne_data.drop_channels(drop_chans)

            bads = detect_bad_elecs(mne_data, sEEG_mapping_dict)
            mne_data.info['bads'] = bads

            mne_data.save(f'{load_path}/lfp_data.fif', picks=seeg_names, overwrite=overwrite)

    return mne_data

seqasim / LFPAnalysis