int-brain-lab / iblenv

Unified environment and Issue tracker for all IBL

spikes.times.npy seems to be missing from some sessions? #343

Closed chrisan97 closed 5 months ago

chrisan97 commented 1 year ago

Hello, I am trying to extract all recordings from VISp. I did this by

from one.api import ONE
one = ONE()  # assumes ONE is configured to point at the public IBL database

# Given brain area acronym to check
brain_acronym = 'VISp'
sessions = one.search(atlas_acronym=brain_acronym, query_type='remote')
print(f'No. of detected sessions: {len(sessions)}')
insertions = one.search_insertions(atlas_acronym=brain_acronym, query_type='remote')
print(f'No. of detected insertions: {len(insertions)}')

# Check that there is a spikes.times.npy file for each insertion, just for quality control
good_insertions = []
for pid in insertions:
    eid, name = one.pid2eid(pid)
    sessionFileList = one.list_datasets(eid)
    if f'alf/{name}/pykilosort/spikes.times.npy' in sessionFileList:
        good_insertions.append(pid)

# The length of good_insertions should match the number of detected insertions
len(good_insertions)

The number of good_insertions comes out to 47, but the number of detected insertions in VISp is 49. I noticed this issue while I was trying to do the following:

import pandas as pd
from ibllib.atlas import AllenAtlas
ba = AllenAtlas()  # atlas object used below to map atlas ids to acronyms

# Load in relevant data for each insertion to create a table containing cluster amplitude and brain location
VISp_clusters = []
for pid in good_insertions:  # only iterate through pids in good_insertions
  eid, name = one.pid2eid(pid)
  # Load in clusters data that we need
  clusters = one.load_object(eid, 'clusters', attribute=['amps', 'channels', 'depths', 'peakToTrough'], collection=f'alf/{name}/pykilosort')
  # Load in channels data that we need. This gives the Allen CCF location of each channel
  channels_atlas_id = one.load_dataset(eid, 'channels.brainLocationIds_ccf_2017', collection=f'alf/{name}/pykilosort')
  # Use channels_atlas_id and clusters.channels to compute the atlas_id of each cluster
  clusters['atlas_ids'] = channels_atlas_id[clusters['channels']]
  # Use regions object to convert atlas ids to acronyms
  clusters['acronyms'] = ba.regions.id2acronym(clusters['atlas_ids'])
  # Convert clusters object to pandas dataframe and add the pid as a column
  clusters_df = clusters.to_df()
  clusters_df['pid'] = pid
  # Append table from individual pid to list
  VISp_clusters.append(clusters_df)

# concatenate into one big table
VISp_clusters = pd.concat(VISp_clusters)

which was meant to create a table of the clusters that are in VISp. If I iterate through the insertion pids instead of good_insertions, it throws an error, presumably because some files are missing from the database.
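
For now I am working around it by wrapping the load in a try/except and skipping insertions whose datasets fail to load. A rough sketch (I have not checked exactly which exception one.load_object raises, so I catch broadly):

# Rough workaround sketch: skip insertions whose clusters data cannot be loaded
for pid in insertions:
    eid, name = one.pid2eid(pid)
    try:
        clusters = one.load_object(eid, 'clusters', attribute=['amps', 'channels', 'depths', 'peakToTrough'], collection=f'alf/{name}/pykilosort')
    except Exception as err:  # broad catch; the exact exception class is an assumption
        print(f'Skipping {pid}: {err}')
        continue
    # ... proceed with the channels/atlas lookup as above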

I have also noticed that there is an alternative way to extract units, which is to use the clusters table directly from the database. Which approach is more accurate at this point? Should I use the method below instead of the one above?

from one.remote import aws
s3, bucket_name = aws.get_s3_from_alyx(alyx=one.alyx)

# Define where the table will be saved
table_path = one.cache_dir.joinpath('bwm_features', 'clusters_table.pqt')
# Download the table
aws.s3_download_file("aggregates/2022_Q4_IBL_et_al_BWM/clusters.pqt", table_path, s3=s3, bucket_name=bucket_name)

# Load in the file as a pandas table
clusters_table = pd.read_parquet(table_path)
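
If I go this route, I would then filter the table down to VISp myself, roughly like the sketch below (assuming the table has an 'acronym' column; I have not verified the column names):

# Sketch: restrict the downloaded table to VISp clusters
# (assumes an 'acronym' column; column names not verified)
visp_clusters_table = clusters_table[clusters_table['acronym'].str.startswith('VISp')]
print(f'No. of VISp clusters in the table: {len(visp_clusters_table)}')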

Thanks, Chris

juhuntenburg commented 5 months ago

Hi @chrisan97,

Thank you for your question and sorry for the late reply; this slipped through the cracks. Not sure if this is still useful for you, but I'll leave the answer here for completeness.

You are absolutely right in finding that discrepancy between all insertions on the public database and the insertions that have the spikes.times.npy dataset. The two insertions for which we did not release spikes.times.npy did not pass quality control. I am not sure why they were released at all, but sometimes only part of the data is made public for specific reasons.

The easiest way to adapt your query to only return the probe insertions that pass through VISp and have spikes.times.npy data is like this:

insertions = one.search_insertions(atlas_acronym=brain_acronym, datasets='spikes.times.npy', query_type='remote')

As of today, this will return 70 insertions, while omitting the datasets keyword argument will return 72.

Regarding the clusters table, this is a convenience dataset specific to the brainwidemap release. The clusters table and the individual datasets on the database are synced, i.e. the data contained in this table and the data you obtain by downloading the individual datasets from the database are the same. Note, however, that the clusters table does not contain the spike times, so if you need that information you will still have to download the spikes.times.npy datasets. The best way to access this data is by using the SpikeSortingLoader.
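
For reference, a minimal sketch of that pattern (assuming one is set up for the public database as above and pid is one of the insertions returned by the query):

from brainbox.io.one import SpikeSortingLoader

# Load spike sorting data for a single insertion
ssl = SpikeSortingLoader(pid=pid, one=one)
spikes, clusters, channels = ssl.load_spike_sorting()
# Merge channel information (including brain locations) into the clusters object
clusters = ssl.merge_clusters(spikes, clusters, channels)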

Also, the clusters table only contains probe insertions that are part of the brainwidemap release. Currently I believe all insertions on the database are represented in the clusters table; however, in the future the above query might return insertions that are not part of the brainwidemap release and thus not in the clusters table.
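
If you ever need to guard against that, you can cross-check the insertions returned by the query against the pids that actually appear in the table, for example (a sketch, assuming the table has a 'pid' column):

# Sketch: keep only insertions that are present in the clusters table
# (assumes the table identifies insertions by a 'pid' column)
table_pids = set(clusters_table['pid'].unique())
insertions_in_table = [pid for pid in insertions if pid in table_pids]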

If you want to download the most up-to-date version of the clusters table, please use this function from the paper-brain-wide-map repository.

Hope this helps and sorry again for the delay, Julia

(I am closing this issue but feel free to re-open if you have further questions)