rmarkello / abagen

A toolbox for working with Allen Human Brain Atlas microarray expression data
https://abagen.readthedocs.io
BSD 3-Clause "New" or "Revised" License
92 stars 41 forks source link

Strange sample numbers/`NaN` expression values returned when using text file-based masks/parcellation #203

Open alyssadai opened 2 years ago

alyssadai commented 2 years ago

I am trying to obtain sample-level expression data for each of 8 components (~parcels) from a data-driven cortical decomposition performed in CIVET space. I have both a LH/RH binary mask per component, with value 0/1 for each of 41k vertices, and a whole-brain atlas version of the components comprising a LH/RH file with a label between 1-8 per vertex.

Issue:

The samples returned for each component by get_samples_in_mask() sum up to 8695 samples total, which exceeds the number of unique AHBA samples (~3500). Based on the well_id values of the output dataframes, it appears that some samples are returned for >1 mask, even though the masks are non-overlapping.

Using instead the whole-brain components atlas with get_expression_data(atlas, region_agg=None), I get an expression dataframe with a more reasonable *1678 samples, but this dataframe unusually contains some NaN values - so it seems this approach is not working exactly as expected either.

*EDIT: This previously incorrectly said 1664, which is actually the number of samples I get if I don't use neuromaps.images.relabel_gifti() when loading in my LH/RH label giftis. 1678 samples is reproducible with the code below.

Steps to reproduce:

The masks/atlas were converted from .txt to .gii format using neuromaps.images.construct_shape_gii(). The CIVET midsurfaces used to construct abagen-style atlases were converted from .obj to .gii using neuromaps.images.obj_to_gifti().

Data for reproducing bug: abagen_sample_bug_data.zip

Code used to obtain expression data:

# Python 3.8.5
import neuromaps
import abagen           # v0.1.3
import nibabel as nib   # v3.2.1
import pandas as pd     # v1.1.2
import numpy as np      # v1.19.2

ncomps=8
surface = ('surfaces/PS_N266_lh_average_midsurface.surf.gii', 'surfaces/PS_N266_rh_average_midsurface.surf.gii')

'''
1. Getting sample-level expression data for each component using binary masks
'''
def get_mask_exp(c):
    component = (f'binary_masks/lh_component{c}.label.gii', f'binary_masks/rh_component{c}.label.gii')
    mask = abagen.check_atlas(component, geometry = surface, space = 'fslr')
    samples, coords = abagen.get_samples_in_mask(mask)
    return samples

# run get_mask_exp() for each component
samples_list=[]
for c in range(1,ncomps+1):
    mysamples = get_mask_exp(c)
    samples_list.append(mysamples)
    mysamples.to_pickle(f'samples{c}.pkl')

samples_stacked = pd.concat(samples_list, axis=0)
samples_stacked.to_pickle('samples_stacked.pkl')

'''
2. Getting sample-level expression data for each component using a whole-brain atlas
'''
# relabel GIFTI images such that labels for atlas are consecutive across hemispheres (so, component 1 = labels 1 and 9, component 2 = labels 2 and 10, etc.)
components = neuromaps.images.relabel_gifti(('atlas/lh_components.label.gii', 'atlas/rh_components.label.gii'))

atlas = abagen.check_atlas(components, geometry = surface, space = 'fslr') # returns: AtlasTree[n_rois=16, n_vertex=77122]
allsamples = abagen.get_expression_data(atlas, region_agg=None)

# check for NaN values
nan_total = allsamples.isna().sum().sum() # 463
nan_whichgenes = allsamples.columns[allsamples.isna().any()].to_list() # ['MALAT1', 'RPS4Y2']

### extract + save samples for each component from atlas output ###
# samples_bycomp = []
# for c in range(1,ncomps+1):                                                 # for each component number
#     samples_bycomp.append(allsamples.loc[[c,ncomps+c],])                    # store dataframe containing rows of `allsamples` with LH/RH `label` corresponding to component
#     samples_bycomp[c-1].reset_index(level='label', drop=True, inplace=True) # drop `label` index level
#     samples_bycomp[c-1].to_pickle(f'atlas_samples{c}.pkl')
#
# atlas_samples_stacked = pd.concat(samples_bycomp, axis=0)
# atlas_samples_stacked.to_pickle('atlas_samples_stacked.pkl')

Any thoughts on how to troubleshoot this would be much appreciated!

alyssadai commented 2 years ago

I corrected a small typo I noticed in the issue. Please see "*EDIT" in the comment above. Sorry for any confusion!