linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Change cell names in loom file #157

Closed brianherb closed 3 years ago

brianherb commented 3 years ago

Hi all,

I'm systematically reprocessing a large amount of Smart-seq data. The pipeline I'm using processes multiple individual fastq files (one per cell) and produces a single loom file containing the counts for all cells. Unfortunately, the way I have to run this pipeline forces me to use the file names as cell IDs instead of the shorter cell IDs I actually want. What I would like to do is swap each file name for a cell ID of my choosing. Is there a straightforward way to do this? I attempted a solution with the loompy module in Python, but I'm worried that the new loom file now has no cell IDs.

for example - what I see in the original .obs slot:

image

and what I want to change it to:

image

but when I try to rebuild the loom file, I lose the cell id in the index:

image

Here is what I tried so far:

data = scanpy.read_loom(filename=lf)
sampleName = re.sub('.loom','',lf)
Obs = data.obs
Gene = data.var
Mat = data.layers['intron_counts'].toarray().transpose() ## only one layer in this dataset

Obs2 = copy.deepcopy(Obs)
a = Obs['input_id'].tolist() # this is file names
b = lookUp['file'].tolist() # lookUp object contains file name to cell id mapping
ind=[ b.index(x) if x in b else None for x in a ] ## index file names

newCell = lookUp['cell_id'][ind].tolist()
oldCell = Obs['input_id'].tolist()
cellDict = {oldCell[i]: newCell[i] for i in range(len(oldCell))} ## swap in new cell ids

Obs2.index = newCell
Obs2.index.name = 'CellID'

Obs2["cell_names"].replace(cellDict, inplace=True)

# convert to dictionary for loompy.create - am I losing cell id here? How should I construct this dict?

L1 = Obs2.columns.tolist()
L2 = Obs2.transpose().to_numpy()
Obsd = {k:v for k,v in zip(L1,L2)}

# get gene info back in loom object - also, am I dropping gene names here?

L1 = Gene.columns.tolist()
L2 = Gene.transpose().to_numpy()
Gened = {k:v for k,v in zip(L1,L2)}

loompy.create('test.loom',{'':Mat,'intron_counts':Mat}, Gened, Obsd)
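
For reference, one likely reason the IDs disappear is that a DataFrame's index is not part of its columns, so a dictionary built only from the columns never contains CellID or the gene names. The following is a minimal sketch, not a confirmed fix, of building the attribute dictionaries with the index added explicitly. The helper name frame_to_attrs and the toy values are hypothetical; the attribute names CellID and Gene are assumed because they are the names scanpy.read_loom looks for by default.

```python
import numpy as np
import pandas as pd


def frame_to_attrs(df: pd.DataFrame, index_name: str) -> dict:
    """Turn a DataFrame into {attribute name: 1-D array}, keeping the index."""
    # The index is not listed in df.columns, so it has to be added by hand.
    attrs = {index_name: df.index.to_numpy()}
    # Converting column by column keeps each column's own dtype.
    for col in df.columns:
        attrs[col] = df[col].to_numpy()
    return attrs


# Toy stand-ins for Obs2 and Gene (hypothetical values).
obs = pd.DataFrame({"input_id": ["fileA", "fileB"]},
                   index=pd.Index(["cell1", "cell2"], name="CellID"))
var = pd.DataFrame({"Accession": ["ENSG1", "ENSG2", "ENSG3"]},
                   index=pd.Index(["g1", "g2", "g3"], name="Gene"))

col_attrs = frame_to_attrs(obs, "CellID")  # one entry per cell
row_attrs = frame_to_attrs(var, "Gene")    # one entry per gene
```

The two dictionaries could then take the place of Obsd and Gened in the loompy.create call, with row_attrs passed before col_attrs.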

slinnarsson commented 3 years ago

I think this is more a question for the scanpy people, since you're using the to_loompy() method. You can use loompy directly to set any attribute, e.g. something like:

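import loompy
import numpy as np

# Open the loom file with loompy (the built-in open() has no .ca attributes)
with loompy.connect(filename) as ds:
    old_ids = ds.ca.CellID[:]
    # Take the first 20 characters of each filename as the cellid
    new_ids = np.array([s[:20] for s in old_ids], dtype=object)
    ds.ca.CellID = new_ids

brianherb commented 3 years ago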

Thank you! That worked quite well. Here is my modified code, which uses the lookUp table (I like the trick of taking the first 20 characters, but unfortunately that doesn't always give the right ID for my files):

for lf in loomFiles:
    ds = loompy.connect(lf)
    old_ids = ds.ca.CellID[:]
    b = lookUp['file'].tolist()
    ind=[ b.index(x) if x in b else None for x in old_ids]
    new_ids = lookUp['cell_id'][ind].tolist()
    ds.ca.cell_names = new_ids
    ds.ca.CellID = new_ids
    ds.close()
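
A possible variation on the loop above, sketched under a few assumptions: the lookup is held as a plain dictionary from file name to cell ID, file names missing from it keep their old ID rather than becoming None, and the file is opened in a with block so it is closed even if something fails. The function names remap_ids and rename_cells and the example file names are hypothetical.

```python
def remap_ids(old_ids, file_to_cell):
    """Return new IDs in the same order as old_ids.

    IDs missing from the mapping are kept unchanged (an assumption;
    raising an error instead may be preferable for some pipelines).
    """
    return [file_to_cell.get(old, old) for old in old_ids]


def rename_cells(loom_path, file_to_cell):
    """Rewrite the CellID column attribute of one loom file in place."""
    # Imported here so remap_ids can be tried without loompy installed.
    import loompy
    import numpy as np

    with loompy.connect(loom_path) as ds:
        new_ids = remap_ids(ds.ca.CellID[:], file_to_cell)
        ds.ca.CellID = np.array(new_ids, dtype=object)


# Hypothetical mapping. With the lookUp table from the post it could be
# built as dict(zip(lookUp['file'], lookUp['cell_id'])).
mapping = {"plate1_A01.fastq": "cell_001", "plate1_A02.fastq": "cell_002"}
example = remap_ids(["plate1_A02.fastq", "unknown.fastq"], mapping)
print(example)  # ['cell_002', 'unknown.fastq']
```

Calling rename_cells(lf, mapping) for each lf in loomFiles would then stand in for the body of the loop. A dictionary lookup also avoids the repeated b.index(x) list scans, which can get slow with many cells.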