theislab / nicheformer

Repository for Nicheformer: a foundation model for single-cell and spatial omics
BSD 3-Clause "New" or "Revised" License

Release the vocabulary/gene map #10

Open Egiob opened 5 months ago

Egiob commented 5 months ago

Hello, I understand that Nicheformer operates on a vocabulary of 20,310 genes, but I can't find in this repo the map that would convert, say, an Ensembl ID or a gene name to an ID (i.e. a token) in your vocabulary.

Could you provide this gene map please? Or indicate how you constructed it?

Thank you so much.

yehuicheng2002 commented 3 months ago

@Egiob Hello, have you solved this problem now?

dimalvovs commented 1 month ago

Could it be that the mapping is obtained like this (so that token 10723 is ENSG00000000003)?

h5ad = sc.read_h5ad("nicheformer/data/model_means/model.h5ad")
h5ad.X
  (0, 10723)    1.0
  (0, 12184)    4.0
  (0, 5297)     1.0
  (0, 17537)    1.0
  (0, 6145)     1.0
  (0, 13799)    1.0
  (0, 3204)     1.0
  (0, 19265)    1.0

h5ad.X.shape
(1, 20310)

h5ad.var
Empty DataFrame
Columns: []
Index

[20310 rows x 0 columns]

dimalvovs commented 1 month ago

Oh, based on the notebooks (ipynbs), it looks even simpler: we can just take the gene ordering from model.h5ad:

import anndata as ad
import scanpy as sc

# Load the model file, which carries the reference gene ordering
model = sc.read_h5ad(
    f"{BASE_PATH}/model.h5ad"
)
...
# Concatenation
# Next we concatenate the model and the dissociated object to ensure they are in the same order.
# This ensures we have the same gene ordering in the object.

adata = ad.concat([model, dissociated], join='outer', axis=0)
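
If the gene ordering really does live in model.h5ad, a gene-to-token map could be sketched as below. This is only a guess at the construction, not the authors' confirmed method. It assumes that the index of model.var (i.e. model.var_names) holds the gene identifiers and that a gene's token is its column position. The helper build_gene_to_token and its offset parameter are hypothetical; offset is there only in case the model reserves some low IDs for special tokens, which this thread does not confirm. Note also that the (0, 10723) 1.0 lines printed for h5ad.X are the coordinates of nonzero entries in the sparse matrix, so on their own they do not say which gene sits at column 10723. The three IDs in the toy list are placeholders and not the real order.

```python
from typing import Dict, Iterable


def build_gene_to_token(gene_ids: Iterable[str], offset: int = 0) -> Dict[str, int]:
    """Map each gene identifier to its column position plus an optional offset.

    Hypothetical helper: assumes token id == column index (+ offset).
    """
    mapping: Dict[str, int] = {}
    for position, gene_id in enumerate(gene_ids):
        if gene_id in mapping:
            # Duplicate identifiers would make the map ambiguous
            raise ValueError(f"duplicate gene identifier: {gene_id}")
        mapping[gene_id] = position + offset
    return mapping


# With the real file this would look like (assumption: var_names holds gene IDs):
#   model = sc.read_h5ad(f"{BASE_PATH}/model.h5ad")
#   gene_to_token = build_gene_to_token(model.var_names)

# Toy stand-in so the sketch runs without the file; NOT the real gene order
toy_var_names = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]
gene_to_token = build_gene_to_token(toy_var_names)
print(gene_to_token["ENSG00000000419"])
```

To check the guess against the real file, one could look up model.var_names[10723] and see which gene it actually is.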