Release the vocabulary/gene map

Egiob commented 5 months ago

Hello, I understand that Nicheformer operates on a vocabulary of 20,310 genes. But I can't find in this repo the map that would let me convert, say, an Ensembl ID or a gene name to an ID (i.e. a token) in your vocabulary.

Could you provide this gene map please? Or indicate how you constructed it?

Thank you so much.

yehuicheng2002 commented 3 months ago

@Egiob Hello, have you solved this problem now?

dimalvovs commented 1 month ago

Could it be that the mapping is obtained like this (so that the token 10723 is ENSG00000000003)?

import scanpy as sc

h5ad = sc.read_h5ad("nicheformer/data/model_means/model.h5ad")
h5ad.X
  (0, 10723)    1.0
  (0, 12184)    4.0
  (0, 5297)     1.0
  (0, 17537)    1.0
  (0, 6145)     1.0
  (0, 13799)    1.0
  (0, 3204)     1.0
  (0, 19265)    1.0

h5ad.X.shape
(1, 20310)

h5ad.var
Empty DataFrame
Columns: []
Index: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG00000004776, ENSG00000004777, ENSG00000004779, ENSG00000004799, ENSG00000004809, ENSG00000004838, ENSG00000004846, ENSG00000004848, ENSG00000004864, ENSG00000004866, ENSG00000004897, ENSG00000004939, ENSG00000004948, ENSG00000004961, ENSG00000004975, ENSG00000005001, ENSG00000005007, ENSG00000005020, ENSG00000005022, ENSG00000005059, ENSG00000005073, ENSG00000005075, ENSG00000005100, ENSG00000005102, ENSG00000005108, ENSG00000005156, ENSG00000005175, ENSG00000005187, ENSG00000005189, ENSG00000005194, ENSG00000005206, ENSG00000005238, ENSG00000005243, ENSG00000005249, ENSG00000005302, ENSG00000005339, ENSG00000005379, ENSG00000005381, ENSG00000005421, ENSG00000005436, ENSG00000005448, ENSG00000005469, ENSG00000005471, ENSG00000005483, ...]

[20310 rows x 0 columns]
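
To make the positional reading concrete, here is a minimal sketch that builds a lookup from the order of the var index. It assumes (unconfirmed by the authors) that a token is simply the column position in model.h5ad. The helper name `build_gene_to_token` and its `offset` parameter are hypothetical, added only in case the model reserves some low IDs for special tokens. Under this reading ENSG00000000003 would sit at position 0, and column 10723 would be whatever ID is stored at `h5ad.var_names[10723]`.

```python
from typing import Dict, Iterable


def build_gene_to_token(var_names: Iterable[str], offset: int = 0) -> Dict[str, int]:
    """Map each gene ID to its column position, plus an optional offset.

    ASSUMPTION: token ID == column index in model.h5ad. `offset` is a
    hypothetical knob in case low IDs are reserved for special tokens.
    """
    mapping: Dict[str, int] = {}
    for position, gene_id in enumerate(var_names):
        if gene_id in mapping:
            raise ValueError(f"duplicate gene id: {gene_id}")
        mapping[gene_id] = position + offset
    return mapping


# Toy stand-in for h5ad.var_names: the first three IDs from the index above.
toy_var_names = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

gene_to_token = build_gene_to_token(toy_var_names)
# Reverse lookup: token -> Ensembl ID.
token_to_gene = {token: gene for gene, token in gene_to_token.items()}
```

With the real file, the same call would be `build_gene_to_token(h5ad.var_names)`; whether the resulting integers match the model's actual token IDs still needs confirmation from the maintainers.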

dimalvovs commented 1 month ago

Oh, based on the notebooks (.ipynb) it looks even simpler: we can just take the gene ordering from the model.h5ad:

# Loading model with right gene ordering
model = sc.read_h5ad(
    f"{BASE_PATH}/model.h5ad"
)
...
# Concatenation
# Next we concatenate the model and the dissociated object to ensure they are in the same order.
# This ensures we have the same gene ordering in the object.

adata = ad.concat([model, dissociated], join='outer', axis=0)
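
For intuition about what "same gene ordering" means here, the sketch below reorders one cell's counts into a reference gene order by hand. It is a plain-Python illustration, not what `ad.concat` does internally: the helper `align_to_reference` is hypothetical, and unlike an outer join it drops genes that are missing from the reference instead of keeping them.

```python
from typing import Dict, List, Sequence


def align_to_reference(
    reference_genes: Sequence[str],
    sample_genes: Sequence[str],
    sample_counts: Sequence[float],
    fill_value: float = 0.0,
) -> List[float]:
    """Reorder one cell's counts into the reference gene order.

    Genes absent from the sample get `fill_value`; sample genes that are
    not in the reference are dropped.
    """
    if len(sample_genes) != len(sample_counts):
        raise ValueError("sample_genes and sample_counts must have equal length")
    lookup: Dict[str, float] = dict(zip(sample_genes, sample_counts))
    return [lookup.get(gene, fill_value) for gene in reference_genes]


# Toy reference order: the first three IDs from the model.h5ad index.
reference = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

# Toy cell measured in a different order, with one gene outside the reference.
genes = ["ENSG00000000419", "ENSG00000000003", "GENE_NOT_IN_MODEL"]
counts = [5.0, 2.0, 7.0]

aligned = align_to_reference(reference, genes, counts)
```

After alignment, position i in every cell refers to the same gene as position i in the reference, which is the property the positional token mapping would rely on.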
