snap-stanford / SATURN

MIT License

Why is indices_tuple always a tuple of empty tensors? #24

Closed MohammedZidane closed 1 year ago

MohammedZidane commented 1 year ago

Hi, in the function get_species_triplet_indices in the file loss_and_miner_utils.py, I do not understand why a_idx, p_idx, and n_idx are always empty tensors. My understanding from the paper is that there should be indices for the anchor, positive, and negative cells.

Could you clarify this point for me?

I also have a related question: does the selection of anchor cells happen within one species or across both species?

Thanks!

Yanay1 commented 1 year ago

Those shouldn't be empty. Which dataset are you using?

The anchors are chosen using every unique label from every species (in each mini batch).

MohammedZidane commented 1 year ago

Thanks for your reply. I am using the frog and zebrafish datasets. Here is what I got:

p_inds: tensor([ 750,  788,  790,  812,  823,  827,  829,  834,  842,  858,  863,  874,
         900,  919,  929,  955,  956,  963,  979, 1011, 1034, 1038, 1094, 1113,
        1122, 1125, 1131, 1165, 1182, 1183, 1190, 1207, 1249, 1263, 1282, 1290,
        1299, 1303, 1345, 1362, 1363, 1368, 1378, 1379, 1384, 1408, 1409, 1416,
        1430, 1441, 1446, 1450, 1480, 1481, 1492], device='cuda:0')
n_inds: tensor([   0,    1,    2,  ..., 1497, 1498, 1499], device='cuda:0')
torch.where(species[a_inds] == sp)[0]: tensor([], device='cuda:0', dtype=torch.int64)
torch.where(species[a_inds] == sp)[0]: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54], device='cuda:0')
indices_tuple: (tensor([], device='cuda:0', dtype=torch.int64), tensor([], device='cuda:0', dtype=torch.int64), tensor([], device='cuda:0', dtype=torch.int64))
Epoch 50 Iteration 0: Loss = 0.0, Number of mined triplets = 0

Interestingly, at the beginning of the function get_species_triplet_indices, a_inds and p_inds have values, but the function then returns empty tensors. I also printed torch.where(species[a_inds] == sp)[0] inside the loop, here:

        for sp in unique_species:
            print('torch.where(species[a_inds] == sp)[0]:', torch.where(species[a_inds] == sp)[0])  # debug print added for this report
            a_sp = a_inds[torch.where(species[a_inds] == sp)[0]]

            n_a_sp = a_sp.shape[0]
            p_not_sp = p_inds[torch.where(species[p_inds] != sp)[0]]
            n_p_sp = p_not_sp.shape[0]
            #print(f"for species {sp}, {n_a_sp}, {n_p_sp}")
            if n_p_sp > 0:
                k = n_p_sp
                num_triplets_sp = n_a_sp * k    
                a_ = torch.arange(n_a_sp).view(-1, 1).repeat(1, k).view(num_triplets_sp)
                p_not_sp_ = p_not_sp.expand((n_a_sp, n_p_sp))
                p_ = torch.randint(0, n_p_sp, (num_triplets_sp,))
                p.append(p_not_sp_[a_, p_])
                a.append(a_sp[a_])
                #print(f"Adding {a_inds[a_].shape}, {p_inds_[a_, p_].shape}")
                # print(p_inds_[a_, p_].shape)

        a = torch.cat(a)
        p = torch.cat(p)

As shown, one of the two species gives an empty tensor. I am not sure whether this has something to do with what the function returns.

Thanks!
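For readers following the thread, the anchor/positive pairing in the loop quoted above can be sketched with plain Python lists standing in for the tensors. This is a reading of the excerpt only, not SATURN's implementation: the function name cross_species_pairs, the seed argument, and the toy indices are hypothetical, and negatives are not covered. Under this reading, an empty anchor selection for one species is expected whenever all anchors in a batch belong to the other species, and the result is empty only when no anchor has a positive from a different species.

```python
import random

def cross_species_pairs(a_inds, p_inds, species, seed=0):
    """Pair every anchor with k randomly drawn positives from other species.

    Hypothetical stand-in for the quoted loop; k is the number of
    positives that do not share the anchor's species.
    """
    rng = random.Random(seed)
    a_out, p_out = [], []
    for sp in sorted(set(species)):
        # Anchors that belong to species sp.
        a_sp = [i for i in a_inds if species[i] == sp]
        # Positives that belong to any other species.
        p_not_sp = [j for j in p_inds if species[j] != sp]
        if not p_not_sp:
            continue  # nothing to pair these anchors with
        k = len(p_not_sp)
        for anchor in a_sp:
            for _ in range(k):
                a_out.append(anchor)
                p_out.append(rng.choice(p_not_sp))  # drawn with replacement
    return a_out, p_out

species = [0] * 4 + [1] * 4  # toy batch: cells 0-3 are species 0, 4-7 species 1

# Anchors and positives all come from species 1: no cross-species pairs.
print(cross_species_pairs([4, 5], [6, 7], species))  # ([], [])

# Anchors from species 0, positives from species 1: pairs are produced.
a, p = cross_species_pairs([0, 1], [6, 7], species)
print(a)  # [0, 0, 1, 1]
```

In the first call, species 0 contributes no anchors and species 1 has no positives from another species, so both loop iterations add nothing.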

Yanay1 commented 1 year ago

What batch size are you using? At what epoch does this happen? Are you integrating both species?

MohammedZidane commented 1 year ago

Hi, thanks for your reply. The batch size is the one determined in the code; I think it is set in the forward function: batch_size = inp.shape[0]. I subsampled the frog and zebrafish data because the full datasets are too big. I have 750 cells for each species, so the batch_size I am getting is 750.

This happens in all epochs.

Regarding whether I am integrating both species: I am not sure I understand the question. I am using the code as it is and providing the input data as described in the instructions. If you mean concatenating the two species together, then no, I did not do that. Note: although I am getting empty indices, the UMAP and PCA plots look good.

I hope you can help me figure out this issue. Thanks!

Yanay1 commented 1 year ago

Did you subsample the entire dataset to 750 cells for each species, or set the batch size to 750?

The function can return no triplets when there are none to mine, which may happen after many epochs, or when the dataset or the batch size is small.

MohammedZidane commented 1 year ago

Thanks for your reply :) I subsampled the data to 750 cells for each species. I did not set the batch size to 750. I tried 15000 cells for each species but I am still getting an empty indices_tuple. However, I noticed an interesting thing:

torch.where(species[a_inds] == sp)[0]: tensor([], device='cuda:0', dtype=torch.int64)
torch.where(species[a_inds] == sp)[0]: tensor([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
         42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
         56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,
         70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
         84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104], device='cuda:0')

torch.where(species[a_inds] == sp)[0] is printed inside a for loop over unique_species (0 and 1 in my case). As you can see, one of the species always returns an empty tensor. Do you think that has anything to do with the problem I am facing?

Thanks!

Yanay1 commented 1 year ago

How are you testing out this code? Are you running the main script?

The anchor idxs will always start as just one species. It is weird that this is returning consecutive indices though.

Maybe the data being passed isn't in a randomized order?
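One way to check the ordering hypothesis is sketched below in plain Python. It assumes nothing about how SATURN builds its batches: make_batches, species_in_batch, and the toy labels (750 cells per species, as in this thread) are hypothetical. It only illustrates that species-sorted data read in order yields single-species batches, while a shuffled order mixes them.

```python
import random

def make_batches(n_items, batch_size, shuffle, seed=0):
    """Split item indices 0..n_items-1 into batches, optionally shuffled first."""
    order = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_items, batch_size)]

# Toy labels: the first 750 cells are species 0, the next 750 are species 1.
species = [0] * 750 + [1] * 750

def species_in_batch(batch):
    """Set of species labels present in one batch of item indices."""
    return {species[i] for i in batch}

unshuffled = make_batches(len(species), 750, shuffle=False)
shuffled = make_batches(len(species), 750, shuffle=True)

print([species_in_batch(b) for b in unshuffled])  # [{0}, {1}]
print([species_in_batch(b) for b in shuffled])    # each batch mixes both species
```

Printing the set of species per batch in the real training loop would show directly whether batches are mixed.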

MohammedZidane commented 1 year ago

Thanks for your reply! I made a fresh installation of the pipeline and indices_tuple is not empty anymore :). Interestingly, I am still getting consecutive indices, sometimes for both species and sometimes for only one. Do you think that makes sense?

I am running the main script according to the instructions; I just changed the data path to the path of the subsampled frog and zebrafish data.

Do I need to randomize the data myself, or is that part of the code?

Thanks!
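On the consecutive indices: torch.where(mask)[0] returns the positions at which mask is true, so the printed values index into a_inds rather than into the dataset. The plain-Python stand-in below (positions_where and the toy a_inds are hypothetical) suggests that a run 0..n-1 for one species and an empty result for the other is what one would see whenever every anchor in the batch belongs to a single species, whatever the order of the underlying data.

```python
def positions_where(a_inds, species, sp):
    """Plain-Python stand-in for torch.where(species[a_inds] == sp)[0].

    Returns positions within a_inds, not the dataset indices stored in it.
    """
    return [pos for pos, idx in enumerate(a_inds) if species[idx] == sp]

# Toy labels: cells 0-749 are species 0, cells 750-1499 are species 1.
species = [0] * 750 + [1] * 750

# Scattered, non-consecutive anchors that all belong to species 1.
a_inds = [1203, 751, 1499, 980, 1011]

print(positions_where(a_inds, species, 0))  # []
print(positions_where(a_inds, species, 1))  # [0, 1, 2, 3, 4]
```

Printing a_inds itself, rather than the positions, would show whether the anchors are actually consecutive in the data.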

MohammedZidane commented 1 year ago

The issue is resolved :)