initzhang / DUCATI_SIGMOD

Accepted paper at SIGMOD 2023: DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU

Question for function "DUCATI.CacheConstructor.separate_features_idx" #7

Closed Changyuan0825 closed 7 months ago

Changyuan0825 commented 7 months ago

I would like to ask: why do we need to use randomly generated feature vectors (i.e., fake input)? If I have misunderstood, could you explain the purpose of the function DUCATI.CacheConstructor.separate_features_idx? The following is the relevant code:

def separate_features_idx(args, graph):
    separate_tic = time.time()
    # indices of training nodes, recovered from the boolean train mask
    train_idx = torch.nonzero(graph.ndata.pop("train_mask")).reshape(-1)
    # pre-computed access counts, used later to decide what to cache
    adj_counts = graph.ndata.pop('adj_counts')
    nfeat_counts = graph.ndata.pop('nfeat_counts')

    # cleanup
    graph.ndata.clear()
    graph.edata.clear()

    # we prepare fake input for all datasets
    fake_nfeat = dgl.contrib.UnifiedTensor(torch.rand((graph.num_nodes(), args.fake_dim), dtype=torch.float), device='cuda')
    fake_label = dgl.contrib.UnifiedTensor(torch.randint(args.n_classes, (graph.num_nodes(), ), dtype=torch.long), device='cuda')

    mlog(f'finish generating random features with dim={args.fake_dim}, time elapsed: {time.time()-separate_tic:.2f}s')
    return graph, [fake_nfeat, fake_label], train_idx, [adj_counts, nfeat_counts]
initzhang commented 7 months ago

Hi @Changyuan0825, thanks for your interest in our work! The main purpose is to save disk storage. Since the time of accessing/caching node features is not affected by their actual content, we can measure DUCATI's training time with random node features and avoid storing/loading the real ones on disk. As you can see from the code here, you only need to store and load the adjacency data, avoiding reading/writing the large nfeat data to/from disk, which saves a lot of time. However, when verifying accuracy and convergence, you need to load the real node features instead.
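To illustrate the idea in plain PyTorch: as long as the fake tensors have the same shape and dtype as the real node features and labels, every gather along the training data path behaves identically for timing purposes. The sizes below (`num_nodes`, `feat_dim`, `n_classes`, the batch size) are arbitrary placeholders, not values from the repo:

```python
import torch

# Hypothetical sizes standing in for a real dataset's dimensions.
num_nodes, feat_dim, n_classes = 1000, 128, 47

# Fake inputs: same shape/dtype as the real data, random content.
# Cache-access and gather times depend only on shape and dtype,
# not on the stored values, so these are interchangeable with real
# features when benchmarking training throughput.
fake_nfeat = torch.rand((num_nodes, feat_dim), dtype=torch.float)
fake_label = torch.randint(n_classes, (num_nodes,), dtype=torch.long)

# A minibatch gather looks the same with fake or real data:
batch_idx = torch.randint(num_nodes, (64,))
batch_feat = fake_nfeat[batch_idx]   # shape (64, feat_dim)
batch_label = fake_label[batch_idx]  # shape (64,)
```

In the actual code the tensors are additionally wrapped in dgl.contrib.UnifiedTensor so the GPU can access them in pinned host memory; the content-independence argument is the same either way.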

Changyuan0825 commented 7 months ago

I get it. Thank you for your response!