jaxels20 / link-predection-on-dda


Pipeline overview #12

Closed Kasper98-png closed 1 year ago

Kasper98-png commented 1 year ago

If PyTorch Geometric knows that the graph is bipartite, both rows of edge_index start from 0, but they index into the two disjoint node sets of the bipartite graph separately (each row refers to nodes in its own set). See https://www.youtube.com/watch?v=mz9xYNg9Ofs at 18:00.
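A minimal sketch of what that looks like in PyTorch Geometric, using the ('drug', 'may_treat', 'disease') edge type from this project (the node counts, feature sizes, and edge values are made up for illustration):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data['drug'].x = torch.randn(3, 16)     # 3 drug nodes
data['disease'].x = torch.randn(2, 16)  # 2 disease nodes

# Both rows start from 0, but the first row indexes into the drug set
# and the second row into the disease set (two disjoint index spaces).
data['drug', 'may_treat', 'disease'].edge_index = torch.tensor([
    [0, 1, 2, 0],  # drug indices
    [0, 0, 1, 1],  # disease indices
])
```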

In RandomLinkSplit, disjoint_train_ratio determines how much of the training data is actually put into the training set. Training data in RandomLinkSplit is whatever is left of the dataset after removing validation and test data. When disjoint_train_ratio=1, none of the remaining data is used as training data (or just one edge; edge_label_index has length 1). When disjoint_train_ratio=0, all of the remaining data is used as training data. This means that if you have 10% validation and 20% test, you would expect training to be 70%, but this is only the case when disjoint_train_ratio=0; otherwise, the training data will be less than 70%. RandomLinkSplit puts a label on all edges that should be considered part of the specific set. If the training data is 70%, then 70% of the edges in the graph get a label and an entry in edge_label_index. The edge_label_index is used in the LinkNeighborLoader. disjoint_train_ratio from the documentation: "If set to a value greater than 0.0, training edges will not be shared for message passing and supervision. Instead, disjoint_train_ratio edges are used as ground-truth labels for supervision during training. (default: 0.0)"
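A sketch of the split with the percentages mentioned above (10% val, 20% test), assuming the heterogeneous data object from before and that the reverse edge type ('disease', 'rev_may_treat', 'drug') exists:

```python
import torch_geometric.transforms as T

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.2,
    # 0.0: the remaining 70% is used for both message passing and supervision.
    # Values > 0.0 move that fraction of the training edges into supervision only.
    disjoint_train_ratio=0.0,
    edge_types=[('drug', 'may_treat', 'disease')],
    rev_edge_types=[('disease', 'rev_may_treat', 'drug')],
)
train_data, val_data, test_data = transform(data)
```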

LinkNeighborLoader: First selects a sample of nodes existing in the edge_label_index tensor. The nodes that exist in edge_label_index are nodes that were selected in RandomLinkSplit. If batch_size=100, it selects the first 100 edges from edge_label_index, and then the next 100 edges for the next batch. Then it samples num_neighbors[i-1] neighbors for each node at each iteration i in a batch; len(num_neighbors) determines the number of iterations. For directed graphs it is a good idea to have iterations/hops >= the number of layers in the GNN. You would think that after sampling neighbors, the set of sampled edges would be larger than batch_size, since the initial selection is already of size batch_size. How it reduces the number of edges to batch_size, I don't know.
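A sketch of the loader as described, with batch_size=100 and two hops; shuffle=False so the first batch really is the first 100 edges of edge_label_index:

```python
from torch_geometric.loader import LinkNeighborLoader

edge_type = ('drug', 'may_treat', 'disease')
loader = LinkNeighborLoader(
    train_data,
    num_neighbors=[10, 5],  # len() == 2 hops; keep >= number of GNN layers
    edge_label_index=(edge_type, train_data[edge_type].edge_label_index),
    edge_label=train_data[edge_type].edge_label,
    batch_size=100,
    shuffle=False,  # batches are taken in order from edge_label_index
)
batch = next(iter(loader))
```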

As far as I understand, edge_label should match edge_label_index, such that the first entry in edge_label_index (representing an edge) has the first element of edge_label as its label. edge_index represents all edges in the graph, and edge_label_index represents all edges that have a label. Only the edges that have labels are included in the computations; for example, after the train-val-test split, only the edges with a label are used in the LinkNeighborLoader. The edge_index tensor is bigger than the edge_label_index tensor after the split. Before the split, the edges do not have an edge_label; edge_label_index and edge_label do not exist in the graph.
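The alignment can be checked directly; edge_label[i] is the label of the edge stored in column i of edge_label_index:

```python
store = train_data['drug', 'may_treat', 'disease']

print(store.edge_index.size(1))        # message-passing edges
print(store.edge_label_index.size(1))  # supervised (labeled) edges
print(store.edge_label.size(0))        # one label per column of edge_label_index

src, dst = store.edge_label_index[:, 0]
print(f'edge ({src}, {dst}) has label {store.edge_label[0]}')
```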

Kasper98-png commented 1 year ago

More about LinkNeighborLoader

It inherits from LinkLoader, which selects some "example indices". At first it selects the first batch_size indices (with batch_size=8, the first batch is [0, 1, 2, 3, 4, 5, 6, 7]). Together with the "example indices" it samples "the row of the edge index in COO format" and "the column of the edge index in COO format". These could for example be [1523, 2172, 804, 406, 1615, 1162, 999, 3273] and [790, 817, 781, 39, 95, 8, 45, 510]. They are not sequential like the "example indices"; they correspond to the first batch_size values in edge_label_index. So it does not perform any smart selection or anything fancy, it just selects the first values in edge_label_index (these are not ordered by value by RandomLinkSplit; they seem randomly ordered). From these it must be sampling edges according to num_neighbors.

LinkLoader outputs an object with nodes, row, col, edge, batch=None, and metadata. nodes, row, col, and edge are dictionaries. In the nodes dictionary the keys are the node types (drug, disease); in row, col, and edge the keys are the edge types (('drug', 'may_treat', 'disease'), ('disease', 'rev_may_treat', 'drug')). The value of each key in each dictionary is a tensor of values (presumably node indices). The tensor from row[('drug', 'may_treat', 'disease')] and the tensor from col[('drug', 'may_treat', 'disease')] become the edge_index. metadata contains three tensors: the first is the "example indices"; the second is a nested tensor with two lists, which become the edge_label_index for the batch; the last is the edge_label.
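The internal sampler output described above is what gets converted into the mini-batch object that the loader actually yields, so the same pieces are visible there as public attributes:

```python
edge_type = ('drug', 'may_treat', 'disease')
batch = next(iter(loader))

print(batch['drug'].num_nodes, batch['disease'].num_nodes)  # sampled nodes
print(batch[edge_type].edge_index.shape)        # built from the row/col tensors
print(batch[edge_type].edge_label_index.shape)  # the nested tensor from metadata
print(batch[edge_type].edge_label.shape)        # the last metadata tensor
```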

Kasper98-png commented 1 year ago

Spelled out step by step, starting from the documentation:

_"More specifically, this loader first selects a sample of edges from the set of input edges :obj:edge_label_index (which may or not be edges in the original graph) and then constructs a subgraph from all the nodes present in this list by sampling :obj:num_neighbors neighbors in each iteration."_

When they write that it first selects a sample from the edges in edge_label_index, it means that it picks the first batch_size edges. It then has three lists, where the first list is just a sequence of indices (for the first batch it is [0, 1, 2, ..., batch_size - 1]). The second list holds the first indices of the from-nodes in edge_label_index, and the third list holds the first indices of the to-nodes in edge_label_index. It then samples neighbors starting from the nodes that came first in edge_label_index. I assume that the edges it puts into edge_label_index come from the edges it finds while sampling neighbors. Then the question is how it selects only batch_size edges out of all the edges it found by sampling neighbors. The resulting edge_label_index that we get in the first batch, when the batch size is 4, can for example be tensor([[0, 3, 1, 2], [1, 2, 0, 3]]). It always has indices smaller than the batch size. Whether it just creates random edges between the first 4 nodes in drugs and diseases, I don't know... All the magic seems to happen in the method sample_from_edges, which is defined in NeighborSampler.
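A plausible explanation for the small indices (not verified against the source here): the sampled subgraph is relabeled to batch-local indices, with the seed nodes from edge_label_index placed first, so the batch's edge_label_index points into the local node sets rather than the global graph. In recent PyG versions the mapping back to global ids is exposed as n_id on each node type; this attribute may not exist in older versions:

```python
batch = next(iter(loader))
local = batch['drug', 'may_treat', 'disease'].edge_label_index

# Map batch-local indices back to global node ids (assumes `n_id` exists).
global_src = batch['drug'].n_id[local[0]]
global_dst = batch['disease'].n_id[local[1]]
# With shuffle=False these should match the first batch_size columns
# of the input edge_label_index.
print(global_src, global_dst)
```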

Kasper98-png commented 1 year ago

LinkNeighborLoader

Kasper98-png commented 1 year ago

When we have run through all the batches, we have run through the whole graph: there are no duplicate edges between batches, all edges are included, and the nodes involved in the batch edges are the same nodes as in the graph's edge_index.
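A small check of that claim, again assuming the n_id attribute is available to map local indices back to global ids:

```python
import torch

seen = []
for batch in loader:
    local = batch['drug', 'may_treat', 'disease'].edge_label_index
    seen.append(torch.stack([
        batch['drug'].n_id[local[0]],
        batch['disease'].n_id[local[1]],
    ]))

seen = torch.cat(seen, dim=1)
expected = train_data['drug', 'may_treat', 'disease'].edge_label_index
print(seen.size(1) == expected.size(1))  # every input edge appears exactly once
```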