neurodata / primitives-interfaces

A mirror of D3M's primitives-interfaces
Apache License 2.0

load_graph: attributes #72

Open alyakin314 opened 4 years ago

alyakin314 commented 4 years ago

part of #66

for graphs that are read in as edgelists, the node attributes are not attached to the graph and thus do not get embedded by the ase, which leads to terrible performance. they should be attached to the graph object at the stage when the graph is read.

alyakin314 commented 4 years ago

unfortunately this is almost infeasible for the classification task (which happens to be the one this matters for the most) at the current stage. this is because the attributes are stored in the learningdata.csv, which gets train-test split. hence, we do not have access to the attributes of the test part of the data at the time we embed during training.

there are some very weird workarounds. one is to access the full (not train-test split) dataset. this needs to be done in a very hacky way, with weird and very fragile path manipulations, AND it is borderline cheating because that csv file has the labels for the test data.

another is a "lazy" approach, which means rewriting our whole framework to not do the embedding at train time, but only freeze the training nodes with their attributes. then at test time we join them with the test attributes, do the embedding, learn the classifier and predict. this both runs counter to foundational ml ideas, because our classifier would be trained right before test time, and would very likely require rewriting a significant portion of our framework (ase and gclass notably).

this issue is dropped until there is a reasonable change in the way d3m handles graph attributes.

CC: @hhelm10 @bvarjavand

alyakin314 commented 4 years ago

the easiest way to resolve this is to ask Mitar or Swaroop to include a nodeID+Attributes csv for all vertices and a nodeID+label csv for the training vertices. alternatively, there could be one csv that has nodeIDs+Attributes for all vertices and labels only for the ones being trained on (the rest can be nan, for example). the former seems more intuitive and easier to adjust to, but both would resolve the embedding issue.
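
a rough sketch of how the single-csv option could be consumed, to make the proposal concrete. this assumes pandas; the column names (`nodeID`, `attr1`, `attr2`, `label`) and the inline data are hypothetical and not the actual d3m format.

```python
import io

import pandas as pd

# hypothetical single csv: attributes for every vertex,
# label filled in only for the training vertices (empty -> NaN)
csv_text = """nodeID,attr1,attr2,label
0,0.1,1.0,a
1,0.4,0.0,
2,0.3,1.0,b
3,0.9,0.0,
"""

nodes = pd.read_csv(io.StringIO(csv_text))

# attributes exist for all vertices, so the embedding step can use them
attributes = nodes[["nodeID", "attr1", "attr2"]]

# rows with a label are the training vertices; rows without are to be predicted
train = nodes[nodes["label"].notna()]
test = nodes[nodes["label"].isna()]

print(train["nodeID"].tolist(), test["nodeID"].tolist())
```

with this layout the embedding sees every vertex's attributes, while the test labels are never present in the file.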

alyakin314 commented 4 years ago

attributes for datasets with edgelists are now provided as nodelists. first of all, this is awesome. second of all, we now need to add a way to load them in.
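
one possible way the loading could look, as a sketch only. it assumes networkx and pandas, and the column names (`source`, `target`, `nodeID`, `attr1`, `attr2`) and inline data are made up for illustration; the real d3m edgelist and nodelist files may be laid out differently, and this is not the existing load_graph code.

```python
import io

import networkx as nx
import pandas as pd

# hypothetical file contents standing in for the edgelist and nodelist
edgelist_text = """source,target
0,1
1,2
"""
nodelist_text = """nodeID,attr1,attr2
0,0.1,1.0
1,0.4,0.0
2,0.3,1.0
3,0.9,0.0
"""

edges = pd.read_csv(io.StringIO(edgelist_text))
nodes = pd.read_csv(io.StringIO(nodelist_text))

# build the graph from the edgelist
graph = nx.from_pandas_edgelist(edges, source="source", target="target")

# add every vertex from the nodelist, including isolated ones
# that never show up in the edgelist (vertex 3 here)
graph.add_nodes_from(nodes["nodeID"].tolist())

# attach the attributes: {nodeID: {"attr1": ..., "attr2": ...}}
attrs = nodes.set_index("nodeID").to_dict(orient="index")
nx.set_node_attributes(graph, attrs)

print(graph.nodes[3])
```

doing this at read time means whatever consumes the graph object downstream gets the attributes along with the edges.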