Presently it is very difficult to architect a concise way to normalise attribute values.
Current problems include:
All Things in the graph are treated as if they could be an attribute. This means that:
Every Thing must carry a field for each of long, double, string, date and boolean in case it is an Attribute. When a Thing is an Attribute, only one of these fields is set to a non-default value.
The vast majority of the attribute value fields are therefore set to a default value. This obscures the meaning of zero: in some cases it is an actual value of zero, in others it is present only because the Thing is not an Attribute. This is particularly difficult to handle for dates, where zero in Unix time is Thursday, 1 January 1970.
Attributes need to be normalised by Type, otherwise the distribution of values from one type will affect that of another.
Normalisation needs to be calibrated on the training set, and the same parameters then reused to normalise any data passed subsequently.
Encoding of the input data takes place inside the TensorFlow computation graph. Adding normalisation there may be non-trivial, and TensorFlow has no out-of-the-box component comparable to the preprocessing.StandardScaler() of scikit-learn.
Should be made easier to accomplish by solving #51
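The two normalisation requirements above (per-Type statistics, calibrated once on the training set and reused afterwards) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `fit_per_type` and `normalise`, the `(type_label, value)` pair layout and the example types are hypothetical and are not part of this project's code.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_per_type(training_values):
    """Compute (mean, std) for each attribute type from (type_label, value) pairs.

    Hypothetical helper: calibrates on the training set only.
    """
    grouped = defaultdict(list)
    for type_label, value in training_values:
        grouped[type_label].append(float(value))

    params = {}
    for type_label, values in grouped.items():
        std = pstdev(values)
        # Fall back to 1.0 for constant-valued types to avoid dividing by zero.
        params[type_label] = (fmean(values), std if std > 0 else 1.0)
    return params


def normalise(type_label, value, params):
    """Standardise one value using the stored parameters for its own type."""
    mean, std = params[type_label]
    return (float(value) - mean) / std


# Illustrative training data: two attribute types with very different scales.
train = [("age", 20), ("age", 40), ("salary", 30000), ("salary", 50000)]
params = fit_per_type(train)

# Data seen later is normalised with the training-set parameters, not refitted.
later_age = normalise("age", 40, params)
later_salary = normalise("salary", 30000, params)
```

Because each type keeps its own mean and standard deviation, the large `salary` values do not distort the scale of `age`.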
As of the 0.2 release these points have been addressed by a very different architecture.
Attribute value embedding is now type-centric. This means:
Values for entities and relations are ignored entirely.
Each attribute can now be embedded using its own neural network component. In this way attribute values only need to be consistently normalised on a by-type basis.
The normalisation can therefore be performed before the data is fed to the network.
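A minimal sketch of what such a pre-network step could look like is given below. It is not the 0.2 implementation: the `Thing` dataclass, its `kind` field, the `preprocess` function and the `PARAMS` values are all hypothetical, chosen only to show values for entities and relations being skipped while attribute values are normalised per type before any network sees them.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Thing:
    """Hypothetical stand-in for a graph element; not the project's class."""
    type_label: str
    kind: str  # "entity", "relation" or "attribute"
    value: Optional[Union[int, float]] = None  # None means "no value", not zero


# Assumed per-type (mean, std) parameters, calibrated earlier on the training set.
PARAMS = {"age": (30.0, 10.0)}


def preprocess(things, params):
    """Normalise attribute values by type; ignore entities and relations."""
    normalised = []
    for thing in things:
        if thing.kind != "attribute" or thing.value is None:
            continue  # values for entities and relations are ignored entirely
        mean, std = params[thing.type_label]
        normalised.append((thing.type_label, (thing.value - mean) / std))
    return normalised


things = [
    Thing("person", "entity"),
    Thing("age", "attribute", 30),
    Thing("age", "attribute", 50),
]
normalised = preprocess(things, PARAMS)
```

In this sketch a missing value is represented as `None` rather than a default of zero, so a normalised value of `0.0` always refers to a real attribute value.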