vaticle / typedb

TypeDB: the polymorphic database powered by types
https://typedb.com
Mozilla Public License 2.0

Add vectors as an attribute type. #6327

Open jorenretel opened 3 years ago

jorenretel commented 3 years ago

Problem to Solve

There are many use cases where it would be useful to have vectors as attributes. For example, this data set https://ogb.stanford.edu/docs/linkprop/#ogbl-collab contains vector attributes, and it would be practical to be able to import them into TypeDB as such. There are many more conceivable situations where entities have vector attributes. For instance, small molecules can also be described by a vector. These vectors could, like scalar and categorical attributes, be used in downstream machine learning.

Note: it might be worth immediately considering tensors with more than one dimension as well.

Current Workaround

You could save the vector as a string and parse it into a vector again in the client.
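
A minimal sketch of that workaround, assuming the vector is kept in a string-valued attribute as JSON text. The helper names `encode_vector` and `decode_vector` are hypothetical, and no TypeDB client call is shown; only the client-side conversion is illustrated.

```python
import json
from typing import List


def encode_vector(vector: List[float]) -> str:
    """Serialise a vector to a JSON string suitable for a string attribute."""
    return json.dumps([float(x) for x in vector])


def decode_vector(text: str) -> List[float]:
    """Parse the stored string back into a list of floats on the client."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError("stored value is not a JSON array")
    return [float(x) for x in values]


stored = encode_vector([0.25, -1.5, 3.0])   # what would be written to the database
restored = decode_vector(stored)            # what the client reconstructs after a read
```

The drawback is that the database only sees an opaque string, so it cannot validate the length of the vector or query individual elements.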

Proposed Solution

Add vectors as an attribute type.

Additional Information

-

jorenretel commented 3 years ago

As a comment: I came across this issue: https://github.com/vaticle/typedb/issues/5521 (Using arithmetic operations in Graql queries). It seems likely that once vectors are included as attribute types, people will ask to be able to do simple arithmetic on them in queries as well. I don't want to open a whole crazy can of worms for you here though.

haikalpribadi commented 3 years ago

Great suggestion, @jorenretel . I'm not familiar with the OGBL dataset - can you elaborate further with an example here on what you have in mind? Using TypeQL language, can you try to make up the queries that would describe the data you're working with? If not, some generic code dealing with the "vectors" would be useful too.

jorenretel commented 3 years ago

Hi @haikalpribadi, OGB (https://ogb.stanford.edu/) is a collection of benchmarks for machine learning tasks on graphs, divided into three categories (node property prediction, link property prediction, graph property prediction). Each category has different data sets. There are Kaggle-like leaderboards of people/algorithms for each data set.

I wanted to import some of these data sets into TypeDB to see whether TypeDB could efficiently provide the subgraphs that are used as training examples. Basically, using TypeDB as a data-providing backend for machine learning. Being able to import these datasets into TypeDB and subsequently compare the performance of the downstream ML to what is on the leaderboard would serve as a good sanity check to see whether the pipeline is working well.

In principle we don't need TypeDB to do machine learning on these graphs, as OGB comes with dataloaders for different graph ML libraries like Pytorch Geometric already. But when doing the same thing on other (potentially larger and more complex) knowledge graphs it would be handy to have TypeDB as a backend to manage the whole thing.

Anyway, to get back to the point: many of these data sets (of which the one I linked to is an example: https://ogb.stanford.edu/docs/linkprop/#ogbl-collab) have properties that are vectors. For instance, the example I linked to is a graph where each node is an author of scientific papers. The data set contains a vector of length 128 for each node in that graph describing the author (by averaging word embeddings of all papers written by that author). So these vectors already exist in the data set, but there is no datatype for vectors in TypeDB, so there is no obvious way to import these datasets.

jorenretel commented 3 years ago

To look at some of this data, it is probably easiest to install ogb in a Python environment:

pip install ogb

and subsequently do for instance:

from ogb.linkproppred.dataset import LinkPropPredDataset

dataset = LinkPropPredDataset(name = 'ogbl-collab')

print(dataset.__dict__.keys())
data = dataset[0]
for key, value in data.items():
    print(key)
    if value is not None:
        print(value.shape)
        print(type(value))

haikalpribadi commented 3 years ago

I see what you mean. And yeah, I do think it would be valuable to store vectors / arrays in TypeDB.

dynamic-modeller commented 3 years ago

As an adjunct to this thread, it's worth noting that a vector is essentially an array or basic list. Now these can currently be formed in TypeQL using a relation similar to the following:

temporal_list sub relation,
        owns intIndex,   # not a key
        owns op_type,    # enum of read/write
        owns current,    # boolean value
        relates owner,   # entity owner 
        relates item,    # the data fragment
        relates user_id, # user id 
        relates created; # finance_time

In the case of a basic vector, the index, the owner and the item are required; the user_id and created date are not. The problems here are:

  1. the length of the list is not known, so some mechanism to determine this is required, or a rule that gives the largest index, which can be converted to the length
  2. when inserting new members, the index of the new item is either the length, or the largest index + 1. Currently this requires two TypeQL queries as calculations are not supported
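
The two points above can be illustrated outside TypeDB. The sketch below models each list entry as an (owner, index, item) record, loosely mirroring the relation; the class and method names are hypothetical and nothing here calls a TypeDB API. Appending needs a read (find the largest index) followed by a write, which corresponds to the two queries mentioned.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class RelationList:
    """In-memory stand-in for list entries stored as indexed relations."""
    rows: List[Tuple[str, int, Any]] = field(default_factory=list)

    def length(self, owner: str) -> int:
        # Step 1 (read): length is the largest index plus one, or 0 if empty.
        indices = [i for (o, i, _) in self.rows if o == owner]
        return max(indices) + 1 if indices else 0

    def append(self, owner: str, item: Any) -> int:
        # Step 2 (write): insert at the index computed by the read.
        index = self.length(owner)
        self.rows.append((owner, index, item))
        return index

    def items(self, owner: str) -> List[Any]:
        # Rows carry no inherent order, so sort by index when reading back.
        owned = [(i, item) for (o, i, item) in self.rows if o == owner]
        return [item for (_, item) in sorted(owned, key=lambda row: row[0])]


store = RelationList()
store.append("author-1", 0.12)
store.append("author-1", 0.34)
```

A built-in list type would let the database perform the read and the write as one operation, which is what makes appends awkward in the relation-based encoding.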

Vectors are useful, but this is because they are a form of an array, where indexing is assumed and the length is known. A basic list is the foundation construct, as this can be used in application as a vector, an array, or a list of dicts, as the case may be.

In short, it is lists that are the necessary basic element required to support vectors, arrays and other structures, and as pointed out these need to be capable of extending to multiple dimensions (e.g. a list of lists, or a 2D array/vector). Whether they are built-in attribute types, or constructed using TypeQL, the functional requirements are the same:

MichaelSullivanArchitect commented 3 years ago

Lists/arrays need to be first class citizens so that one can "manage" them no differently than any other TypeDB component. Adding a 4th first class citizen to the current mix of three will address most use cases:

Presumably List-->List would also be allowed.

lveillard commented 1 year ago

This is the correct thread @haikalpribadi , sorry for that.

A first proposition to mix both cardinality (https://github.com/vaticle/typeql/issues/34) and vectors would be something like this (📝 Not the best proposition, but it might open debate so we can find a better way)

Attributes

sub attribute => value string; cardinality one
sub attribute, list string X; => unordered list where X is an optional interval/set of cardinality possibilities
sub attribute, array string X; => ordered list with same X behaviour

Relations

sub relation, relates A X, relates B Y ... => current behaviour (unordered, X=[0,], Y=[0,] by default)
sub vector, relates A X, relates B Y ... => ordered relation

Cardinality values

Needed array ops

Some ideas:

Defining cardinality constraints in the schema

define
#unordered attributes
name sub attribute, value string;            #cardinality ONE
unlimited-tags sub attribute, list string;            #unlimited, no minimum no maximum
dateInterval sub attribute, list datetime {0,2} ;             #can be 0 or 2

#ordered attribute
tags sub attribute, array string [0,5];             #minimum 0, maximum 5 cardinality, ordered
classes sub attribute, array string 2;             #cardinality TWO, ordered
nationalities sub attribute, list string [1,*] ;            #at least one

#unordered relations
$b1-a1 sub relation, relates book, relates author [1,*] #minimum one author per authorship relation

#ordered relations
podium sub vector, relates championship 1 , ordered-relates winners 3;        #relates 1 championship to 3 ordered winners
page-component sub vector, relates page 1, components [0,*];
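
To make the cardinality annotations above concrete, here is a small Python sketch of how they could be interpreted. It assumes, based only on the comments in the proposal, that `[a,b]` is an inclusive interval, `[a,*]` or `[a,]` has no upper bound, `{a,b}` is a set of allowed counts, a bare number is an exact count, and no annotation means unlimited. The function name `cardinality_check` is hypothetical and this is not TypeQL.

```python
from typing import Callable, Optional


def cardinality_check(spec: Optional[str]) -> Callable[[int], bool]:
    """Return a predicate telling whether a value count satisfies `spec`."""
    if spec is None or spec.strip() == "":
        return lambda n: n >= 0                      # no annotation: unlimited
    spec = spec.strip()
    if spec.startswith("[") and spec.endswith("]"):  # interval, e.g. [0,5] or [1,*]
        low_text, high_text = (part.strip() for part in spec[1:-1].split(","))
        low = int(low_text)
        if high_text in ("*", ""):
            return lambda n: n >= low
        high = int(high_text)
        return lambda n: low <= n <= high
    if spec.startswith("{") and spec.endswith("}"):  # set of allowed counts, e.g. {0,2}
        allowed = {int(part) for part in spec[1:-1].split(",")}
        return lambda n: n in allowed
    exact = int(spec)                                # bare number: exact count
    return lambda n: n == exact


tags_ok = cardinality_check("[0,5]")
date_interval_ok = cardinality_check("{0,2}")
```

Under these assumptions, `tags_ok(6)` is rejected, while `date_interval_ok` accepts zero or two values but not one.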

adding at the end

insert
$a isa book, has unlimited-tags ['sci-fi', 'history']  #if it already had some tags, these ones are added to them
$i isa trip, has dateInterval [2022-01-31, 2023-01-30];
$podium (championship: $winter22, winners: [$gold, $silver, $bronze]) isa podium; #if the winners have already been set, this will also give an error
$page-component (page: $dashboard, components: [$comp2, $comp3, $comp4]) isa page-component;  # adds 3 components in that order

removing from the end

replace
$a isa book, has name 'newTitle' #this would replace the current name directly, no need to delete the old one
$i isa trip, has dateInterval [2022-06-30, 2023-01-30] #as an insert this would not work because there are already two values, but it is a replace so it is fine!

adding at position

match
$book has id 'b1';
$page-component has id 'pc1';

insert #at position
$book has tags[0] 'newTag'; #adds tag i
$page-component (components[1]: $comp7) #adds a component edge between current 0 and 1 

Removing at position

match $book has id 'b1', has tags[0] $firstTag, has tags[3] $forthTag
delete
$book has $firstTag;
$book has $forthTag;

Removing by filter

match $book has id 'b1', has tags $filteredTags;
$filteredTags in ['sci-fi','history']
delete
$book has $filteredTags ;

replacing at position

match
$page-component has id 'pc1';
replace
$page-component (components[2]: $newComponent); #unlinks current 3rd component and links $newComponent in its position
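
For readers more familiar with general-purpose languages, the positional operations proposed above correspond roughly to ordinary list operations. The Python sketch below is only an analogy, under the assumptions that positions are zero-based and that inserting at a position shifts later elements to the right; the proposal does not spell out these details, and none of this is TypeQL. The tag values are made up for illustration.

```python
tags = ["sci-fi", "history", "drama", "war"]

# adding at the end
tags.extend(["space", "classic"])

# adding at position: later elements shift right
tags.insert(0, "newTag")

# removing at position: delete higher indices first so positions stay valid
for position in sorted([0, 3], reverse=True):
    del tags[position]

# removing by filter
tags = [t for t in tags if t not in {"sci-fi", "history"}]

# replacing at position: the old element is dropped, the new one takes its slot
tags[2] = "replacement"
```

One design question this raises for the query language is whether several positional deletes in a single query refer to positions before or after the other deletes are applied; the sketch sidesteps it by deleting from the highest index down.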