vaticle / typedb

TypeDB: the polymorphic database powered by types
https://typedb.com
Mozilla Public License 2.0
3.72k stars, 337 forks

TypeDB 3.0 Roadmap #6764

Open flyingsilverfin opened 1 year ago

flyingsilverfin commented 1 year ago

Problem to Solve

We collect the agreed list of changes and requirements that will be in the first version of TypeDB 3.0.

Changes

API

Driver

TypeQL

Value restriction:

Require further discussion:

Relation implementation

Changes proposed and rejected:

haikalpribadi commented 1 year ago

Let's make sure each of them is documented properly in an issue, @flyingsilverfin

flyingsilverfin commented 1 year ago

Yes @haikalpribadi that's what the colons are for :D I have to get to that next

flyingsilverfin commented 1 year ago

Internal changes

A ? indicates not yet fully discussed.

TypeQL

RPC

Pattern & Resolvables

Concepts

Traversal & Reasoner

lveillard commented 1 year ago

Thank you guys for working on this and sharing it with us. There are two key features that severely limit our use cases and that I'm missing in this list:

Optionals/fetch

https://github.com/vaticle/typedb/issues/6322, including a way to have optional played roles as well, not only optional attributes.

Vectors and ordered lists

I see they've been discarded :S, but there is no simple workaround for this: https://github.com/vaticle/typedb/issues/6327

Vectors

We don't need them as a particular attribute type; maybe a @sortable or @indexable annotation when defining relations would work.

Storing something like this in TypeDB is really hard, and mutating it (adding items at particular positions) in a performant way is nearly impossible. [image]

Ordered lists

This also includes ordered lists with repeated values, which are really hard to store in TypeDB. For example: [1,2,2,3,7,2] or ['blue', 'green', 'green', 'red']
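
To make the difficulty concrete, here is a minimal Python sketch (not TypeDB code) of one possible workaround: storing every list item as a separate (position, value) record. The helper names insert_at and read_list are hypothetical. It assumes positions are plain integers, so inserting in the middle means rewriting the position of every later item, which is the part that gets expensive.

```python
# Hypothetical model, not TypeDB API code: each list element is stored as
# its own (position, value) record, so repeated values stay distinct.

def insert_at(records, index, value):
    """Insert value at index, shifting every later record by one position."""
    shifted = [
        (pos + 1, val) if pos >= index else (pos, val)
        for pos, val in records
    ]
    shifted.append((index, value))
    return shifted


def read_list(records):
    """Rebuild the ordered list by sorting the records on position."""
    return [val for _, val in sorted(records, key=lambda r: r[0])]


records = list(enumerate([1, 2, 2, 3, 7, 2]))
records = insert_at(records, 1, 9)
print(read_list(records))
```

In this model a single insert near the front touches almost every stored record, which is why a native ordered type or an annotation would help.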

lveillard commented 7 months ago

I think the proposal is only to remove repeated roles & players

role1: $player1, role1: $player1

While this would keep working:

role2: $player1, role2: $player2

So it's not about removing the cardinality MANY of roles, but removing the possibility of a player to play the same role multiple times.

So basically, no repetition. Which I agree is not the most common use case out there. But it does happen.

Btw, a workaround for this in the new format would be to create an intermediary entity, "event" for instance, so instead of A<>B & A<>B we would have to do A<>EV1<>B & A<>EV2<>B.
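
A rough Python sketch of that idea, under the assumption that a relation behaves like a set of (role, player) pairs once repetition is removed. This is a toy model, not TypeDB code, and the names make_relation and events are hypothetical.

```python
# Hypothetical model, not TypeDB API code: a relation is a set of
# (role, player) pairs, so an exact repeat cannot be represented.

def make_relation(*role_players):
    return frozenset(role_players)


# role1: $player1, role1: $player1 -> the repeat collapses into one pair.
repeated = make_relation(("role1", "player1"), ("role1", "player1"))

# role2: $player1, role2: $player2 -> still two role players.
many = make_relation(("role2", "player1"), ("role2", "player2"))

# Workaround: one intermediary "event" per occurrence. Each event points
# back to the same player, so both occurrences stay distinct.
events = {"EV1": "player1", "EV2": "player1"}
via_events = make_relation(*(("role1", ev) for ev in events))

print(len(repeated), len(many), len(via_events))
```

The cost of the workaround is one extra entity and one extra hop per occurrence when reading the data back.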

brettforbes commented 2 months ago

Hi All,

After speaking to Haikal, there are good reasons to move from the Concept API to Fetch, particularly speed. At present it takes between 2 and 5 seconds to retrieve an object from TypeDB and transpile it to valid STIX JSON. This is mainly due to all of the network roundtrips that have to be done, so clearly a single fetch query will be more efficient.

The advantage of our current system is that it is shape-based, so I can handle all JSON objects using the same ORM; the disadvantage is speed.

The new approach does mean a lot more code, since we have to build quite long Fetch statements for each individual object (e.g. 16-44 lines for each of our 85 objects), and then build the transpile code (from the returned Fetch JSON to STIX JSON). This figure assumes a single main object of 4 lines, plus 3-11 optional sub-objects with relations at 4 lines each, if we use the class hierarchy. But the benefit will be far greater speed, totally agreed.
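
As a quick check of that estimate, the arithmetic can be written out as below. The constant names are hypothetical, and the totals assume every one of the 85 objects has the same number of sub-objects, which is only a bound, not a measurement. Note that 4 + 4n gives 16 lines for 3 sub-objects and 44 lines for 10; with 11 sub-objects it would be 48.

```python
# Back-of-the-envelope estimate using the figures quoted in the comment.
MAIN_OBJECT_LINES = 4      # lines for the single main object
LINES_PER_SUB_OBJECT = 4   # lines for each optional sub-object
OBJECT_COUNT = 85          # number of object types to cover


def fetch_lines(sub_objects):
    """Lines in one Fetch statement with the given number of sub-objects."""
    return MAIN_OBJECT_LINES + LINES_PER_SUB_OBJECT * sub_objects


for n in (3, 10, 11):
    per_object = fetch_lines(n)
    print(n, per_object, per_object * OBJECT_COUNT)
```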

We probably won't be able to make this move for some months, due to resourcing, but we agree it will be worth it. At the same time we can update our 2,500 lines of schema code to v3. This will place us in a good position to add another 50-80 cybersecurity objects (e.g. SBOMs, Vulnerabilities, Risk etc.)

Onwards and upwards for TypeDB and our cybersecurity application!!

lveillard commented 2 months ago

I hope we will get the same tree structure for mutations. Batch mutations and optional mutations are currently a nightmare, while queries with fetch are so smooth.

One possible enhancement would be the ability to use multiple match-fetch blocks in the same query, and the same for mutations, instead of having a single entry point.

This is already possible in the nested branches, where we can open several and assign them to different keys, but it is not possible to have multiple keys at the root level.

lveillard commented 2 months ago

Another key conceptual blocking point in mutations for us is how cardinality MANY is handled. Whenever the match clauses start producing permutations, the insert / delete is run once per permutation, as in a FOR loop.

This issue has an example of one insertion that, counterintuitively, is run N times: https://github.com/vaticle/typedb/issues/6902

In 3.0 I would love to see $vars being aware of their cardinality. The way that dgraph executes this type of mutation is really intuitive: each variable holds an array of iids, so if a match does something like this

match
$jobPosition isa jobPosition, has id 'frontendDeveloper';
$candidate isa Person, has name 'Junior Peter';
$allInterviewers isa Person, has departMentName 'IT';

insert
$selectionProcess (candidate: $candidate, job: $jobPosition, interviewers: $allInterviewers) isa selectionProcess, has id 'selectionProcess1';

This would be run a single time and create a single selectionProcess, as expected.
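
The difference between the two behaviours can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not TypeDB or dgraph code; the interviewer names and both function names are hypothetical.

```python
from itertools import product

# Hypothetical match results for each variable.
job_positions = ["frontendDeveloper"]
candidates = ["Junior Peter"]
interviewers = ["Ana", "Ben", "Cleo"]  # everyone in the IT department


def insert_per_answer():
    """Behaviour as described: the insert runs once per match permutation."""
    return [
        {"job": j, "candidate": c, "interviewers": [i]}
        for j, c, i in product(job_positions, candidates, interviewers)
    ]


def insert_with_array_variable():
    """Wished-for behaviour: $allInterviewers holds all matches at once."""
    return [
        {"job": j, "candidate": c, "interviewers": list(interviewers)}
        for j, c in product(job_positions, candidates)
    ]


print(len(insert_per_answer()))           # one relation per interviewer
print(len(insert_with_array_variable()))  # one relation with all interviewers
```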

Alternatives:

a) To keep FOR loops available as they work now, new syntax for loops could be created, since those are the rarer cases.

b) Another alternative would be to indicate the cardinality of roles when defining the schema, so we know which things are treated as arrays and store multiple iids in the $var, and which things follow the current behaviour.

c) Yet another alternative could be to clearly define array variables, for instance writing []allInterviewers isa .... instead of $allInterviewers isa ....

brettforbes commented 1 month ago

v3.0 is looking awesome, but can you also detail TIME and GPS please

v3.0 is a pretty massive rewrite, and in fact we will probably re-engineer the schema, since originally Tomas adopted the Vaticle style guide and made all of the property names different from the TypeDB ones. The consequence of this is that Fetch statements must be super long to include every property, every sub-object, and all of its properties. If the variable names were the same, Fetch would be more powerful and concise.

Still, the powerful new capabilities of v3.0 make it worth this re-engineering, as long as TIME and GPS are sorted. Please provide architectural best practice for these two, thanks

lveillard commented 1 month ago

So after a lot of thought, I'm changing my wishlist priorities. My key needed feature is being able to share $vars between different streams. This would fix almost every issue we are facing with mutations, and it is something most databases allow.

As an example:

startTx

    insert
        $book isa Book, has id 1;
    ---
    match
        $allAuthors isa Author;
    ---
    insert
        $authorship ($book, $allAuthors) isa Authorship;

endTx
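
The idea of stages sharing variables inside one transaction can be sketched in Python as a list of stages that read from and add to one set of bindings. This is a toy model of the request, not an existing TypeDB feature; run_transaction, the stage functions, and the sample authors are all hypothetical.

```python
# Hypothetical model: every stage receives the bindings produced so far
# and returns new bindings, so later stages can reuse earlier variables.

def run_transaction(stages):
    bindings = {}
    for stage in stages:
        bindings.update(stage(bindings))
    return bindings


authors_in_db = ["author-1", "author-2"]  # stand-in for stored Author entities


def insert_book(bindings):
    return {"book": {"type": "Book", "id": 1}}


def match_authors(bindings):
    return {"allAuthors": list(authors_in_db)}


def insert_authorship(bindings):
    # Uses the book from the first stage and the authors from the second.
    players = [bindings["book"], *bindings["allAuthors"]]
    return {"authorship": {"type": "Authorship", "players": players}}


result = run_transaction([insert_book, match_authors, insert_authorship])
print(result["authorship"])
```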

brettforbes commented 1 month ago

Can LLM Vectors be stored and indexed?

It would be very useful to store LLM vectors along with entities or relations. Can it be done using structs or lists somehow? LLMs are going to keep getting bigger, so it will have to be addressed at some stage. We need to connect TypeDB to natural language meaning, which is a vector in the case of LLMs.

sjpritchard commented 3 weeks ago

It's not just vector storage, but the Approximate Nearest Neighbour search that is also required.
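
For reference, the exact search that an Approximate Nearest Neighbour index tries to approximate can be sketched as a brute-force scan. This is plain Python, not a TypeDB feature; the entity names and two-dimensional vectors are made up for illustration, and real embeddings would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query, vectors, k=1):
    """Exact k-nearest neighbours by scanning every stored vector."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings attached to three entities.
embeddings = {
    "entity-a": [1.0, 0.0],
    "entity-b": [0.0, 1.0],
    "entity-c": [0.7, 0.7],
}
print(nearest([0.9, 0.1], embeddings, k=2))
```

The scan compares the query against every vector, which is what makes a dedicated approximate index necessary once the number of stored vectors grows.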