knowsys / nemo

A fast in-memory rule engine
https://knowsys.github.io/nemo-doc/
Apache License 2.0

Design and implement type system in logical layer #85

Closed: aannleax closed this issue 1 year ago

aannleax commented 1 year ago

As part of #70, we planned to have a separate type system on the physical layer and the logical layer. While the former deals with low-level types like u32 or floats, the latter deals with more abstract concepts like "String", "Integer" and so on. The task is to implement this functionality.

Some more considerations:

mkroetzsch commented 1 year ago

Seems a bit underspecified for a single issue. So let me first clarify the key aspects for the basic design:

Note that, when implementing more of the XML Schema types, one also has to add some internal types to ensure that the "hierarchy" is actually a sup-semilattice (some pairs of types have several upper bounds in XSD but no least, i.e. most specific, one). But that only comes up when considering much more specific types than initially planned.

Then, for the above questions in their order:

  1. Yes, kind of. One might want to avoid a (presumably slow) conversion of EDB data into logical objects as an intermediate stage of parsing, but some cases naturally require such a conversion (e.g., for constants in rules).
  2. Yes, this is the notion of well-typed programs. The first implementation only needs to check if the (pre-defined) types are compatible on this level. Type inference (for predicates not typed explicitly) could come later. Note that allowing rules to be added might affect inferred types (making them wider, possibly conflicting with the supremum-based type computation that was used to make some previously known rules well-typed).
  3. This item mixes up several things. Parsing is a separate issue (a single parsed term should always have a specific type that does not depend on the schema; we determine later whether this specific value can be converted to the type required in that column). The possible differences between the CSV and RDF representations of typed values are also an unrelated concern. Code for interpreting RDF-style term strings into values should be shared among all parsers that support such syntax, but one can also have parsers that deliberately use other forms for representing the same values (e.g., a JSON parser would have its own double representation).
  4. Yes, that is part of the basic design. The source declarations already give arity, so one can easily imagine how to generalize this. For predicates that are not from a source, one could have a separate form of declaration -- or fully rely on inferred types from the start (with a small initial type set, inference might be easy).
  5. Maybe, or maybe not. I would keep this for later, since it is a harder problem. We can always take "no declaration" to mean "rdfs:Resource", but note that EDBs always need to be declared with at least an arity anyway, so saying nothing at all is not an option now. I would rather consider [2] a syntax-level alias for [rdfs:Resource,rdfs:Resource] than do anything "smart" that becomes a source of confusing errors when it behaves unexpectedly.
  6. We can of course support distinct types with the same internal structures and handling. But that's not really a design decision, is it? RDF does indeed have a number of integer types down to bool, which would maybe not all be implemented differently in the physical layer.

monsterkrampe commented 1 year ago

Related discussion: #185

monsterkrampe commented 1 year ago

A basic type system with Any, String, Integer, and Float64 has been implemented. Since there is no type casting during reasoning yet, only Any and String are compatible with each other, because they happen to share the same physical representation.

Still, since this marks a basic milestone of the type system development, I hereby close this issue. We should open new issues for the remaining open points and use the existing discussion here for clarifying additional questions.