inkytonik / cooma

The Cooma project is investigating secure programming language design based on fine-grained object capabilities.
Mozilla Public License 2.0
3 stars, 2 forks

Add database capabilities #21

Closed: inkytonik closed this issue 2 years ago

inkytonik commented 3 years ago

The plan is to add command-line capabilities for access to databases. This will require an analysis of the kinds of capability we need (e.g., read, update, maybe fine-grained to tables or parts of tables) and thinking about suitable database targets.

We should also think about the interaction between the database capabilities and the others. E.g., can we use this approach to ensure that database info can't be written to a file? Is that a kind of restricted capability that the user can grant?

The aim is to investigate and demonstrate how the Cooma approach can help to avoid some of the issues that arise in database applications such as web sites. E.g., the accidental reveal of secret information.

nhweston commented 3 years ago

I will outline what I intend for a V1.

Scope

Programming with database capabilities

Programmers intending to accept a database table must provide its table definition in the form of a record type. Each field of the record type corresponds to a column of that table.

For a table definition to be valid, all fields must be of type String. All column values returned from the database will be converted to strings. (Note that NULL will be converted to the empty string.) In the future, we will allow other types to be used. A preruntime check will ensure that the table definition is compatible with the actual database table. Table definitions may omit columns that are present in the table, but they may not include columns that are not present.

As an example, the following is a valid table definition:

type Student = {
    id : String,
    name : String,
    wam : String
}

…and is compatible with a table with the following SQL definition:

CREATE TABLE student (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    gpa DECIMAL(3, 2),
    wam INTEGER
);

(Note that the table definition omits the gpa column.)
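The compatibility rule above (a definition may omit columns but not invent them) amounts to a subset check on column names. A minimal sketch in Scala, assuming the actual column names have already been read from the database (e.g., via the table_info pragma); the object and method names are illustrative, not the real implementation:

```scala
// Sketch of the preruntime compatibility check: every field of the
// table definition must name a column of the actual table; extra
// database columns are allowed.
object TableCheck {
  def compatible(definitionFields: Set[String], tableColumns: Set[String]): Boolean =
    definitionFields.subsetOf(tableColumns)

  def main(args: Array[String]): Unit = {
    val studentColumns = Set("id", "name", "gpa", "wam")
    // The Student definition omits gpa, which is permitted...
    println(compatible(Set("id", "name", "wam"), studentColumns))   // true
    // ...but a field with no matching column is rejected.
    println(compatible(Set("id", "email"), studentColumns))        // false
  }
}
```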

Unlike all capabilities that exist in Cooma so far, the database capability has a type constructor, defined as:

type Table = fun (A : Type) {
    all : () Vector(A)
}

Note that A must be a valid table definition, and this will be enforced at compile time. (However, its compatibility with a database table can only be checked at preruntime.)

Table::all, when applied, will return all rows of the table.

Here is an example of a simple program that returns all rows from a table:

fun (students : Table(Student)) students.all()

There is a problem with parameterised capability types evident in this example that will need to be addressed: where is the Student type defined? Currently, a function expression must be at the top level to be treated as the program itself, which leaves no room for definitions that a programmer may want to use in the types of program arguments.

Of course, the programmer could do something like this:

fun (students : Table({
    id : String,
    name : String,
    wam : String
})) students.all()

…but this is certainly not ideal.

Command-line arguments

SQLite databases are stored as files, so a command-line argument can simply provide a path to the database file, followed by the table name. For example:

run -r program.cooma ./database.db:student

Before execution, Cooma will check that the database and table exist, and that the table definition is compatible.

Implementation

Primitives.scala is becoming quite long and I think it is appropriate to place file, HTTP, and database capabilities into separate files.

A challenge in implementing database capabilities is managing the connection, which is essentially state. The connection needs to be opened during the preruntime phase to check the compatibility of the table definition. The same connection should persist and be reused until the program exits. This breaks the current assumption that capabilities are essentially stateless functions. The easiest solution would be to have the database connection form mutable state within Primitives.

The Java SQL library is sufficient for our purposes. Table definition compatibility can be checked using the SQLite table_info pragma.

inkytonik commented 3 years ago

All of this sounds good to me. It seems in keeping with the way the other types of capability argument work and a good restricted first step to see how things go.

Regarding the problem with needing to define Student before the main function, it seems that we need to support any programs that have function type, rather than just ones that are function expressions. (And this seems like a good thing anyway, even ignoring the DB capabilities.) I haven't thought through the details, but maybe it's possible to adapt the main program analysis and compilation to handle this kind of program, driving everything off the type of the program rather than its form.

Splitting the primitives up sounds like a good idea.

I'm not sure of the best way to handle the DB state. Keeping it in Primitives seems ok, but does that handle a program that might want to access more than one DB? I guess you can keep a table of state accessed by DB and table name.

nhweston commented 3 years ago

> Regarding the problem with needing to define Student before the main function, it seems that we need to support any programs that have function type, rather than just ones that are function expressions.

I take it you mean that if a program evaluates to a function, then that function is what accepts the command-line arguments? I like this idea.

> I'm not sure of the best way to handle the DB state. Keeping it in Primitives seems ok, but does that handle a program that might want to access more than one DB? I guess you can keep a table of state accessed by DB and table name.

Yes, I was thinking of having the state be a mutable.Map[String, Connection]. I intend that for V1 we will only handle the case where there is at most one Table parameter. I think in the future we will want to have a command-line argument correspond to a database, and the parameter be a record whose fields correspond to tables.
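A sketch of that cache idea, assuming connections are keyed by database path. `Connection` here is a stand-in for java.sql.Connection, and in the real implementation `forDatabase` would open via DriverManager; all names below are illustrative:

```scala
import scala.collection.mutable

// Sketch of the proposed connection cache held as mutable state
// within Primitives. The connection is opened once (at preruntime,
// for the compatibility check) and reused until the program exits.
final case class Connection(path: String)

object Connections {
  private val cache = mutable.Map[String, Connection]()

  def forDatabase(path: String): Connection =
    cache.getOrElseUpdate(path, Connection(path))
}
```

Looking up the same path twice returns the same cached connection, so a program touching several databases just populates several entries in the map.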

inkytonik commented 3 years ago

> I take it you mean that if a program evaluates to a function, then that function is what accepts the command-line arguments? I like this idea.

Yes, that's right. I suggest for now you stick with your workaround of inlining the Student type in the argument position. I'll add a note to look at the more general solution when I can.

> Yes, I was thinking of having the state be a mutable.Map[String, Connection]. I intend that for V1 we will only handle the case where there is at most one Table parameter. I think in the future we will want to have a command-line argument correspond to a database, and the parameter be a record whose fields correspond to tables.

This all sounds good.

nhweston commented 3 years ago

For V2, the command-line argument will take the form:

Database({
    table0: Table({
        field0: String,
        field1: String,
        …
    }),
    table1: Table({
        field0: String,
        field1: String,
        …
    }),
    …
})

The type constructor definitions are:

Database(A) = A
Table(A) = { all : () Vector(A) }

Notice that Database is just the identity type constructor. Its purpose is to ensure the compiler recognises the command-line argument as a database.
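The "identity constructor as a marker" idea can be rendered in Scala terms (a sketch only; Cooma's own types are structural and this is not its syntax):

```scala
object Constructors {
  // Table(A) wraps the row type with its query operations.
  final case class Table[A](all: () => Vector[A])

  // Database(A) = A: the identity constructor adds no structure; it
  // exists only so the compiler can recognise the argument as a database.
  type Database[A] = A

  def main(args: Array[String]): Unit = {
    val students: Table[(String, String)] =
      Table(() => Vector(("1", "Alice")))
    // A Database of tables is just the record of tables itself.
    val db: Database[Table[(String, String)]] = students
    println(db.all().size)   // 1
  }
}
```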

inkytonik commented 3 years ago

LGTM. Sometime we might want to add a way to get the command-line info from a file. These DB capability specifiers are going to get long... I'll add an issue.

nhweston commented 3 years ago

Just had a thought.

With the design I've put forward so far, each table essentially has two names: 1) the actual name of the table in the database provided in the command-line argument, and 2) the name of the field corresponding to the table in the top-level type.

It might make more sense to have the user provide only the database path, and for tables to be specified in the type. If the field in the top-level type corresponding to a table is named foo, then the table with actual name foo will be accessed by that primitive.
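Under this single-name design, the SQL that a primitive such as Table::all issues can be derived directly from the field name and the table definition's fields. A hypothetical sketch (names and query shape are assumptions, not the actual implementation):

```scala
// Hypothetical sketch: the field name in the top-level type doubles as
// the SQL table name, so the query for Table::all is just a SELECT of
// the definition's columns from that table.
object Queries {
  def selectAll(fieldName: String, columns: Seq[String]): String =
    s"SELECT ${columns.mkString(", ")} FROM $fieldName"
}
```

For example, `Queries.selectAll("student", Seq("id", "name", "wam"))` yields `SELECT id, name, wam FROM student`.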

I suppose one could argue there are cases where this might not be desirable. One might want to have a program that could apply to different tables. For example, a program that accepts a table and returns a list of IDs, or a row count. However, most database libraries tend to require a full schema, with exact table names.

I'm thinking I will take this approach for now since it's easier to implement.

inkytonik commented 3 years ago

Seems like a good idea to me.

inkytonik commented 3 years ago

The plan now is to add insertion, update and delete.

nhweston commented 3 years ago

One issue with getById (and also delete) is the primary key.

My first thought was to have an identity function named PrimaryKey, used by the compiler to identify the primary key. For example:

Table({ id : PrimaryKey(Int), name : String })

The top-level argument check would need to verify that exactly one column is marked as the primary key.

I realised that this doesn't solve the entire problem. We need the type above to have the following field:

getById: (Int) Option({ id : Int, name : String })

How does the semantic analyser recognise that the parameter type is Int? It can't, because as far as the semantic analyser is concerned, PrimaryKey(Int) =:= Int.
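The same transparency problem shows up with an identity type alias in Scala, which may make the issue concrete (a sketch; `PrimaryKey` here is just an alias, not Cooma's construct):

```scala
object AliasTransparency {
  // An identity alias is indistinguishable from its underlying type:
  // the checker sees PrimaryKey[Int] and Int as the same type, so it
  // cannot recover which field was "marked" as the key.
  type PrimaryKey[A] = A

  val marked: PrimaryKey[Int] = 1
  val plain: Int = marked   // accepted: the alias is fully transparent
}
```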

My next thought is that the type of the primary key could also be a type parameter to Table, for example:

Table(Int, { id : PrimaryKey(Int), name : String })

This would require Table to be defined as:

(K : Type, A : Type) {
    …
    getById: (K) Option(A)
    …
}

I can confirm that such a definition currently causes the semantic analyser to crash.

Even if this worked, there's still a significant shortcoming. The compiler still needs to be able to identify the primary key, so the PrimaryKey marker is still necessary. The semantic analyser cannot check that the field marked with PrimaryKey is of type K. I suppose this could be done by the top-level argument check. But this is starting to feel like misplacing responsibility.

One way this could work is to enforce that all primary keys must have a particular name (e.g. id). In both Slick (Scala) and Prisma (NodeJS), this is the only feasible type-safe way (as far as I can tell) to define a generic getById method. SQLAlchemy (Python) has a get (by primary key) method, but this relies on dynamic typing.

The getById functionality in most ORMs is realised using a function or some data corresponding to an SQL WHERE clause, but that requires a query DSL which we've decided we don't want to get into.

I am thinking we don't have much of a choice but to enforce that primary keys can only be of type Int. Then we can have PrimaryKey simply be an alias for Int, recognised by the compiler as specifying the primary key.
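With that restriction, the shape of getById becomes straightforward. A small sketch in Scala over an in-memory vector of rows (the names and the lookup-by-find are illustrative, standing in for the real SQL query):

```scala
object Keys {
  // PrimaryKey is just Int to the type checker; the compiler would
  // treat it as a marker identifying the key column.
  type PrimaryKey = Int

  final case class Student(id: PrimaryKey, name: String)

  // getById returns the row with the given integer primary key, if any.
  def getById(rows: Vector[Student], key: PrimaryKey): Option[Student] =
    rows.find(_.id == key)
}
```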

inkytonik commented 2 years ago

Sorry for not having time to look at this. I don't think I appreciate all the nuances, so it's probably best that we discuss at tomorrow's meeting. Or maybe you've solved it in the meantime? :-)

inkytonik commented 2 years ago

Plan is to go ahead with first version assuming integer primary keys.