Common database interface?

We have a bunch of database for various objects in Oscar. At the very least, various group libraries:

small groups
transitive groups
primitive groups
perfect groups

But certainly we have or will have access to many more, e.g.

matroids with various properties
polytopes with various properties
number fields with various properties
character tables of various groups
... and surely much more (feel free to add comments indicating any DBs you plan to add or would like to see added)

I think it would be good to spend some thoughts on how to access these in a consistent way. Of course our different DBs have quite different content and needs, so I don't expect this to be completely uniform, and that's fine. But we can still try to at least get some consistency in there. Let's discuss how that might look.

For now, I am mostly concerned about read access. Of course storage and writing are also important, but have a much bigger scope (and maybe involve MaRDI as well)

I'll make a few suggestions here, just to get a discussion started. Note that a central observation is that databases and dictionaries are in many ways similar, so it bears thinking about using similar conventions to access them.

1. Representation of the databases

Each "database" could either be represented by an object, or by singleton type. Example: TransitiveGroupsDB (or TransitiveGroupsLibrary, or whatever).

For some reason, the latter appeals more to be, but this is purely subject, I can not yet formulate good reasons either way. But it may not be necessary to formalize this at all; what to me really counts is that the user should have some Julia identifier to "represent" that database. Whether this is transitive_groups() or TransitiveGroupsDB is not so important to me in the end.

2. Database lookup via ids / identifiers (to objects)

Each object in a database is index with a unique "key" or "identifier". If one knows one of these, it should be possible to retrieve an object with this.

old: transitive_group(6,1)
new: get(TransitiveGroupsDB, (6,1))
and/or: TransitiveGroupsDB[6,1] or TransitiveGroupsDB[(6,1)]

3. Identification: Database lookup via objects (to ids)

For some databases, it may also be useful / possible to be able to take an object and "identify" it in the database. I.e. determine if an object which is in some to be defined sense equivalent to the given object is in the DB, and if so, return its ID.

old: transitive_identification(G) resp. (degree(G), transitive_identification(G))
new: identify(TransitiveGroupsDB, G)
and/or: lookup, identifier ...
also: hasidentifier(TransitiveGroupsDB, G) or so to test if the identification was already computed and stored in the object

4. Queries

The main functionality of a database usually is to run queries on it, with various filters to "select" elements. The general thing would be to return an iterable object of results (this could be a Vector, or a "proper" iterator, which then can also be lazy). The query filter should be in a way so that the DB can optimize it, i.e. not a pure Julia function...

old: all_transitive_groups(degree => 4, order => 10:50, isabelian, !iscyclic) selects all transitive groups of degree 4, with order between 10 and 50; which are abelian but not cyclic.
new: not completely sure what makes most sense given the constraints... some ideas:
- findall(TransitiveGroupsDB, degree => 4, order => 10:50, isabelian, !iscyclic)
- findall(TransitiveGroupsDB, (degree => 4, order => 10:50, isabelian, !iscyclic))
- findall(degree => 4, order => 10:50, isabelian, !iscyclic, TransitiveGroupsDB)
- findall((degree => 4, order => 10:50, isabelian, !iscyclic), TransitiveGroupsDB)
- ...
basically, a list of predicates that the object should satisfy, in a way that makes it possible to programmatically parse it, representing a disjunction of clauses; the clauses I use here are either predicate (names), their negations, pairs mapping a predicate to a value or list of values desired for it. Of course this does not cover everything that's possible by far, but in my applications usually is enough. But (a) this can be extended, (b) we don't even have to prescribe this for every DB; each DB could use its own conventions (though of course it'd be nice if they behaved similarly, but there is no reason to stick to that slavishly)
if desired, we could also offer macros to make this even nicer, e.g. @findall(degree == 4 && order in 10:50 && !iscyclic && isabelian, TransitiveGroupsDB). But let's not overengineer for now :-)

5. Specialized queries

Other APIs, like e.g. findfirst, could have default implementations of the above. But of course specialized methods can be provided as an optimization. E.g. often one just wants to get a count of objects with a certain property.

old: number_transitive_groups(6) or length(all_transitive_groups(degree => 4, isabelian))
new: count(TransitiveGroupsDB, degree => 6) or count(TransitiveGroupsDB, degree => 4, isabelian) or ... well, basically the same variations as in the previous point are possible

6. Querying availability

It can also be useful to query if certain subcollections are "complete": e.g.

does this database contain all transitive groups of a given degree? Resp. "For which degrees are the transitive groups available?"
for which orders/sizes are all groups available? / are the groups of a given order available?
are all polytopes with property X in the database? / for which values of SOME_PROPERTIES are all polytopes available?
etc.

Suggestions, additions, alternatives welcome. Note that not every DB would have to offer all of these, nor would we have to do all of this at once. But if we can agree on some rough plan, I'd try to implement it immediately for the group libraries (and then this could be changed and refined over time).

CC @fieker @micjoswig @lkastner @benlorenz

1. Representation of the databases

Each "database" could either be represented by an object, or by singleton type. Example: TransitiveGroupsDB (or TransitiveGroupsLibrary, or whatever).

For some reason, the latter appeals more to be, but this is purely subject, I can not yet formulate good reasons either way. But it may not be necessary to formalize this at all; what to me really counts is that the user should have some Julia identifier to "represent" that database. Whether this is transitive_groups() or TransitiveGroupsDB is not so important to me in the end.

I would prefer the former, since I feel it would be a bit easier to swap out the "default" database for something that I only have locally? Like replace get(TransitiveGroupsDB, ...) by get(MyLocalExtendedTransitiveGroupsDB, ...)? Probably I am misunderstanding so let me ask two questions for clarification. If TransitiveGroupsDB is a type, then get(TransitiveGroupsDB, ...) would first compute the path via get_path(TransitiveGroupsDB)? Or where is the actual data about the information of the database stored and how would one swap the default version for a local version of the DB on the fly?

In reality, get(TransitiveGroupsDB, ... could do anything. Right now, it'd just map to the corresponding GAP functions, and would not do anything at all about computing paths etc.

Using a custom version of a database is something that is highly specific to how each database is implemented, I don't think we can do something generic about this. But sure, it would be useful to allow custom variants of databases, but e.g. for the transitive groups I have no idea at all how that should work. For a more traditional database (which accesses a SQL or Mongo or whatever database on a server or in a file), of course one could think about specifying an alternate URL / file path. But that could be done either by a set_server_url(MyDatabase, url) method OR by allowing something like Oscar.MyDatabase = OtherInstanceOfMyDatabase(url)) resp. get(OtherInstanceOfMyDatabase(url), ...).

So I think this is a good point (the desire to allow specifying alternate variants of a DB), but it seems to me that this can be done in either model.

We have the following interfaces to the polydb in Polymake.jl:

julia> db = Polymake.Polydb.get_db();

julia> collection = db["Polytopes.Lattice.SmoothReflexive"];

julia> results = Polymake.Polydb.find(collection, "DIM"=> 3, "N_VERTICES"=> 6);

julia> for x = results println(Polymake.getname(x)); end
F.3D.0001
F.3D.0013
F.3D.0014
F.3D.0016

julia> query = Dict("DIM"=>3, "N_FACETS"=>5);

julia> results = Polymake.Polydb.find(collection, query);

julia> typeof(results)
Polymake.Polydb.Cursor{Polymake.BigObject}

The cursor is a lazy iterator backed by the corresponding MongoDB server, this still needs to be wrapped for Oscar and needs some print methods and related stuff (#895). The server that is used can be changed with an environment variable (this is used for the CI) but it would be easy to add extra parameters to the get_db function for that as well.

We also have an alternative macro-based interface that is inspired by Query.jl:

julia> db = Polymake.Polydb.get_db();

julia> results = db |>
       Polymake.Polydb.@select("Polytopes.Lattice.SmoothReflexive") |>
       Polymake.Polydb.@filter("DIM" == 3, "N_VERTICES" == 8) |>
       Polymake.Polydb.@map() |>
       collect
7-element Vector{Polymake.BigObject}:
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x00000000028c5320)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x000000000abd7b40)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x000000000a6d7bf0)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x000000000a431470)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x000000000bcaf290)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x00000000098fb670)
 Polymake.BigObjectAllocated(Ptr{Nothing} @0x000000000a1ba460)

Most databases will have some kind of level between the objects and the database like collections for MongoDB or tables for SQL databases. So I think it would make sense to model the interface like that as well, i.e. GroupsDB.TransitiveGroups or db["TransitiveGroups"].

cc: @alexej-jordan who wrote these interfaces

We are actively working on these things. I fully agree that it is desirable to have all of the above in a database for OSCAR.

Yet we are after a comprehensive solution, which also scales computationally. Lars' serialization efforts are a necessary first step. In fact, we are in the middle of (fingers crossed!) hiring a programmer for MaRDI whose main task will be exactly this. In fact, everything of the above (and more) has been a major motivation for me to get involved with MaRDI in the first place.

One comment: a database is a database is a database. Phrased differently, if the stuff you want to do cannot be done by mongo and friends, then your problem is not a database.

For now, please be patient. We will come back to this after the MaRDI hiring suceeded.

Phrased differently, if the stuff you want to do cannot be done by mongo and friends, then your problem is not a database.

I strongly disagree with this. I have databases that have infinite content, and Mango cannot handle that, or at least in a way that would be useful (assuming that the Mongo DB stores each mathematical object as one database object). Even for the finite databases: the small groups database (which is infinite, but let's suppose we restrict to a finite part) is far too big for naive storage in Mongo. We have a small semigroups database that stores over 52 trillion objects (and thanks for clevrer encoding does it with less than 1 bit per object). No way a Mongo instance (or any classical database) can handle this on my laptop in a meaningful way.

Lars' serialization efforts are a necessary first step.

I also disagree with this. From my point of view, there is zero motivation to re-encode the above databases.

All in all, I think trying to force all databases in a single format, be it Mongo or something else, is useful, possible, or desirable. It's a deadend.

@benlorenz thanks that's insightful, and doesn't seem far from what I sketched above.

BTW, I just noticed I forgot a category of "availability" queries, so I added a note on this. But no proposal yet how that could look, haven't had time to think about it properly. Perhaps modelling this via "collections" (as in your example) would be a way to go about it. Dunno...

I strongly disagree with this. I have databases that have infinite content, and Mango cannot handle that, or at least in a way that would be useful (assuming that the Mongo DB stores each mathematical object as one database object).

I am not denying that this is useful to have, but for me this is not a database but something else. First of all, for me a database is finite. I do not think it is useful to abstract things too much, because we want to exploit the advantages of databases like mongo. We actually need this (like indexing in more than one way, to make fast retrievals of small subsets of billions of objects); that's a database problem.

In math there are scenarios (e.g., for your semigroups) where you can store data in some handcrafted ways. But this typically does not lend itself to generalizations, by definition. You may call it a "database", but that's just blurring essential differences, and I don't think that's useful.

Of course, there is nothing wrong with making interfaces look similar for other software components looking similar to databases.

Regarding metadata/availability: We store an info document for each collection which usually describes the what objects it contains, references for the data, and lists the authors of the collection. And there is a json schema for each collection which specifies all present fields.

julia> Polymake.Polydb.info(collection)
SECTION: Polytopes
A collection of families of polytopes

SECTION: Polytopes.Lattice
This database contains various classes of lattice polytopes.
Maintained by Andreas Paffenholz, paffenholz@opt.tu-darmstadt.de, TU Darmstadt

    COLLECTION: Polytopes.Lattice.SmoothReflexive
    Smooth reflexive lattice polytopes in dimensions up to 9, up to lattice equivalence. The lists were computed with the algorithm of Mikkel Oebro (see [[http://arxiv.org/abs/0704.0049|arxiv: 0704.0049]]) and are taken from the [[http://polymake.org/polytopes/paffenholz/www/fano.html|website of Andreas Paffenholz]]. They also contain splitting data according to [[https://arxiv.org/abs/1711.02936| arxiv: 1711.02936]].
    Authored by 
        Andreas Paffenholz, paffenholz@opt.tu-darmstadt.de, TU Darmstadt
        Benjamin Lorenz, paffenholz@opt.tu-darmstadt.de, TU Berlin
        Mikkel Oebro
    Fields: AFFINE_HULL, CONE_DIM, DIM, EHRHART_POLYNOMIAL, F_VECTOR, FACET_SIZES, FACET_WIDTHS, FACETS, H_STAR_VECTOR, LATTICE_DEGREE, LATTICE_VOLUME, LINEALITY_SPACE, N_BOUNDARY_LATTICE_POINTS, N_EDGES, N_FACETS, N_INTERIOR_LATTICE_POINTS, N_LATTICE_POINTS, N_RIDGES, N_VERTICES, REFLEXIVE, SMOOTH, SELF_DUAL, SIMPLE, TERMINAL, VERTEX_SIZES, VERTICES, VERTICES_IN_FACETS, VERY_AMPLE, ALTSHULER_DET, BALANCED, CENTROID, DIAMETER, NORMAL, N_HILBERT_BASIS, IS_PRISM, IS_PRODUCT, IS_SKEW_PRISM, IS_SIMPLEX_SUM, PRISM_BASE, PRODUCT_FACTORS, SIMPLEX_SUM_BASES, SKEW_PRISM_BASES

The available fields can also be accessed directly:

julia> collection = db["Matroids.Small"]
Polymake.Polydb.Collection{Polymake.BigObject}: Matroids.Small

julia> Polymake.Polydb.get_fields(collection)
27-element Vector{String}:
 "DUAL"
 "N_BASES"
 "TUTTE_POLYNOMIAL"
 "SERIES_PARALLEL"
 "N_FLATS"
 "SPLIT_FLACETS"
 ⋮
 "TERNARY"
 "REGULAR"
 "TRANSVERSAL"
 "IDENTICALLY_SELF_DUAL"
 "BETA_INVARIANT"

Just looked at this issue because someone mentioned it during today's Zoom meeting. I like what @benlorenz described in his first post above: especially the idea of assembling all parts of a query into a single object before sending it off to the "database" (DB) for execution; also the interface which was inspired by query.jl looks simple and powerful -- here I presume that until the call to collect there was no actual processing of DB entries, but more the construction of a compound "query-object" to be handed to the DB. One doubt I have about the use of Dict to create the initial query is that the validity of the arguments is deferred to the call to find (which was only one later later... in this example). Maybe there could be a "query syntax/semantics verifier" which produces a verified query or throws an error?

oscar-system / Oscar.jl