kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Rework Catalog and TableStatisticCollection #2495

Open ray6080 opened 7 months ago

ray6080 commented 7 months ago

The problem

  1. For any write to the catalog, we need to maintain both a read and a write version of the whole catalog, which unnecessarily doubles the memory overhead.
  2. Checkpointing the catalog file triggers a rewrite of the whole file, which is also unnecessary in almost all cases.
  3. The two-version design also exists in TablesStatistics. The two components duplicate essentially the same logic without sharing the same architecture.
  4. Our current catalog lacks built-in dependency management. RelGroup is also modelled as a Table, which is not the correct level of abstraction, as it should be the parent of a group of rel tables. The same applies to RDF graphs.

Solution

In memory data structures

  1. Add the abstraction of MetaEntry. An entry can be one of the following types:
    • NODE/REL TABLE SCHEMA
    • TABLE/SCALAR/AGGREGATION FUNCTION
    • TABLE GROUP (i.e. REL GROUP)
    • RDF Graph
    • TABLE STATS
  2. Each entry should maintain its own write version (which can be extended to a version chain if multi-version support is added later).
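    class MetaEntry {
        oid_t oid;
        MetaType tableType;
        std::string name;
        // Owned child entries, e.g. the rel tables of a rel group.
        std::vector<std::unique_ptr<MetaEntry>> children;
        // Non-owning pointers to the entries this entry depends on.
        std::vector<MetaEntry*> dependencies;
        std::string comment;
        bool isDeleted;
        // Pending uncommitted version; null when there is no write.
        std::unique_ptr<MetaEntry> writeVersion;
    };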
  3. Dependencies are explicitly stored as a vector of MetaEntry pointers.
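
To make the per-entry write version concrete, here is a minimal sketch of how reads, commit and rollback could work on a single entry. It is an illustration only and not the actual Kuzu code: the method names getOrCreateWriteVersion, commit and rollback are hypothetical, the MetaType values are placeholders, and children and dependencies are left out to keep the versioning logic visible.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

using oid_t = uint64_t;

// Placeholder entry kinds, loosely following the list above.
enum class MetaType { NODE_TABLE, REL_TABLE, FUNCTION, TABLE_GROUP, RDF_GRAPH, TABLE_STATS };

// Simplified MetaEntry: children and dependencies are omitted on purpose.
struct MetaEntry {
    oid_t oid = 0;
    MetaType type = MetaType::NODE_TABLE;
    std::string name;
    std::string comment;
    bool isDeleted = false;
    std::unique_ptr<MetaEntry> writeVersion; // null when no write is pending

    // Hypothetical helper: lazily create the write version as a copy of
    // the committed fields, so only the touched entry is duplicated.
    MetaEntry& getOrCreateWriteVersion() {
        if (!writeVersion) {
            writeVersion = std::make_unique<MetaEntry>();
            writeVersion->oid = oid;
            writeVersion->type = type;
            writeVersion->name = name;
            writeVersion->comment = comment;
            writeVersion->isDeleted = isDeleted;
        }
        return *writeVersion;
    }

    // Hypothetical helper: install the pending version as the committed one.
    void commit() {
        if (!writeVersion) {
            return;
        }
        std::unique_ptr<MetaEntry> pending = std::move(writeVersion);
        name = std::move(pending->name);
        comment = std::move(pending->comment);
        isDeleted = pending->isDeleted;
    }

    // Hypothetical helper: drop the pending version without applying it.
    void rollback() { writeVersion.reset(); }
};
```

With this shape, a writer mutates `entry.getOrCreateWriteVersion()` while readers keep using the committed fields, and only entries that were actually written carry a second copy.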

On disk storage

Add the abstraction of MetaWriter and MetaReader. Internally, they make use of Serializer and Deserializer to write and read meta entries. Each entry starts at an offset in the file, and these offsets are maintained inside a DiskArray, DiskArray<PageCursor> metaDA.
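
A rough sketch of the offset-table idea follows. Everything here is an assumption made for illustration: MetaFile is an in-memory stand-in for the catalog file, a plain vector of offsets stands in for DiskArray<PageCursor> metaDA, the record layout (oid, name length, name bytes) is invented, and updating an entry by appending a new record and repointing its offset is just one possible way to avoid rewriting the whole file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// In-memory stand-in for the catalog file plus its offset table.
struct MetaFile {
    std::vector<uint8_t> bytes;
    std::vector<uint64_t> offsets; // offsets[i] = start of entry i
};

struct MetaRecord {
    uint64_t oid;
    std::string name;
};

class MetaWriter {
public:
    explicit MetaWriter(MetaFile& f) : file(f) {}

    // Append a new entry and return its index in the offset table.
    std::size_t append(uint64_t oid, const std::string& name) {
        file.offsets.push_back(writeRecord(oid, name));
        return file.offsets.size() - 1;
    }

    // Rewrite a single entry: append the new record and repoint its
    // offset; other entries are left untouched.
    void update(std::size_t idx, uint64_t oid, const std::string& name) {
        file.offsets.at(idx) = writeRecord(oid, name);
    }

private:
    uint64_t writeRecord(uint64_t oid, const std::string& name) {
        uint64_t start = file.bytes.size();
        writeValue(oid);
        writeValue(static_cast<uint64_t>(name.size()));
        file.bytes.insert(file.bytes.end(), name.begin(), name.end());
        return start;
    }
    void writeValue(uint64_t v) {
        const auto* p = reinterpret_cast<const uint8_t*>(&v);
        file.bytes.insert(file.bytes.end(), p, p + sizeof(v));
    }
    MetaFile& file;
};

class MetaReader {
public:
    explicit MetaReader(const MetaFile& f) : file(f) {}

    // Look up the entry's offset, then decode the record stored there.
    MetaRecord read(std::size_t idx) const {
        uint64_t pos = file.offsets.at(idx);
        MetaRecord r;
        r.oid = readValue(pos);
        uint64_t len = readValue(pos);
        r.name.assign(reinterpret_cast<const char*>(file.bytes.data() + pos), len);
        return r;
    }

private:
    uint64_t readValue(uint64_t& pos) const {
        uint64_t v;
        std::memcpy(&v, file.bytes.data() + pos, sizeof(v));
        pos += sizeof(v);
        return v;
    }
    const MetaFile& file;
};
```

In this sketch, old records become garbage after an update, so a real design would also need some form of space reclamation, which is not addressed here.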

semihsalihoglu-uw commented 5 months ago

@ray6080: I had a different solution for making all updates to catalog objects transactional. My solution was to model every database object as a system-level table (e.g., _Sys_Catalog) that is not visible to the user, and to update these tables with the same logic we use for updating regular tables. This is briefly mentioned in the "Further Considerations" part of issue 2529.