luposlip / nd-db

Clojure library exposing newline delimited files as lightning fast databases
Apache License 2.0

nd-db metadata v2 #14

Closed: luposlip closed this 1 year ago

luposlip commented 1 year ago

Right now (v0.8.0) lazy-docs simply uses the document IDs from the database index to create a lazy seq of documents. This works fine, but the drawback is that the full index has to be in memory.

The index is stored in binary form as part of an EDN document. All of this has to be parsed into memory before the index can be used to produce the lazy sequence of database documents.
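For illustration, here's roughly what the v1 read amounts to. The real file stores the index in binary inside the EDN document, and the key names here are assumptions, but the effect is the same: everything is parsed up front.

;; sketch of v1: the whole .nddbmeta document, including the full
;; index, is parsed eagerly before any document can be read
(require '[clojure.edn :as edn])

(defn read-v1-meta [nddbmeta-path]
  (edn/read-string (slurp nddbmeta-path)))

(comment
  ;; with millions of documents this realizes millions of index
  ;; entries, just to hand out IDs one by one
  (:index (read-v1-meta "huge-db.nddbmeta")))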

A lazier way of reading the database documents would be to store the index entries as separate lines of text in the .nddbmeta files. A line-seq over the index file would then remove the need to read the entire index into memory first.
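As a sketch of that idea, assuming a v2 layout with the general metadata on line 1 and one EDN index entry per following line (process-entry is a hypothetical callback):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

(with-open [rdr (io/reader "huge-db.nddbmeta")]
  (let [[_meta & entries] (line-seq rdr)]
    ;; entries are parsed and consumed one line at a time,
    ;; so the full index never has to be realized in memory
    (doseq [entry (map edn/read-string entries)]
      (process-entry entry))))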

This doesn't matter for small databases, but for huge databases with millions of documents it makes a huge difference not to hold the entire index in memory "just" to go through it one by one.

"Since the processing order doesn't matter to me why don't we just make a line-seq over the database file itself?" you ask. "Good question" I say.

The answer lies in the potential need to parallelize your workload (i.e. your processing pipeline) via clojure.core.reducers or other means. To parallelize properly, you need the full set of input data realized first (i.e. stuffed into a vector). If you try to stuff millions of huge documents into memory, you'll probably run out of it in a snap.

Instead you read only the IDs from the index, stuff them into a vector, and parallelize by querying the database ID by ID. This works, but if you have a lot of CPU cores and a slow hard disk, the read speed of the disk will probably become the bottleneck. You need a snappy SSD!
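Here's a sketch of that approach using clojure.core.reducers, assuming nd-db.core/q is the by-ID lookup and that :index maps document IDs to their offsets; the :amount summing is purely illustrative:

(require '[clojure.core.reducers :as r]
         '[nd-db.core :as nddb])

;; realize only the IDs (cheap), then fold in parallel,
;; fetching each full document by ID as it's needed
(defn sum-amounts [nd-db]
  (r/fold +'
          (fn [acc id]
            (+' acc (:amount (nddb/q nd-db id))))
          (vec (keys (:index nd-db)))))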

On the other hand, if you don't parallelize, you can simply line-seq over the nd-db database file itself and process the documents one by one.
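For an .ndjson database that could look like the following sketch (the file name, the JSON parsing via clojure.data.json, and process-doc are all assumptions):

(require '[clojure.java.io :as io]
         '[clojure.data.json :as json])

(with-open [rdr (io/reader "huge-db.ndjson")]
  (doseq [line (line-seq rdr)]
    ;; only one document is in memory at a time
    (process-doc (json/read-str line :key-fn keyword))))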

The result of this task should be to:

  1. (BREAKING) change the format of the .nddbmeta files to contain the general metadata on the first line and the database index on the following lines, one entry per line (see the layout sketch after this list)
  2. make lazy-docs, when called with an nd-db parameter, realize the :index and return a lazy seq of the docs
  3. add a new nd-db.index/reader that can be used to read the :index lazily (instead of having to realize it fully before returning the lazy seq of documents). This only works with the new v0.9.0+ versions of the .nddbmeta files, and is ideal for huge databases with millions of documents.
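To make item 1 concrete, a v2 .nddbmeta file could look like the sketch below. The metadata keys and the [id offset length] entry shape are assumptions; only the one-entry-per-line structure is prescribed above.

;; line 1: general metadata for the database
{:version "0.9.0" :doc-type :nippy}
;; line 2 onwards: one index entry per line
["id-1" 0 127]
["id-2" 127 243]
["id-3" 370 198]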

NB: This new version of lazy-docs, using the index reader, only works for .ndnippy files (at least for now)!

Here's an example of how to use the new lazy-docs, where it's up to the caller to use with-open on the nd-db.index/reader:

;; get top 10 docs after dropping a lot, filtering and sorting the rest:
(with-open [r (nd-db.index/reader nd-db)]
  (->> nd-db
       (lazy-docs r)
       (drop 1000000)
       (filter (comp pos? :amount))
       (sort-by :priority)
       (take 10)))
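Note that the whole pipeline has to be consumed before with-open closes the reader, since the seq reads lazily from it; returning an unrealized seq from the with-open block would fail.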
luposlip commented 1 year ago

Closed with 47c1a1a and deployed with v0.9.0-alpha6.