irthomasthomas / undecidability

1 stars 0 forks source link

Bibliothecula: Using sqlite3 as a Notekeeping Document Graph with Automatic Reference Indexing #779

Open irthomasthomas opened 1 month ago

irthomasthomas commented 1 month ago

Bibliothecula: Using sqlite3 as a Notekeeping Document Graph with Automatic Reference Indexing

Snippet

bibliothecula
bibliothecula

Using sqlite3 as a notekeeping document graph with automatic reference indexing

2021 June 29 by epilys go back

Instructions

IMPORTANT: ADD NO ADDITIONAL COMMENTARY OR TEXT OF ANY KIND, except that which is needed to sensibly render the document.

Content

TL;DR:

The full schema required for this article's examples is here

There's a web demo of sociologist Niklas Luhmann's zettelkasten you can explore online using sql.js, sqlite3 compiled to webassembly (total compressed asset size: 16MB). source code

The full-text functionality of sqlite3 along with the powerful SQL indexing and trigger features allows us to easily keep notes with references in an sqlite3 database. In this document I present a workflow for doing so.

First, let's examine the pros and cons of this workflow:

Pros:

Cons:

The schema

side-note: How to create a new database by reading the creation statements from a file:

$ sqlite3 notes.db # create new database
sqlite> .read schema.sql
sqlite> -- inspect the new schema:
sqlite> .schema

You can use anything you like as long as it has a basic property: your notes table must have a unique id that you can reference in plain text.

For this demo, I use the bibliothecula schema which has UUIDs for primary keys and allows you to tag or add other arbitrary metadata (and files) to each document. In this model, the document is our notes collection and the files of this document can include plain text ones that are our notes.

side-note: sqlite3 doesn't have a built-in way to produce UUIDs. There's an official uuid.c extension you can compile and load but that's all. There might be a way to generate UUIDs natively using the provided randomness functions.

The sqlite3 documentation for randomblob states:

randomblob(N)

The randomblob(N) function return an N-byte blob containing
pseudo-random bytes. If N is less than 1 then a 1-byte
random blob is returned.

Hint: applications can generate globally unique
identifiers using this function together with
hex() and/or lower() like this:

  hex(randomblob(16))

  lower(hex(randomblob(16)))

Keep in mind that this is NOT a UUID. UUIDs are not just random numbers, they have some structure. You can easily generate UUIDs with stock python3:

$ python3
>>> from uuid import *
>>> uuid4().hex # we don't want hyphens in our UUIDs
'f1272b12b2174e3aa0e7c05610592ac0'

The table used for files in bibliothecula is BinaryMetadata; since it's binary it can also hold plain text data. This is the CREATE statement for BinaryMetadata:

CREATE TABLE IF NOT EXISTS "BinaryMetadata" ( "uuid" CHARACTER(32) NOT NULL PRIMARY KEY, "name" TEXT NULL, "data" BLOB NOT NULL, "compressed" BOOLEAN NOT NULL DEFAULT (0), "created" DATETIME NOT NULL DEFAULT (strftime ('%Y-%m-%d %H:%M:%f', 'now')), "last_modified" DATETIME NOT NULL DEFAULT (strftime ('%Y-%m-%d %H:%M:%f', 'now')), CONSTRAINT uniqueness UNIQUE ("name", "data") );

The name column can hold our filename. What about mime type? Furthermore, what if I want to know the size of a file, do I have to calculate the data length every time? †

†. By the way, doing LENGTH(data) for non-BLOB columns is wrong, because they may include NUL bytes. Always do LENGTH(CAST(col AS BLOB)) for TEXT.

The default sqlite distribution includes the JSON1 extension which allows us to place structured data in a column, so I chose to store filename, mime type and size in bytes in the name column. Examples:

[{"content_type":"text/markdown","filename":"2021-06-20.md","size":0}, {"content_type":"text/markdown","filename":"dataintegrity.md","size":566}, {"content_type":"text/markdown","filename":"exports.md","size":34}, {"content_type":"text/markdown","filename":"generate_tool.md","size":229}, {"content_type":"text/markdown","filename":"shell.md","size":240}, {"content_type":"text/plain","filename":"","size":97632}, {"content_type":"text/plain","filename":"test.txt","size":91}]

Again, this is only for convenience. Our notes don't have to have filenames if they already have a unique identifier, and there's no restriction for filename UNIQUENESS anywhere.

You can create JSON objects with the json_object SQL function, and extract fields with the json_extract SQL function:

SELECT json_extract(name, '$.content_type') FROM BinaryMetadata WHERE json_valid(name) LIMIT 1;
INSERT INTO BinaryMetadata(uuid,name,data) VALUES ('623fec5beac242fcb0b0d17ada20e2b5',json_object('content_type','text/plain','filename','file.txt','size',LENGTH(readfile('file.txt'))),readfile('file.txt'));

Note the use of json_valid to ignore non-JSON names, and also the use of readfile: this is a CLI-only function allowing you to read files as BLOBs. We can use it to quickly attach files to our note database.

The indices

Full-text search

I will use the fts5 extension, included by default nowadays in sqlite3. To create an fts5 index, I issue:

CREATE VIRTUAL TABLE IF NOT EXISTS zettel_fts USING fts5(title, filename, full_text, uuid UNINDEXED);

Note that this doesn't seem limited to our text notes; indeed I can produce the full text of other attached binary files like PDFs and index them too, or maybe at a dedicated fts5 table as well.

The fts5 index needs to be filled manually by us, and we can use SQL triggers to automate this.

An INSERT trigger for BinaryMetadata might look like:

CREATE TRIGGER fts_insert AFTER INSERT ON BinaryMetadata WHEN json_valid(NEW.name) BEGIN INSERT INTO zettel_fts(uuid, title, filename, full_text) VALUES (NEW.uuid, NEW.name, json_extract(NEW.name, '$.filename'), NEW.data); END;

I insert some dummy values:

INSERT INTO BinaryMetadata(uuid,name,data) VALUES  ('623fec5beac242fcb0b0d17ada20e2b5', json_object('content_type','text/plain','filename','file.txt','size',5), 'sun bicycle trigger journal'),  ('37a3ff02c8cd4d7fb3280e5b160d1389', json_object('content_type','text/plain','filename','book_ref.md','size',1), 'I have no references and I must scream'),  ('b0697d8d76ae41bf8e942d505aff8963', json_object('content_type','text/plain','filename','note.md','size',1), 'I refer to 623fec5b-eac2-42fc-b0b0-d17ada20e2b5 and also 37a3ff02c8cd4d7fb3280e5b160d1389');

Querying the index is as simple as SELECTing from it:

SELECT uuid, snippet(zettel_fts, -1, '<mark>', '</mark>', '[...]', 10) AS snippet FROM zettel_fts('journal');
uuid snippet
623fec5beac242fcb0b0d17ada20e2b5 sun bicycle trigger journal

Read the fts5 documentation here.

Reference index

First we need a way to recognize UUIDs in text. For this purpose I create a text tokenizer using the fts3 text tokenizers that spouts tokens that look like UUIDs:

CREATE VIRTUAL TABLE IF NOT EXISTS uuidtok USING fts3tokenize( 'unicode61', "tokenchars=-1234567890abcdefABCDEF", "separators= " );

The UUIDs are spouted when you query the tokenizer. Querying a tokenizer in general is done with a special SELECT:

SELECT token FROM uuidtok where input = 'sun bicycle trigger journal';

token

sun bicycle trigger journal

Now, to get stuff that look like UUIDs from the tokenizer:

SELECT DISTINCT REPLACE(token, '-', '') as ref FROM uuidtok WHERE input = (SELECT data FROM BinaryMetadata WHERE uuid = 'b0697d8d76ae41bf8e942d505aff8963') AND LENGTH(REPLACE(token, '-', '')) = 32

This returns:

ref

623fec5beac242fcb0b0d17ada20e2b5 37a3ff02c8cd4d7fb3280e5b160d1389

Note the use of REPLACE to exclude any hyphens from our processing.

Now we can create a reference index that we can update on insert/update/delete with triggers:

CREATE VIRTUAL TABLE refs_fts USING fts5(referrer, target);

We can make triggers that use the SELECT DISTINCT above along with a check that the reference target exists by adding

AND EXISTS (select 1 from BinaryMetadata WHERE uuid = REPLACE(token, '-', ''))

By having two columns in refs_fts, referrer and target we can get all references inside a note and all back references from other notes.

Examples

INSERT INTO refs_fts(target, referrer) SELECT DISTINCT REPLACE(tok.token, '-', '') AS target, b.uuid AS referrer FROM uuidtok AS tok, (SELECT uuid, data, json_extract(name, '$.content_type') AS _type FROM BinaryMetadata WHERE json_valid(name) AND _type LIKE "%text/%") AS b WHERE tok.input=b.data AND LENGTH(REPLACE(tok.token, '-', '')) = 32 AND EXISTS (SELECT * FROM BinaryMetadata WHERE uuid = REPLACE(tok.token, '-', ''));

SELECT DISTINCT referrer FROM refs_fts WHERE target = '623fec5beac242fcb0b0d17ada20e2b5';

referrer

b0697d8d76ae41bf8e942d505aff8963

SELECT DISTINCT target FROM refs_fts WHERE referrer = 'b0697d8d76ae41bf8e942d505aff8963';

target

623fec5beac242fcb0b0d17ada20e2b5 37a3ff02c8cd4d7fb3280e5b160d1389

Miscellanea

We can read text from the sqlite3 CLI by just SELECTing the data. To save the data, text or any binary into a file use the writefile CLI function:

SELECT writefile(json_extract(name, '$.filename'),data) FROM BinaryMetadata WHERE uuid = '623fec5beac242fcb0b0d17ada20e2b5';

To insert a file as a BLOB, use the readfile CLI function (example from sqlite3 documentation:

INSERT INTO images(name,type,img) VALUES('icon','jpeg',readfile('icon.jpg'));

To edit a file, use edit() (again, CLI only):

UPDATE pics SET img=edit(img,'gimp') WHERE id='pic-1542';

Yes, that means you can use your editor of choice without problem. You can also just view any file with edit() by selecting it without doing any UPDATE.

sqlite supports Common Table Expressions, an SQL standard that allows you to query hierarchical relationships like nodes in a graph. That means you can easily find all the notes you can reach with references and back references. For info see the documentation

The full schema required for this article's examples is here

There's a web demo of sociologist Niklas Luhmann's zettelkasten you can explore online using sql.js, sqlite3 compiled to webassembly (total compressed asset size: 16MB). source code

Epilogue

You can check out the bibliothecula project if you are interested in small tools to support tagged storage inside sqlite3 databases.

Updated 2021 June 29: Discussion on lobste.rs.

Task

Suggested Labels

Suggested labels

{'label-name': 'Notekeeping', 'label-description': 'Using sqlite3 for notekeeping with automatic indexing and reference management.', 'confidence': 57.62}

irthomasthomas commented 1 month ago

Related content

74 similarity score: 0.86

754 similarity score: 0.85

545 similarity score: 0.84

695 similarity score: 0.84

678 similarity score: 0.84

325 similarity score: 0.83