simonw / sfms-history

The sfms-history project
https://sfms-history.vercel.app
6 stars 1 forks source link

Stable URLs for pages and documents #9

Closed simonw closed 2 years ago

simonw commented 2 years ago

Currently each scanned page has a URL that uses the rowid from the database: https://sfms-history.vercel.app/page/2930

This is risky: it's hard to guarantee that these URLs will stay stable as the database changes in the future.

Using URLs based on the filename is risky too, since the file and folder may change as the Google Drive is re-arranged.

simonw commented 2 years ago

One way to solve this would be to use the ETag (or truncated ETag) of the PDF document - this is based on the content hash, so it shouldn't change when the file is moved around (though I should test that to make sure).

If I do that, the following URLs could work:

simonw commented 2 years ago

I could use the same SEO hack that Eventbrite uses: https://www.eventbrite.com/e/foodieland-night-market-san-mateo-september-23-25-tickets-275309707747 - where the unique ID is after the last - but text from the PDF file name is included in the URL. I don't think it's worth having uglier URLs for better SEO in this case though.

simonw commented 2 years ago

I think the latest version of the database schema produced by s3-ocr should have the details I need to implement this (the database I'm using for sfms-history right now is an earlier one created by hand in a Jupyter notebook).

Running this command now to re-build the database:

% s3-ocr index sfms-history /tmp/index.db
Fetching job details  [####################################]  100%          
Populating pages table  [##----------------------------------]    8%  00:06:13
simonw commented 2 years ago

Here's the schema of that database:

CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE VIRTUAL TABLE [pages_fts] USING FTS5 (
    [text],
    content=[pages]
);
CREATE TABLE 'pages_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'pages_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
CREATE TABLE 'pages_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'pages_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TRIGGER [pages_ai] AFTER INSERT ON [pages] BEGIN
  INSERT INTO [pages_fts] (rowid, [text]) VALUES (new.rowid, new.[text]);
END;
CREATE TRIGGER [pages_ad] AFTER DELETE ON [pages] BEGIN
  INSERT INTO [pages_fts] ([pages_fts], rowid, [text]) VALUES('delete', old.rowid, old.[text]);
END;
CREATE TRIGGER [pages_au] AFTER UPDATE ON [pages] BEGIN
  INSERT INTO [pages_fts] ([pages_fts], rowid, [text]) VALUES('delete', old.rowid, old.[text]);
  INSERT INTO [pages_fts] (rowid, [text]) VALUES (new.rowid, new.[text]);
END;
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);

I'm going to use some CREATE TABLE x AS SELECT ... queries against that to build a better database schema for this project.

simonw commented 2 years ago

New table design:

documents:

pages:

simonw commented 2 years ago

This almost worked:

sqlite-utils sfms.db --attach index2 index.db "$(cat <<EOF
create table documents as select
  substr(s3_ocr_etag, 2, 8) as id,
  key as title,
  key as path,
  replace(s3_ocr_etag, '"', '') as etag
from
  index2.ocr_jobs;
EOF
)"

But https://www.sqlite.org/lang_createtable.html says:

A table created using CREATE TABLE AS has no PRIMARY KEY and no constraints of any kind

simonw commented 2 years ago
sqlite-utils create-table sfms.db documents \
  id text \
  title text \
  path text \
  etag text \
  --pk id

sqlite-utils sfms.db --attach index2 index.db "$(cat <<EOF
insert into documents select
  substr(s3_ocr_etag, 2, 8) as id,
  key as title,
  key as path,
  replace(s3_ocr_etag, '"', '') as etag
from
  index2.ocr_jobs;
EOF
)"
simonw commented 2 years ago
sqlite-utils create-table sfms.db pages \
  id text \
  document_id text \
  page integer \
  text text \
  --pk id
sqlite-utils sfms.db --attach index2 index.db "$(cat <<EOF
insert into pages select
  substr(s3_ocr_etag, 2, 8) || '-' || page as id,
  substr(s3_ocr_etag, 2, 8) as document_id,
  page,
  text
from index2.pages join index2.ocr_jobs on index2.pages.path = index2.ocr_jobs.key;
EOF
)"
simonw commented 2 years ago

I decided to have the ID of the pages as a string constructed as 02209f4f-2 combining the document ID and the page number. I could have used a compound primary key here but using a single primary key makes it easier to foreign key other stuff against it in the future.

simonw commented 2 years ago
sqlite-utils add-foreign-key sfms.db pages document_id documents id
simonw commented 2 years ago

Fixing up the document titles:

sfms-history % sqlite-utils convert sfms.db documents title 'value.split("/")[-1].split(".pdf")[0].replace("_", " ")'

Here's the GPT-3 explanation:

What does this Python do?

value.split("/")[-1].split(".pdf")[0].replace("_", " ")

This Python code takes a string value and splits it into two pieces at the "/" character. It then takes the last piece of the split string and splits it into two pieces at the "." character. Finally, it replaces all "_" characters in the last piece with " " characters.

simonw commented 2 years ago

And enable FTS again:

sqlite-utils enable-fts sfms.db pages text
simonw commented 2 years ago

Here's the new schema:

CREATE TABLE [documents] (
   [id] TEXT PRIMARY KEY,
   [title] TEXT,
   [path] TEXT,
   [etag] TEXT
);
CREATE TABLE [pages] (
   [id] TEXT PRIMARY KEY,
   [document_id] TEXT,
   [page] INTEGER,
   [text] TEXT,
   FOREIGN KEY([document_id]) REFERENCES [documents]([id])
);
CREATE VIRTUAL TABLE [pages_fts] USING FTS5 (
    [text],
    content=[pages]
);
CREATE TABLE 'pages_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'pages_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
CREATE TABLE 'pages_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'pages_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
simonw commented 2 years ago

Got a TIL out of this: https://til.simonwillison.net/zsh/argument-heredoc

simonw commented 2 years ago

Pushed my work so far to a branch.