simonw / mbox-to-sqlite

Load email from .mbox files into SQLite
Apache License 2.0
27 stars 5 forks source link

Normalizing email headers in SQLite #7

Open knowtheory opened 4 months ago

knowtheory commented 4 months ago

Hey everybody,

I've been spending some time over the past week trying to use mbox-to-sqlite to import my 57gb mbox from GMail. After a bunch of tinkering I've created an sqlite db for myself, but wanted to talk w/ users of this library (or just Simon) about some suggested enhancements, and whether the changes i made would be worth contributing.

Email headers are mixed case, and SQL is not.

As noted in #5 and #6, running mbox-to-sqlite will fail for arbitrary mbox files with a duplicate column error. These failures are caused by the fact that SQL and SQLite's column space are single case while email header keys are mixed case. Attempting to insert mixed case will result in these duplicate column errors.

So what do?

Should mbox-to-sqlite normalize header keys?

I'd posit yes. My personal solution was to preprocess each message that mailbox produced prior to upserting them as mbox-to-sqlite does. The normalization I chose was to map mix-cased keys to lowercase, and merge the values of any colliding keys.

Merging values seems fine for my purposes (search & retrieval). I don't need to trace individual headers back for forensic purposes or the like.

SQLite has a maximum column limit.

After deciding on a strategy for avoiding column naming conflicts, I discovered that SQLite has a limit to the number of columns allowed in a single table. SQLITE_MAX_COLUMNS is a compile time setting, and defaults to 2000. Since this setting is determined at compile time I assume it to be immutable for all practical intents and purposes.

This is a problem because the set of columns that mbox-to-sqlite builds can be arbitrarily large, since it's the union of all of the email header keys contained in the mbox. In my case, 57gb of email contained 4040 unique headers (3574 if mapped to lower case).

Should mbox-to-sqlite provide a (configurable) fixed set of columns?

If we treat SQLITE_MAX_COLUMNS as fixed there's no practical solution other than limiting the number of columns that mbox-to-sqlite produces. The question shifts to what columns to include, and what to do with remaining headers.

For myself I counted the frequency of headers in my mail archive and used that to filter out a list of headers with 1000 entries or more. I used that list of columns to set up my messages table, and filter the keys for each record dict. Any header not in the list gets put into a dict that gets dumped as a JSON string into an additional_headers field.

It shouldn't be hard to automate scanning the headers to calculate the distribution and suggest what an appropriate cut off point might be.

Anyway, if any of these contributions would be helpful, let me know and i can fire up a pr.

knowtheory commented 4 months ago

p.s. SQLite limits aren't accessible via SQL, but Stack Overflow provides an insane but relatively quick way to find SQLITE_MAX_COLUMNS