simonw / google-drive-to-sqlite

Create a SQLite database containing metadata from Google Drive
https://datasette.io/tools/google-drive-to-sqlite
Apache License 2.0
152 stars, 13 forks

Mechanism for populating a FTS index with text exported from Google Docs #28

Open simonw opened 2 years ago

simonw commented 2 years ago

Both Google Docs and Google Presentations support a plain text export format (now implemented as part of #21).

A tool for extracting that text and using it to populate an FTS index, or just another database table or column, would be really interesting.

simonw commented 2 years ago

A command called index could do the trick. Or a --index option to files. Or both!

These could populate a drives_index table which just has file_id and text columns.
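As a rough illustration of that table shape, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the plain text has already been exported and is available as a dict of file_id to text. The function names, the drives_index_fts table, and the choice of FTS5 are assumptions made for this example, not part of the tool.

```python
import sqlite3


def populate_index(db, exported):
    """Store (file_id, text) pairs in drives_index and rebuild a search index.

    `exported` is a hypothetical dict mapping file_id -> plain text.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS drives_index "
        "(file_id TEXT PRIMARY KEY, text TEXT)"
    )
    db.executemany(
        "INSERT OR REPLACE INTO drives_index (file_id, text) VALUES (?, ?)",
        exported.items(),
    )
    # Assumption: SQLite was compiled with FTS5, which is common but not guaranteed.
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS drives_index_fts "
        "USING fts5(file_id UNINDEXED, text)"
    )
    # Simplest way to stay in sync: rebuild the index from the source table.
    db.execute("DELETE FROM drives_index_fts")
    db.execute(
        "INSERT INTO drives_index_fts (file_id, text) "
        "SELECT file_id, text FROM drives_index"
    )
    db.commit()


def search(db, query):
    """Return matching file_ids, best match first."""
    rows = db.execute(
        "SELECT file_id FROM drives_index_fts "
        "WHERE drives_index_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Rebuilding the whole index on every run keeps the sketch short; triggers or an external-content FTS table would avoid the full rebuild for larger collections.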

This could combine with --apps from #30:

google-drive-to-sqlite files starred-searchable.db --apps --starred --index

simonw commented 2 years ago

The HTML version of a Google Doc is actually pretty small:

google-drive-to-sqlite export html 10t3iuUppkbfLcRrzryswNriakHqCk9S-r2KPefHSQds

Gave me an 84KB HTML file - the images were all references to images hosted on Google content servers.

So maybe this isn't just about search indexing after all - it's about exporting content to a SQLite database table.

simonw commented 2 years ago

I'm tempted to add an option to google-drive-to-sqlite export which writes the exported binary content back to the database.

Then a separate mechanism for the indexing of the plain text.
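One way that split might look, sketched with Python's built-in sqlite3 module: exported bytes are written to a table as-is, and a later step pulls out only the plain text rows for indexing. The drive_exports table, its columns, the function names, and the "txt" format label are all hypothetical names chosen for this example.

```python
import sqlite3


def save_export(db, file_id, fmt, content):
    """Write exported bytes to a hypothetical drive_exports table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS drive_exports ("
        "file_id TEXT, format TEXT, content BLOB, "
        "PRIMARY KEY (file_id, format))"
    )
    db.execute(
        "INSERT OR REPLACE INTO drive_exports (file_id, format, content) "
        "VALUES (?, ?, ?)",
        (file_id, fmt, content),
    )
    db.commit()


def plain_text_rows(db):
    """Return (file_id, text) pairs for stored plain text exports.

    This is the input a separate indexing step would consume.
    """
    rows = db.execute(
        "SELECT file_id, content FROM drive_exports "
        "WHERE format = 'txt' ORDER BY file_id"
    )
    # utf-8-sig drops a leading byte order mark if one is present.
    return [(file_id, bytes(content).decode("utf-8-sig")) for file_id, content in rows]
```

Keeping the raw bytes in one table means HTML, PDF, and text exports can share the same storage, while only the text rows feed the search index.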