Open simonw opened 2 years ago
A command called index could do the trick. Or a --index option to files. Or both!
These could populate a drives_index table which just has file_id and text columns.
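A minimal sketch of what that table could look like, using Python's standard sqlite3 module. Only the drives_index table name and the file_id and text columns come from the proposal above; the populate_drives_index helper, the primary key choice, and the sample rows are assumptions for illustration, not how the tool works.

```python
import sqlite3


def populate_drives_index(conn, documents):
    """Store (file_id, text) pairs in a drives_index table.

    Hypothetical helper: making file_id the primary key is an assumption,
    chosen so that re-indexing a file overwrites its earlier text.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS drives_index ("
        "file_id TEXT PRIMARY KEY, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO drives_index (file_id, text) VALUES (?, ?)",
        documents,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Made-up sample data standing in for exported document text.
populate_drives_index(
    conn,
    [("file-1", "Meeting notes"), ("file-2", "Roadmap")],
)
rows = conn.execute(
    "SELECT file_id, text FROM drives_index ORDER BY file_id"
).fetchall()
```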
This could combine with --apps from #30:
google-drive-to-sqlite files starred-searchable.db --apps --starred --index
The HTML version of a Google Doc is actually pretty small:
google-drive-to-sqlite export html 10t3iuUppkbfLcRrzryswNriakHqCk9S-r2KPefHSQds
That gave me an 84KB HTML file. The images were all references to images hosted on Google content servers.
So maybe this isn't just about search indexing after all - it's about exporting content to a SQLite database table.
I'm tempted to add an option to google-drive-to-sqlite export which writes the exported binary content back to the database.
Then a separate mechanism for the indexing of the plain text.
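Writing exported content back to the database could be as simple as a BLOB column. This is only a sketch under assumptions: the drive_exports table, its columns, and the save_export helper are hypothetical names, and the bytes below are a stand-in for a real export.

```python
import sqlite3


def save_export(conn, file_id, export_format, content):
    """Store one exported file as a BLOB, keyed by file ID and format.

    Hypothetical schema: drive_exports is not a table the tool is known
    to create.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS drive_exports ("
        "file_id TEXT, format TEXT, content BLOB, "
        "PRIMARY KEY (file_id, format))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO drive_exports (file_id, format, content) "
        "VALUES (?, ?, ?)",
        (file_id, export_format, content),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Stand-in bytes; a real export would come from the Drive API.
save_export(conn, "file-1", "html", b"<html><body>Hello</body></html>")
stored = conn.execute(
    "SELECT content FROM drive_exports WHERE file_id = ? AND format = ?",
    ("file-1", "html"),
).fetchone()[0]
```

Keying on both file ID and format would let the same document be stored as HTML and as plain text side by side.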
Both Google Docs and Google Presentations support a plain text export format (now implemented as part of #21).
A tool for extracting that text and using it to populate an FTS index - or just another database table or column - would be really interesting.
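As a rough illustration of the FTS idea, here is a sketch using SQLite's FTS5 extension from Python's sqlite3 module. It assumes the SQLite build includes FTS5; the drives_index_fts table name, the search helper, and the sample text are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes FTS5 is compiled into this SQLite build.
# file_id is stored but not tokenized; only the text column is searchable.
conn.execute(
    "CREATE VIRTUAL TABLE drives_index_fts "
    "USING fts5(file_id UNINDEXED, text)"
)
# Made-up plain text standing in for exported Google Docs content.
conn.executemany(
    "INSERT INTO drives_index_fts (file_id, text) VALUES (?, ?)",
    [
        ("file-1", "Meeting notes from Monday"),
        ("file-2", "Quarterly roadmap for the search feature"),
    ],
)


def search(conn, query):
    """Return matching file IDs, best match first (hypothetical helper)."""
    cursor = conn.execute(
        "SELECT file_id FROM drives_index_fts "
        "WHERE drives_index_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cursor]


matches = search(conn, "roadmap")
```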