paradisec-archive / nabu

nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items.
GNU General Public License v3.0

Add contents of text-like files to search engine #757

Open johnf opened 2 months ago

johnf commented 2 months ago

Now that we've switched to OpenSearch, we should start indexing the contents of text-like files/essences (e.g. ELAN, PDF, RTF, DOC) and add these to the search engine.

nthieberger commented 2 months ago

Does that mean a search within NABU would also find text within files? That is the virtue of the ROCrate solution, so it is interesting if it could also be available in NABU ... but maybe it is best left to the new version if it will take effort to include in NABU?

johnf commented 2 months ago

Yes, my thinking is that whenever an item that can be converted to text is uploaded, we add it to OpenSearch and make the text searchable. I suspect I'd add it as part of the ingest pipeline.

It will need some thinking about the proper workflow and how to get it right, but it wouldn't be particularly difficult.

Given that we are probably only talking about gigabytes of text, it might even make sense to add it to the database to make reindexing trivial.
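
As a rough illustration of the ingest step described above, here is a minimal Python sketch, not nabu's actual code. The names `extract_text`, `build_index_document`, and the `essence_id`/`content` fields are hypothetical. It only handles ELAN `.eaf` files (XML whose annotation text lives in `ANNOTATION_VALUE` elements) and plain `.txt`; PDF, RTF, and DOC would need a real converter. It returns a plain dict that could be sent to the search index or stored in a database column so reindexing does not need to re-parse the files.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePath
from typing import Optional


def extract_text(filename: str, data: bytes) -> Optional[str]:
    """Return the searchable text of a text-like file, or None if unsupported.

    Hypothetical helper: only .eaf and .txt are handled in this sketch.
    """
    ext = PurePath(filename).suffix.lower()
    if ext == ".eaf":
        # ELAN files are XML; the transcribed text is in ANNOTATION_VALUE elements.
        root = ET.fromstring(data)
        values = [
            el.text.strip()
            for el in root.iter("ANNOTATION_VALUE")
            if el.text and el.text.strip()
        ]
        return "\n".join(values)
    if ext == ".txt":
        return data.decode("utf-8", errors="replace")
    return None  # not a format this sketch knows how to convert


def build_index_document(essence_id: int, filename: str, data: bytes) -> Optional[dict]:
    """Build the document to index (field names are assumptions)."""
    text = extract_text(filename, data)
    if text is None:
        return None
    return {"essence_id": essence_id, "filename": filename, "content": text}
```

Files that cannot be converted simply yield `None`, so the ingest pipeline can skip them without failing the upload.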

johnf commented 2 months ago

Probably not something to work on straight away, but I created some new issues as I was cleaning up GitHub yesterday.