simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0
843 stars, 55 forks

Support Apache Tika server for extracting metadata #445

Closed Doc-Steve closed 8 months ago

Doc-Steve commented 8 months ago

Which SIST2 component is your Feature Request related to? Scan

Is your feature request related to a problem? Please describe. The set of indexed file formats is limited, and I presume that adding and maintaining support for more formats is a tedious programming task. Currently I mainly use SIST2 to index PDFs. It works really well, and I would like to see other file formats supported too, such as mbox.

What would you like to see happen? Please support the use of an Apache Tika Server (http://tika.apache.org). The server itself could easily be integrated into the docker-compose file (see Docker Hub). Its REST API can extract metadata from over a thousand different file types, is permanently maintained by Apache, offers OCR integration, and so on. Unfortunately I found no C library for the API, but there is, for example, Tika Python, which could perhaps be used as a template.
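To illustrate what talking to the server's REST API might look like, here is a minimal Python sketch using only the standard library. It assumes a Tika server is already running on `localhost:9998` (Tika's default port) and that, as far as I understand the Tika server documentation, a `PUT` to `/meta` with an `Accept: application/json` header returns the file's metadata as JSON. The names `TIKA_URL`, `build_meta_request` and `extract_metadata` are made up for this example and are not part of SIST2 or Tika Python.

```python
import json
import sys
import urllib.request

# Assumption: a Tika server is reachable at this address (9998 is Tika's default port).
TIKA_URL = "http://localhost:9998"


def build_meta_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request to the /meta endpoint, asking for JSON metadata."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/meta",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract_metadata(path: str, base_url: str = TIKA_URL) -> dict:
    """Send the file at `path` to the Tika server and return the parsed metadata."""
    with open(path, "rb") as f:
        data = f.read()
    request = build_meta_request(data, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


if __name__ == "__main__":
    # Usage: python tika_meta.py some_file.mbox
    metadata = extract_metadata(sys.argv[1])
    for key, value in sorted(metadata.items()):
        print(f"{key}: {value}")
```

A scanner would make one such HTTP round trip per file, which is the per-file cost that would have to be weighed against native parsing.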

Thanks in advance for considering!

simon987 commented 8 months ago

Hi, for performance reasons it's not feasible to use Tika. Other than mbox, which formats are missing? Feel free to open an issue for each missing format.