piql / insight

Archival packages insight application
GNU General Public License v3.0

Index content of plain text files too #6

Closed: petterreinholdtsen closed this 4 years ago

petterreinholdtsen commented 4 years ago

Add a new DAttachmentIndexer() helper class, DText. It reads plain text files and adds their content to the search index.

petterreinholdtsen commented 4 years ago

Note: this code does not try to recognize or handle different character encodings. The best way to handle this is probably to use some heuristics when reading the file to guess the encoding, and to fall back to a default (for example ISO-8859-1), or at least to one of the encodings listed in https://lovdata.no/dokument/SF/forskrift/2017-12-19-2286 , if the encoding cannot be detected.
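
To make the idea concrete, here is a rough sketch in Python of the "guess, then fall back" approach. It is not the code in this pull request: the function name `decode_text`, the choice to try UTF-8 first, and the return format are all assumptions made for illustration. A real heuristic would likely consider more encodings than these two.

```python
from typing import Tuple


def decode_text(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a text file.

    Hypothetical helper, not part of the project: it tries strict UTF-8
    first and uses the fallback encoding only if that fails.
    """
    try:
        # "utf-8-sig" decodes plain UTF-8 and also strips a leading BOM.
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so with the
        # default fallback this never fails, though the guess may be wrong.
        return data.decode(fallback), fallback


if __name__ == "__main__":
    text, encoding = decode_text("blåbær".encode("iso-8859-1"))
    print(encoding, text)
```

Trying UTF-8 first works reasonably well because non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8 by accident, so a successful strict decode is a fairly strong signal.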

oleliabo commented 4 years ago

Nice. There are some Linux tools and libraries for guessing encodings; that could be added later.