nextcloud / fulltextsearch

🔍 Core of the full-text search framework for Nextcloud
GNU Affero General Public License v3.0

configure filetypes/paths to ignore #236

Open jkoopmann opened 6 years ago

jkoopmann commented 6 years ago

Hi,

My initial index keeps running into memory errors. I just noticed that it always happens on a .m2ts file. The file is several GB in size, and I suspect things go wrong when fulltextsearch passes it to Elasticsearch. Moreover, indexing video files in general might not be a good idea.

How can I tell the plugin or Elasticsearch to ignore certain extensions, file sizes or paths?

On another note: do I need to do anything special to have PDFs, TIFFs, etc. OCRed besides having Tesseract installed?

Regards, JP

ArtificialOwl commented 6 years ago

I will add a limit on file size and also an option to ignore certain extensions.

fb-erik commented 6 years ago

I would suggest ignoring all hidden files by default. There are also Synology folders like @eaDir that shouldn't be indexed and that can't be ignored based on file extension.
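As a rough illustration of this suggestion, the check below treats a path as ignorable when any of its components is hidden (dot-prefixed) or is a known generated folder. It is only a sketch: `is_ignored` and `IGNORED_DIR_NAMES` are hypothetical names, not part of fulltextsearch, and the list contains only the `@eaDir` folder mentioned above.

```python
from pathlib import PurePosixPath

# Assumed example list; @eaDir is the Synology folder named in the comment above.
IGNORED_DIR_NAMES = {"@eaDir"}


def is_ignored(path: str) -> bool:
    """Return True if any path component is hidden or listed in IGNORED_DIR_NAMES.

    Hypothetical helper for illustration only.
    """
    parts = PurePosixPath(path).parts
    return any(
        part.startswith(".") or part in IGNORED_DIR_NAMES
        for part in parts
        if part != ".."  # a parent-directory reference is not a hidden file
    )
```

Because the check looks at every component, a file inside a hidden or generated folder is skipped even when its own name and extension look ordinary.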

ArtificialOwl commented 6 years ago

Adding a .noindex file to a folder makes the app ignore that folder's files and subfolders.
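To make the rule concrete, here is a minimal Python sketch of how a `.noindex` marker could prune a directory walk. It is not the app's implementation: `indexable_files` is a hypothetical helper, and the only behaviour it assumes is the one stated above, namely that a `.noindex` file excludes its folder and everything below it.

```python
import os
from typing import Iterator

NOINDEX_MARKER = ".noindex"  # marker file name taken from the comment above


def indexable_files(root: str) -> Iterator[str]:
    """Yield file paths under root, skipping any folder that holds a .noindex file.

    Hypothetical helper for illustration only; not part of fulltextsearch.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if NOINDEX_MARKER in filenames:
            # Clearing dirnames in place stops os.walk from descending further,
            # so subfolders of a marked folder are skipped as well.
            dirnames[:] = []
            continue
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning at the folder level means nothing below a marked folder is even visited, which matters when that folder holds large media files.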

jkoopmann commented 6 years ago

True. However, typical users will most likely not pay attention to this and create ".noindex" files, will they? :-)

ArtificialOwl commented 6 years ago

This is, however, the simplest way to tell the app not to index an entire directory.

fb-erik commented 6 years ago

The .noindex files work for one-off, static folders, but a lot of hidden folders are generated on the fly as files and folders are created and removed.

In what situation would you want to index hidden (dot-)files?

ArtificialOwl commented 6 years ago

Yes, we totally agree on that point, I was not clear enough.

I will add an option to enable indexing/searching within hidden files.

fb-erik commented 6 years ago

Thanks @daita

theroch commented 3 years ago

The .noindex file excludes the entire folder and the files it contains. But how can I exclude files by pattern? I want to exclude temporary Word files like "~*.docx" from indexing because these files always throw 'java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]'
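For reference, pattern-based exclusion of this kind is usually expressed as shell-style globs matched against the file name. The sketch below shows the idea in Python with the standard `fnmatch` module. It is an assumption about what such a filter could look like, not an existing fulltextsearch option; `is_excluded` and `EXCLUDE_PATTERNS` are hypothetical names, and the patterns are the "~*.docx" example above plus the .m2ts extension from the start of this thread.

```python
import fnmatch
import os
from typing import Sequence

# Example patterns only: "~*.docx" comes from the comment above,
# "*.m2ts" from the video file mentioned at the top of the thread.
EXCLUDE_PATTERNS = ("~*.docx", "*.m2ts")


def is_excluded(path: str, patterns: Sequence[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if the file name (not the full path) matches any glob pattern.

    Hypothetical helper for illustration only.
    """
    name = os.path.basename(path)
    # fnmatchcase compares case-sensitively on every platform.
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
```

Matching is done on the base name only, so "~*.docx" catches a temporary Word file in any folder without affecting regular .docx files.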

t-markmann commented 1 week ago

Do my users have to create the .noindex file in the Nextcloud WebGUI or can I create it as server admin on the filesystem, without it being transparent to the users?