nextcloud / fulltextsearch_elasticsearch

🔍 Use Elasticsearch to index the content of your Nextcloud
https://apps.nextcloud.com/apps/fulltextsearch_elasticsearch
GNU Affero General Public License v3.0

Exception occurs when adding a file whose content is not extracted to the attachment.content field. #31

Closed hankei6km closed 6 years ago

hankei6km commented 6 years ago

This was my mistake. In PR #29 I did not set the ignore_missing=true option on the convert processor. As a result, the following message is written to the Elasticsearch server log when Thumbs.db is added to the index. (When I reverted to the set processor and tried again, no message was output.)

I will open a new PR with the fix.

index_1  | [2018-05-30T11:06:09,055][WARN ][r.suppressed             ] path: /my_index/standard/files%3A120361, params: {pipeline=attachment, index=my_index, id=files:120361, type=standard}
index_1  | org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
index_1  |      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
index_1  |      at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
index_1  | Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      ... 11 more
index_1  | Caused by: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:337) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:122) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.common.ConvertProcessor.execute(ConvertProcessor.java:169) ~[?:?]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      ... 9 more

hankei6km commented 6 years ago

I'm sorry for the confusion.

I withdrew #32 because it introduced a new bug: when ignore_missing=true is set, the base64 string is left in the content field.

I will consider other fixes.
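
To make the two failure modes concrete, here is a minimal Python sketch. It is a simplified model written for this thread, not Elasticsearch code: `convert` is a hypothetical stand-in for the convert processor, and the document shape (base64 in content, no attachment.content) is an assumption based on the reports above.

```python
import base64


def convert(doc, field, target_field, ignore_missing=False):
    """Simplified model of a convert step: copy `field` to `target_field` as text."""
    node = doc
    for key in field.split("."):
        if not isinstance(node, dict) or key not in node:
            if ignore_missing:
                return doc  # skip silently; the document is left untouched
            raise ValueError(f"field [{key}] not present as part of path [{field}]")
        node = node[key]
    doc[target_field] = str(node)
    return doc


# Assumed document shape after the attachment step found no text to extract:
# the uploaded base64 payload is still in "content".
raw = base64.b64encode(b"\x00\x01binary bytes").decode("ascii")
doc = {"content": raw, "attachment": {"content_type": "application/octet-stream"}}

# Without ignore_missing the lookup fails, as in the server log.
try:
    convert(dict(doc), "attachment.content", "content")
    raised = False
except ValueError:
    raised = True

# With ignore_missing the step is skipped, so the base64 string stays in "content".
result = convert(dict(doc), "attachment.content", "content", ignore_missing=True)
print(raised, result["content"] == raw)
```

In this model neither setting alone gives a clean content field, which is why a further step that clears the field is needed.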

hankei6km commented 6 years ago

Hello, I tried to summarize the problem.

Problem

An exception occurs when adding a file whose content is not extracted to the attachment.content field.

At least the following types of files are not extracted to the attachment.content field.

Environment

The environment of #27 with patch ea5920a5a6f8c46bbef546039f385e8846708ac6 applied.

Steps to reproduce

occ command

 . Indexing: test/data.dat
  result with no content: {"_index":"my_index","_type":"standard","_id":"files:120690","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":332,"_primary_term":1}

Elasticsearch log

index_1  | [2018-06-01T10:06:37,661][WARN ][r.suppressed             ] path: /my_index/standard/files%3A120690, params: {pipeline=attachment, index=my_index, id=files:120690, type=standard}
index_1  | org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
index_1  |      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
index_1  |      at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
index_1  | Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      ... 11 more
index_1  | Caused by: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1  |      at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:337) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:122) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      at org.elasticsearch.ingest.common.ConvertProcessor.execute(ConvertProcessor.java:169) ~[?:?]
index_1  |      at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1  |      ... 9 more
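
The WARN line at the top of the trace already identifies the failing document: the path segment files%3A120690 is the URL-encoded document ID. A small Python sketch for pulling it out of a log line follows; `document_from_log` is a hypothetical helper, and the regular expression assumes the exact line layout shown above.

```python
import re
from urllib.parse import unquote

LOG_LINE = (
    "index_1  | [2018-06-01T10:06:37,661][WARN ][r.suppressed             ] "
    "path: /my_index/standard/files%3A120690, params: {pipeline=attachment, "
    "index=my_index, id=files:120690, type=standard}"
)


def document_from_log(line):
    """Extract index, type and decoded document ID from an r.suppressed WARN line."""
    match = re.search(r"path: /([^/]+)/([^/]+)/([^,\s]+),", line)
    if match is None:
        return None
    index, doc_type, raw_id = match.groups()
    # %3A is the percent-encoded ':' separating provider and file ID.
    return {"index": index, "type": doc_type, "id": unquote(raw_id)}


doc = document_from_log(LOG_LINE)
print(doc)
```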

Workaround

Clear the content field with a set processor, and convert attachment.content to text with a convert processor (with ignore_missing).

Results for 0.7.2 and the fixed version (https://github.com/hankei6km/fulltextsearch_elasticsearch/commit/5b740b1bdcfd493c1f84f8f57582c3aeaf6d5925):
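
As a rough illustration of this workaround, the Python sketch below builds a pipeline body with the three steps in the order the description implies. The processor options and their order are my reading of the sentence above, not the contents of the linked commit; the pipeline name attachment is taken from the log parameters (pipeline=attachment).

```python
import json

# Assumed pipeline body: extract, clear the base64 payload, then copy the
# extracted text back if there is any.
pipeline = {
    "description": "attachment pipeline with cleared content field (sketch)",
    "processors": [
        # 1. Extract text from the base64 payload in "content".
        {"attachment": {"field": "content"}},
        # 2. Clear "content" so no base64 string is left behind.
        {"set": {"field": "content", "value": ""}},
        # 3. Copy the extracted text back; skipped when nothing was extracted.
        {
            "convert": {
                "field": "attachment.content",
                "target_field": "content",
                "type": "string",
                "ignore_missing": True,
            }
        },
    ],
}

body = json.dumps(pipeline, indent=2)
print(body)
```

A body like this would be registered with a PUT request to _ingest/pipeline/attachment. With this order, a file that yields no text ends up with an empty content field, which matches the "content" : "" values in the results in this thread.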

0.7.2

GET /my_index/standard/files:203?pretty

{
  "_index" : "my_index",
  "_type" : "standard",
  "_id" : "files:203",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "owner" : "user06",
    "groups" : [ ],
    "circles" : [ ],
    "source" : "files_local",
    "title" : "test/random.dat",
    "users" : [ ],
    "content" : "",
    "tags" : [
      "files_local"
    ],
    "attachment" : {
      "content_type" : "application/octet-stream",
      "content_length" : 0
    },
    "provider" : "files",
    "parts" : [ ],
    "share_names" : {
      "user06" : "test/random.dat"
    }
  }
}

fixed version

GET /my_index/standard/files:203?pretty

{
  "_index" : "my_index",
  "_type" : "standard",
  "_id" : "files:203",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "owner" : "user06",
    "groups" : [ ],
    "circles" : [ ],
    "source" : "files_local",
    "title" : "test/random.dat",
    "users" : [ ],
    "content" : "",
    "tags" : [
      "files_local"
    ],
    "attachment" : {
      "content_type" : "application/octet-stream",
      "content_length" : 0
    },
    "provider" : "files",
    "parts" : [ ],
    "share_names" : {
      "user06" : "test/random.dat"
    },
    "hash" : "8f901125b04fd014544341f3640c1894"
  }
}

Downside

ArtificialOwl commented 6 years ago

I cannot reproduce it here. Can you send me one of the files that fail to be indexed?

maxence@nextcloud.com

hankei6km commented 6 years ago

I attached files.zip containing the following files (an empty file could not be attached). An exception occurs when the following files are added to the index: file_size_zero.txt, data_random.dat.

Additional information to avoid misunderstanding.

Using the set processor

GET my_index/standard/files:295?pretty

{
  "_index" : "my_index",
  "_type" : "standard",
  "_id" : "files:295",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "owner" : "user06",
    "groups" : [ ],
    "circles" : [ ],
    "source" : "files_local",
    "title" : "test/file_size_zero.txt",
    "users" : [ ],
    "content" : "",
    "tags" : [
      "files_local"
    ],
    "attachment" : {
      "content_type" : "application/octet-stream",
      "content_length" : 0
    },
    "provider" : "files",
    "parts" : [ ],
    "share_names" : {
      "user06" : "test/file_size_zero.txt"
    }
  }
}

Using the convert processor (the _source.content field is missing)

GET my_index/standard/files:295?pretty

{
  "_index" : "my_index",
  "_type" : "standard",
  "_id" : "files:295",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "share_names" : {
      "user06" : "test/file_size_zero.txt"
    },
    "owner" : "user06",
    "users" : [ ],
    "groups" : [ ],
    "circles" : [ ],
    "tags" : [
      "files_local"
    ],
    "hash" : "",
    "provider" : "files",
    "source" : "files_local",
    "title" : "test/file_size_zero.txt",
    "parts" : [ ]
  }
}

ArtificialOwl commented 6 years ago

I will release 0.8.0 as it is, as I am not able to reproduce your issue on my setup. We'll see if others encounter the same issue.

hankei6km commented 6 years ago

Thank you for testing the attached files. I will report back if I find a way to avoid the exception through the setup of my environment.

mobamoba commented 6 years ago

Just FYI, my log is filled with that error. I'm using the most recent version of everything on Nextcloud 13.

whlsxl commented 6 years ago

Me Too

joergmschulz commented 6 years ago

Still ugly. On reindexing, one of my servers reports: Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]. The other server indexes all files. Both use the same Elasticsearch server: one is remote (doesn't work), one is local (works). Everything worked fine before the last Debian upgrade, which moved Elasticsearch from 6.2.4 to 6.3.2.

ArtificialOwl commented 6 years ago

Could you try this: https://github.com/nextcloud/fulltextsearch_elasticsearch/commit/1680df66eeb430d97ea4b1542423449c3cd6bc44

joergmschulz commented 6 years ago

Hm. Yes, I still get this error, and the log shows: Undefined index: sev at /var/www/nc/apps/fulltextsearch/lib/Command/Index.php#629

ArtificialOwl commented 6 years ago

From the ./occ fulltextsearch:index output you can get the ID of the file that returns that error. Can you try:

./occ fulltextsearch:document:platform --content files id-of-the-file

It should return details on the file and part of its content. Is the content empty?

joergmschulz commented 6 years ago

Sorry, I don't see where the index command returns the ID of the file. Do I have to add parameters to get more information than the plain status screen? I am a bit overwhelmed by the syntax of /occ fulltextsearch:document:platform --content 'js Family/Voltigieren 2017 07 27/DSC_5154.png' 191889

ArtificialOwl commented 6 years ago

[screenshot: selection_019]

In the second part of the index interface you have your list of errors. You can navigate through them using 'h' and 'j'. The last line (Index: bookmark:8 in the screenshot) should read Index: files:id_of_the_file
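
As a small illustration, that last line can be split into the two arguments that the fulltextsearch:document:platform command quoted earlier expects. This is only a Python sketch: `occ_args_from_index_line` is a hypothetical helper, and the argument order (provider, then ID) is taken from that command.

```python
def occ_args_from_index_line(line):
    """Turn a line such as 'Index: files:120690' into an occ argument list."""
    label, _, identifier = line.partition(":")
    if label.strip() != "Index":
        raise ValueError(f"not an index line: {line!r}")
    provider, _, document_id = identifier.strip().partition(":")
    if not provider or not document_id:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return [
        "./occ",
        "fulltextsearch:document:platform",
        "--content",
        provider,
        document_id,
    ]


args = occ_args_from_index_line("Index: files:120690")
print(" ".join(args))
```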

ArtificialOwl commented 6 years ago

That reminds me: you might still see errors because you need to reset them. You can select the errors and delete them one by one using 'd', or you can reset all errors:

./occ fulltextsearch:index "{\"errors\": \"reset\"}"

joergmschulz commented 6 years ago

OK, I see. No new errors come up. The reset syntax doesn't work for me (the counter is still at 675). The result for one of the errors is:

"contentSize": 0, "tags": [ "files_group_folders" ], "more": [], "excerpts": [], "score": null }, "content": ""

And this is quite understandable, because the .jpg doesn't contain anything the OCR could have found. So let's see whether the complete reindex that is currently running yields anything reasonable. It will take a while before my userId is reached.

joergmschulz commented 6 years ago

Fine. New error:

│ Progress: 368/1885
│ Error:    677/677
│ Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@8f60923]; nested: StringIndexOutOfBoundsException[String index out of range: 16];

{ "document": { "id": "184708", "providerId": "files", "access": { "ownerId": "", "viewerId": null, "users": [], "groups": [ "xxxxxxx" ], "circles": [], "links": [] }, "modifiedTime": 0, "title": "XX und JS geteilt\/X\u00f6xxxxxxxxx\/xxxxxxxxx 2.0.pdf", "link": "", "index": null, "source": "files_group_folders", "info": [], "hash": "7ed58605f2591482e8756964b94a95fd", "contentSize": 0, "tags": [ "files_group_folders" ], "more": [], "excerpts": [], "score": null }, "content": "" }

ArtificialOwl commented 6 years ago

OK, the issue is that the backslashes were lost in my post:

   ./occ fulltextsearch:index "{\"errors\": \"reset\"}"
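
One way to avoid hand-escaping the quotes is to let a script build the argument. The Python sketch below is only an illustration using the standard library: `occ_reset_command` is a hypothetical helper, and the command name and JSON option are the ones shown above.

```python
import json
import shlex


def occ_reset_command():
    """Build the reset command with the JSON option safely shell-quoted."""
    payload = json.dumps({"errors": "reset"})
    # shlex.quote wraps the JSON in single quotes, so no backslashes are needed.
    return "./occ fulltextsearch:index " + shlex.quote(payload)


command = occ_reset_command()
print(command)

# Round trip: the shell would hand occ exactly the original JSON string.
last_argument = shlex.split(command)[-1]
print(json.loads(last_argument))
```

Single quotes around the JSON are equivalent to the backslash-escaped double-quoted form in a POSIX shell and are harder to damage when pasted into a comment.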

ArtificialOwl commented 6 years ago

About the Tika issue: this is slightly different, as it seems that Tika is not able to parse your PDF.

joergmschulz commented 6 years ago

Agreed, the Tika issue is off-topic. And, agreed, your aforementioned fix seems to do the job. Now I'll wait for the index process to complete. Later on I might file another issue (the fulltextsearch icon is not displayed in the menu bar), but I have to reproduce it first because it doesn't happen on my second server. If I don't file the issue, it means I couldn't reproduce or verify it.

ArtificialOwl commented 6 years ago

Thanks for your report.

Regarding the icon, you need to complete a first index before fulltextsearch is enabled in the web UI.

joergmschulz commented 6 years ago

Agreed on this as well; the icon appears. Thumbs up, your fix will make it work for others as well. Thank you so much for jumping in and helping me to debug this one.

ArtificialOwl commented 6 years ago

@hankei6km please re-open the ticket if you think it is necessary

hankei6km commented 6 years ago

I am trying 1680df66eeb430d97ea4b1542423449c3cd6bc44. It works well in my environment.

Thank you so much for your fix.

dmuensterer commented 2 years ago

I am still getting the same empty-content error with Nextcloud 23. Is there any way to fix this? I have even deleted all empty files, but the error persists.