Closed: hankei6km closed this issue 6 years ago.
I'm sorry for confusing you.
I canceled #32 because there was a new bug: when ignore_missing=true is set, the base64 string is left in the content field.
I will consider other fixes.
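For illustration only, here is a minimal sketch of that behaviour using Elasticsearch's _ingest/pipeline/_simulate API. It assumes the ingest-attachment plugin is installed and uses simplified field names taken from this thread, not the app's actual pipeline definition. The base64 payload below decodes to bytes from which Tika extracts no text, so attachment.content is never created; the convert step with ignore_missing: true is then silently skipped, and the raw base64 string stays in the content field:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "attachment": { "field": "content" } },
      { "convert": { "field": "attachment.content", "target_field": "content", "type": "string", "ignore_missing": true } }
    ]
  },
  "docs": [
    { "_source": { "content": "AAAAAAAA" } }
  ]
}

In the simulated result, _source.content still holds "AAAAAAAA" instead of extracted text, which is the leftover base64 described above.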
Hello, I tried to summarize the problem.
An exception occurs when adding a file whose content is not extracted into the attachment.content field. At least the following types of files are not extracted into attachment.content:
Thumbs.db
*.lnk (Windows shortcut files)
*.dat
etc.
Steps to reproduce:
Apply patch ea5920a5a6f8c46bbef546039f385e8846708ac6 to the environment of #27.
Run $ touch data.dat on the local machine.
Upload data.dat to the Nextcloud server.
Index it with the occ command.
Indexing: test/data.dat
result with no content: {"_index":"my_index","_type":"standard","_id":"files:120690","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":332,"_primary_term":1}
Elasticsearch log
index_1 | [2018-06-01T10:06:37,661][WARN ][r.suppressed ] path: /my_index/standard/files%3A120690, params: {pipeline=attachment, index=my_index, id=files:120690, type=standard}
index_1 | org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1 | at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
index_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
index_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
index_1 | at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
index_1 | Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1 | ... 11 more
index_1 | Caused by: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
index_1 | at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:337) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:122) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | at org.elasticsearch.ingest.common.ConvertProcessor.execute(ConvertProcessor.java:169) ~[?:?]
index_1 | at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.2.4.jar:6.2.4]
index_1 | ... 9 more
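For what it's worth, the same failure can be reproduced directly against Elasticsearch, bypassing Nextcloud, by indexing a document with non-extractable base64 content through the pipeline. This is only a sketch: the index, type, and pipeline names come from the log above, the document id is made up, and the exact behaviour depends on the app's full pipeline definition:

PUT /my_index/standard/files:test-dat?pipeline=attachment
{
  "content": "AAAAAAAA"
}

Because Tika extracts no text from that payload, attachment.content is never created and the convert processor raises the same "field [content] not present as part of path [attachment.content]" error.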
Fix: clean the content field with the set processor, then convert attachment.content to text with the convert processor (with ignore_missing).
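Roughly, such a pipeline could look like the sketch below. The real definition is created by the fulltextsearch_elasticsearch app, so the pipeline and field names here are taken from this thread and the exact options may differ:

PUT _ingest/pipeline/attachment
{
  "description": "extract text and never leave raw base64 in the content field",
  "processors": [
    { "attachment": { "field": "content", "target_field": "attachment" } },
    { "set": { "field": "content", "value": "" } },
    { "convert": { "field": "attachment.content", "target_field": "content", "type": "string", "ignore_missing": true } }
  ]
}

With this ordering, a file whose text cannot be extracted simply ends up with an empty content field instead of keeping its base64 payload or failing on the missing attachment.content.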
Results from 0.7.2 and the fixed version (https://github.com/hankei6km/fulltextsearch_elasticsearch/commit/5b740b1bdcfd493c1f84f8f57582c3aeaf6d5925).
0.7.2
GET /my_index/standard/files:203?pretty
{
"_index" : "my_index",
"_type" : "standard",
"_id" : "files:203",
"_version" : 1,
"found" : true,
"_source" : {
"owner" : "user06",
"groups" : [ ],
"circles" : [ ],
"source" : "files_local",
"title" : "test/random.dat",
"users" : [ ],
"content" : "",
"tags" : [
"files_local"
],
"attachment" : {
"content_type" : "application/octet-stream",
"content_length" : 0
},
"provider" : "files",
"parts" : [ ],
"share_names" : {
"user06" : "test/random.dat"
}
}
}
fixed version
GET /my_index/standard/files:203?pretty
{
"_index" : "my_index",
"_type" : "standard",
"_id" : "files:203",
"_version" : 1,
"found" : true,
"_source" : {
"owner" : "user06",
"groups" : [ ],
"circles" : [ ],
"source" : "files_local",
"title" : "test/random.dat",
"users" : [ ],
"content" : "",
"tags" : [
"files_local"
],
"attachment" : {
"content_type" : "application/octet-stream",
"content_length" : 0
},
"provider" : "files",
"parts" : [ ],
"share_names" : {
"user06" : "test/random.dat"
},
"hash" : "8f901125b04fd014544341f3640c1894"
}
}
If a file's content is not extracted into the attachment.content field, the attachment pipeline is exited by the convert processor.
I cannot reproduce it from here; can you send me one of the files that fails to be indexed?
maxence@nextcloud.com
I attached files.zip containing the following files (an empty file could not be attached). The exception occurred when adding the following files to the index:
file_size_zero.txt: file size = 0 (an empty file)
data_random.dat: random data
Additional information, to avoid misunderstanding:
The Elasticsearch log was captured with docker-compose logs -f.
The occ fulltextsearch:index command continued to run after the exception occurred on the Elasticsearch server side; the exception is caught in the ElasticSearchPlatform#indexDocument method in fulltextsearch_elasticsearch/lib/Platform/ElasticSearchPlatform.php.
The attached files (file_size_zero.txt and data_random.dat) were indexed, but the result differs depending on whether set or convert (without ignore_missing) is used.
Using set:
GET my_index/standard/files:295?pretty
{
"_index" : "my_index",
"_type" : "standard",
"_id" : "files:295",
"_version" : 1,
"found" : true,
"_source" : {
"owner" : "user06",
"groups" : [ ],
"circles" : [ ],
"source" : "files_local",
"title" : "test/file_size_zero.txt",
"users" : [ ],
"content" : "",
"tags" : [
"files_local"
],
"attachment" : {
"content_type" : "application/octet-stream",
"content_length" : 0
},
"provider" : "files",
"parts" : [ ],
"share_names" : {
"user06" : "test/file_size_zero.txt"
}
}
}
Using convert (the _source.content field is missing):
GET my_index/standard/files:295?pretty
{
"_index" : "my_index",
"_type" : "standard",
"_id" : "files:295",
"_version" : 1,
"found" : true,
"_source" : {
"share_names" : {
"user06" : "test/file_size_zero.txt"
},
"owner" : "user06",
"users" : [ ],
"groups" : [ ],
"circles" : [ ],
"tags" : [
"files_local"
],
"hash" : "",
"provider" : "files",
"source" : "files_local",
"title" : "test/file_size_zero.txt",
"parts" : [ ]
}
}
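The contrast between the two results above can also be illustrated with the _simulate API. A set processor that copies the extracted text via a mustache template (one plausible way to write the set-based variant; the app's actual definition may differ) renders an empty string when attachment.content is missing, so the document keeps an empty content field instead of failing:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "set": { "field": "content", "value": "{{attachment.content}}" } }
    ]
  },
  "docs": [
    { "_source": { "attachment": { "content_type": "application/octet-stream", "content_length": 0 } } }
  ]
}

A convert processor without ignore_missing on the same document raises the field [content] not present as part of path [attachment.content] error instead.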
I will release 0.8.0 as it is, as I am not able to reproduce your issue on my setup. We'll see if others encounter the same issue.
Thank you for testing the attached files. I will report back if I find a way to avoid the exception through setup changes in my environment.
Just FYI, my log is filled with that error. I'm using the most recent version of everything on Nextcloud 13.
Me too.
Still ugly.
On reindexing, I get
│ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
on one of my servers.
The other server indexes all files. Both use the same Elasticsearch server: one is remote (doesn't work), the other is local (works).
Before the last Debian upgrade, which brought Elasticsearch 6.3.2 (up from 6.2.4), everything worked fine.
Hm. Yes, now I continue to get this error and the log shows
Undefined index: sev at /var/www/nc/apps/fulltextsearch/lib/Command/Index.php#629
From ./occ fulltextsearch:index you can get the id of the file that returns that error. Can you try:
./occ fulltextsearch:document:platform --content files id-of-the-file
It should return details on the file and part of its content. Is the content empty?
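If it helps, the indexed content can also be inspected directly in Elasticsearch with source filtering, in the same style as the GET requests earlier in this thread. The index and type names here are the ones used above, and the placeholder should be replaced with the file id reported by the index command:

GET /my_index/standard/files:<id-of-the-file>?pretty&_source=content,attachment

An empty or missing content field there points to the same extraction problem discussed above.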
Sorry, I don't see where the index command should return the ID of the file. Do I have to add parameters to get more info instead of the plain status screen? I am a bit overwhelmed by the syntax of
/occ fulltextsearch:document:platform --content 'js Family/Voltigieren 2017 07 27/DSC_5154.png' 191889
In the second part of the index interface you have your list of errors; you can navigate through them using 'h' and 'j'. The last line (Index: bookmark:8) should be Index: files:id_of_the_file.
It reminds me that you might still have old errors listed because you need to reset them. You can select the errors and delete them one by one using 'd', or you can reset all errors:
./occ fulltextsearch:index "{\"errors\": \"reset\"}"
OK, I see. No new errors come up. The reset syntax doesn't work for me (the counter is still at 675). The result for one of the errors is:
"contentSize": 0, "tags": [ "files_group_folders" ], "more": [], "excerpts": [], "score": null }, "content": ""
And this is quite understandable, because the .jpg doesn't contain anything the OCR should have found. So let's see whether the complete reindex that is currently running yields anything reasonable. It will take a while before my userId is hit.
Fine. New error: │ Progress: 368/1885
│ Error: 677/677
│ Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@8f60923]; nested: StringIndexOutOfBoundsException[String index out of range: 16];
{ "document": { "id": "184708", "providerId": "files", "access": { "ownerId": "", "viewerId": null, "users": [], "groups": [ "xxxxxxx" ], "circles": [], "links": [] }, "modifiedTime": 0, "title": "XX und JS geteilt\/X\u00f6xxxxxxxxx\/xxxxxxxxx 2.0.pdf", "link": "", "index": null, "source": "files_group_folders", "info": [], "hash": "7ed58605f2591482e8756964b94a95fd", "contentSize": 0, "tags": [ "files_group_folders" ], "more": [], "excerpts": [], "score": null }, "content": "" }
OK, the issue is that the backslashes were lost in my post:
./occ fulltextsearch:index "{\"errors\": \"reset\"}"
About the Tika issue: this is slightly different, as it seems to be a problem with Tika not being able to parse your PDF.
Agreed, the Tika issue is off topic. And agreed, your aforementioned fix seems to do the job. Now I'll wait for the index process to complete. Later on I might file another issue (the fulltextsearch icon is not displayed in the menu bar), but I have to reproduce it first because it doesn't happen on my second server. If I don't file the issue, it's because I couldn't reproduce/verify it.
Thanks for your report.
Regarding the icon: you need to complete a first index before fulltextsearch is enabled in the web UI.
Agreed on this as well. The icon appears. Thumbs up - your fix will make it work for others as well. Thank you so much for jumping in and helping me debug this one.
@hankei6km please re-open the ticket if you think it is necessary
I am trying 1680df66eeb430d97ea4b1542423449c3cd6bc44. It works well in my environment.
Thank you so much for your fix.
Still getting the same empty-content error with Nextcloud 23. Is there any fix for this? I have already deleted all empty files, and I still get this error.
It is my mistake. I did not set the ignore_missing=true option on the convert processor in PR #29. Therefore, the following message was output to the Elasticsearch server log when adding Thumbs.db to the index. (Reverting to the set processor and trying again, no message was output.) I will open a PR with a new fix.
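For reference, that error message can be reproduced in isolation with the _simulate API by running a convert processor without ignore_missing on a document that has an attachment object but no attachment.content. Field names are the ones used in this thread; this only illustrates the missing option, not the app's actual pipeline:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "convert": { "field": "attachment.content", "target_field": "content", "type": "string" } }
    ]
  },
  "docs": [
    { "_source": { "attachment": { "content_type": "application/octet-stream", "content_length": 0 } } }
  ]
}

Adding "ignore_missing": true to that convert processor makes the simulation succeed, which is exactly the option that was missing in PR #29.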