jprante / elasticsearch-knapsack

Knapsack plugin is an import/export tool for Elasticsearch
Apache License 2.0

Error while copying an index #84

Closed jmreymond closed 8 years ago

jmreymond commented 9 years ago

I copied an index within the local cluster, and the process stopped with this message:

[2015-09-28 08:26:00,008][WARN ][org.apache.pdfbox.pdfparser.PDFParser] Parsing Error, Skipping Object
java.io.IOException: Push back buffer is full
    at java.io.PushbackInputStream.unread(PushbackInputStream.java:232)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:143)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:132)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:572)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:133)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:506)
    at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:446)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:706)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:497)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:544)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
    at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:493)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:409)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:148)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Is it possible to catch the error and continue?

jprante commented 9 years ago

This issue comes from PDF indexing and is probably caused by the attachment mapper plugin, not by knapsack.

Please update to the latest attachment mapper plugin version, if possible, or open an issue there.

If this is a memory problem, consider switching off replicas for the knapsack import (the issue happens during replica processing).
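For readers who want to try the replica suggestion, here is a minimal sketch of setting `number_of_replicas` to 0 through the Elasticsearch index settings API (`PUT /<index>/_settings`). The host `http://localhost:9200`, the index name `target_index`, and the helper `build_replica_request` are assumptions for illustration and are not taken from this thread. The sketch only builds the request; sending it is left as a commented step.

```python
import json
import urllib.request


def build_replica_request(base_url, index, replicas):
    """Build (but do not send) a PUT request for the index settings API."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host and index name; adjust to the target of the knapsack import.
    req = build_replica_request("http://localhost:9200", "target_index", 0)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To apply the setting: urllib.request.urlopen(req)
```

After the import has finished, the same request with the original replica count would restore replication.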

jmreymond commented 9 years ago

The source index is correct and was indexed by Elasticsearch. I could remove the document, but the message gives no document name, so it would be very difficult to find and remove.
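One way to narrow down an unnamed document is to run each attachment through a parser outside the cluster and record the ids that fail. This is only a rough sketch: the `(doc_id, base64_attachment)` input shape, the helper `find_unparseable`, and the stand-in `toy_parse` are all hypothetical, and a real check would need the same PDF parser that the attachment mapper uses.

```python
import base64


def find_unparseable(docs, parse):
    """Return (doc_id, error message) pairs for attachments that make `parse` raise.

    docs  -- iterable of (doc_id, base64_attachment) pairs (assumed shape)
    parse -- hypothetical callable that takes raw bytes and raises on bad input
    """
    failed = []
    for doc_id, encoded in docs:
        try:
            parse(base64.b64decode(encoded))
        except Exception as exc:  # catch per document and continue with the next one
            failed.append((doc_id, str(exc)))
    return failed


def toy_parse(raw):
    # Stand-in for a real PDF parser: only checks the "%PDF-" file signature.
    if not raw.startswith(b"%PDF-"):
        raise ValueError("missing PDF header")


if __name__ == "__main__":
    sample = [
        ("1", base64.b64encode(b"%PDF-1.4 ...")),
        ("2", base64.b64encode(b"not a pdf")),
    ]
    print(find_unparseable(sample, toy_parse))
```

Catching the exception per document keeps the scan going, so a single bad PDF does not hide others further along.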

jprante commented 8 years ago

As said, this error is not caused by the knapsack plugin; it occurs when the PDF is processed.