swissbib / searchconf

Contains configurations and descriptions of swissbib search servers

parsing solrDocuments errors with SAX #39

Closed guenterh closed 6 years ago

guenterh commented 6 years ago

Occasionally, SAX parsing of some Solr documents throws errors for specific fields.

1) Processing of time_stamp fields such as freshness, time_indexed and time_processed. Mysteriously, this error occurs for only about 200 documents. The negative side effect is that the server discards every document in the affected package during indexing, so in the end around 200,000 documents weren't indexed at all, which isn't acceptable.
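To narrow down which of the roughly 200 documents are affected, one option is to check the timestamp values before they are sent. The sketch below is only a diagnostic aid, not the cause analysis: it assumes each document is available as a dict of strings (a hypothetical representation) and tests the three fields named above against a simplified pattern for Solr's canonical UTC date format (for example `2019-03-01T12:00:00Z`).

```python
import re

# Simplified pattern for Solr's canonical date format:
# UTC, ISO-8601, optional fractional seconds. Value ranges are not checked.
SOLR_DATE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z")

# Field names taken from the issue text.
TIMESTAMP_FIELDS = ("freshness", "time_indexed", "time_processed")


def suspicious_timestamps(doc):
    """Return {field: value} for timestamp fields not in canonical form.

    `doc` is assumed to be a dict mapping field names to string values.
    """
    bad = {}
    for field in TIMESTAMP_FIELDS:
        value = doc.get(field)
        if value is not None and not SOLR_DATE.fullmatch(value):
            bad[field] = value
    return bad
```

Logging the non-empty results per package would show whether the failing documents share a common malformed value.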

Because I can't find the cause of this behavior right away, I implemented a workaround: https://github.com/swissbib/indexerSolrClient/commit/ae00e0d61148e12823d23c859bb2190da544b726
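For illustration only (this is not necessarily what the linked commit does): a common way to keep one bad document from sinking a whole package is to resend the package document by document when the batch is rejected. The sketch assumes a hypothetical `send` callable that takes a list of documents and raises when the server rejects it.

```python
def index_with_fallback(docs, send):
    """Send docs as one batch; if that fails, resend them one at a time.

    `send` is a hypothetical callable that takes a list of documents and
    raises an exception when the server rejects the request.
    Returns a list of (doc, error) pairs for documents that still failed.
    """
    try:
        send(docs)
        return []
    except Exception:
        failed = []
        for doc in docs:
            try:
                send([doc])
            except Exception as err:
                # Only the offending document is lost, not the whole package.
                failed.append((doc, err))
        return failed
```

The slow path only runs for packages that actually fail, so the cost stays small if failures are as rare as described. Resending documents that were already accepted should be harmless as long as they carry a unique key, since Solr then overwrites the earlier copy.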

Another possibility: try a different parser (a Java pull parser).

I think for the sake of simplicity and efficiency we can live with this solution for STP.

guenterh commented 6 years ago

Another parsing error occurs for the document with id 358773261 (the exception message is at the bottom of this comment).

For the Solr document, follow the link parse.error.txt

Because it's the only doc at the moment we should disregard it STP and come back to it later.

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at [ server address ] Async exception during distributed update: Error from server at [specific address ] Bad Request

request: [xxx internal address]
Remote error message: Exception writing document id 358773261 to the index; possible analysis error: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=339,endOffset=346,lastStartOffset=340 for field 'title_alt'
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
    at org.swissbib.solr.IndexerMFClient.process(IndexerMFClient.java:138)
    at org.swissbib.solr.ExtendedFileOpener.process(ExtendedFileOpener.java:76)
    at org.swissbib.solr.ExtendedFileOpener.process(ExtendedFileOpener.java:22)
    at org.metafacture.files.DirReader.dir(DirReader.java:79)
    at org.metafacture.files.DirReader.dir(DirReader.java:76)
    at org.metafacture.files.DirReader.process(DirReader.java:57)
    at org.metafacture.files.DirReader.process(DirReader.java:35)
    at org.metafacture.flux.parser.StringSender.process(StringSender.java:38)
    at org.metafacture.flux.parser.Flow.start(Flow.java:110)
    at org.metafacture.flux.parser.FluxProgramm.start(FluxProgramm.java:156)
    at org.metafacture.runner.Flux.main(Flux.java:79)
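The remote error message names both the document id and the offending field, so if more such cases turn up they could be collected automatically for later follow-up. A minimal sketch, assuming the message wording stays as in the exception above (the function name and patterns are illustrative, not part of any Solr API):

```python
import re

# Patterns derived from the wording of the error message in this issue.
ID_RE = re.compile(r"Exception writing document id (\S+) to the index")
FIELD_RE = re.compile(r"for field '([^']+)'")


def failing_doc(message):
    """Return (doc_id, field) parsed from the message; None where absent."""
    id_match = ID_RE.search(message)
    field_match = FIELD_RE.search(message)
    return (
        id_match.group(1) if id_match else None,
        field_match.group(1) if field_match else None,
    )
```

Applied to the message above, this would yield the id 358773261 and the field title_alt.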

guenterh commented 6 years ago

for the time being closed