Closed by dr0i 8 years ago
Indexing with the current URI enforcer fixes 90% of the lost records. Still, 500 documents are not being indexed, so there is room for improvement (see also the helpful comments from @fsteeg). Will switch the index to production (resulting in +4500 documents) while working on this ticket.
@acka47 please also have a look at Q2 in https://github.com/hbz/lobid-resources/pull/87#issuecomment-230430184. The mentioned "empty strings" are produced when URLs are recognized as broken, see e.g. the urn in field seeAlso of http://lobid.org/resource/HT004426370. The entry will be ignored, see http://gaia.hbz-nrw.de:9200/resources/resource/HT004426370.
I agree with @dr0i to drop the triples in question for now and provide the Verbundgruppe with statistics of incorrect URL input. Maybe they will
If they react, we can provide a list of affected resources.
Deployed to production. With this, the API 2.0 data is indexed in Elasticsearch 2.3.3 on quaoar1.
There are still some (around 4k) documents dropped when indexing them into Elasticsearch because their field values don't fulfill the expectations (being an ID or a String). Examples for org.elasticsearch.index.mapper.MapperParsingException:
The fixes for #80 and #69 are not abstract enough to also cover these field normalizations. The question is what to do. One option is a metafacture URI validator; another is a URI fixer that tries to repair a URI. The latter would mean writing algorithms that could fix some URIs, but not all. It would also be possible to have just a validator and simply reject fields that are not correctly catalogued.
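To make the two options concrete, here is a rough, language-neutral sketch in Python. It is not metafacture code and not what the project implements. The function names `is_valid_uri` and `try_fix_uri`, the set of allowed schemes, and the two repair heuristics (trim whitespace and percent-encode, prepend `http://` to a bare `www.` host) are all assumptions made for illustration.

```python
from urllib.parse import quote, urlsplit

# Assumption: only these schemes are expected in URI-valued fields.
ALLOWED_SCHEMES = {"http", "https", "urn"}


def is_valid_uri(value):
    """Validator: True only for an absolute URI with an allowed scheme."""
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:  # e.g. malformed IPv6 host
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if parts.scheme == "urn":
        # Expect urn:<nid>:<nss> with both parts non-empty.
        nid, sep, nss = parts.path.partition(":")
        return bool(nid and sep and nss)
    return bool(parts.netloc)


def try_fix_uri(value):
    """Fixer: apply a few cheap repairs; return None if still invalid."""
    candidate = value.strip()
    if candidate.lower().startswith("www."):
        candidate = "http://" + candidate
    # Percent-encode spaces and other illegal characters, keeping
    # reserved characters and existing %-escapes untouched.
    candidate = quote(candidate, safe=":/?#[]@!$&'()*+,;=%-._~")
    return candidate if is_valid_uri(candidate) else None


def clean_field(values, fix=False):
    """Split field values into (kept, rejected); optionally try to fix."""
    kept, rejected = [], []
    for value in values:
        result = value if is_valid_uri(value) else (try_fix_uri(value) if fix else None)
        if result is None:
            rejected.append(value)
        else:
            kept.append(result)
    return kept, rejected
```

With `fix=False` this behaves like the validator-only option: anything that is not a well-formed URI is rejected. The `rejected` list could feed the statistics for the cataloguers. With `fix=True` some values are repaired, but, as noted above, never all of them.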