hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Ensure URIs in id field #85

Closed by dr0i 8 years ago

dr0i commented 8 years ago

Around 4k documents are still dropped when indexing into Elasticsearch because some field values don't meet the mapping's expectations (being an ID or a string). Examples that trigger org.elasticsearch.index.mapper.MapperParsingException:

| object mapping | value |
|---|---|
| sameAs | http://worldcat.org/oclc/ocm00895971^ |
| similar | http://dx.doi.org/10.4128/ 9781606493311 |
| seeAlso | ftp://ppftpuser:welcome@ftp01.penguingroup.com/Booksellers |
| tableOfContents | www.folkwang-uni.de/fileadmin/medien/Downloads/Bibliothek/Inhaltsverzeichnisse/HT018120416.pdf |

The fixes for #80 and #69 are not general enough to cover these field normalizations as well. The question is what to do. One option is a metafacture URI validator; another is a URI fixer that tries to repair a broken URI. The latter would mean writing heuristics that can repair some URIs, but not all. It would also be possible to have just a validator and simply reject fields that are not correctly catalogued.
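To make the two options concrete, here is a minimal sketch in Python rather than as a metafacture module. Everything in it is an assumption for illustration: the names `is_valid_uri` and `try_fix_uri`, the scheme whitelist, and the three repair heuristics (drop whitespace, strip trailing characters that RFC 3986 does not allow, prepend `http://` to values starting with `www.`). It is not the URI enforcer used in lobid-resources. Note that the `ftp://` example above passes this check, so the sketch does not explain every rejection.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical whitelist of schemes accepted in URI-valued fields.
ALLOWED_SCHEMES = {"http", "https", "ftp", "urn"}

# Characters RFC 3986 permits in a URI (unreserved, reserved, and "%").
_URI_CHAR_CLASS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"
_ONLY_URI_CHARS = re.compile(f"[{_URI_CHAR_CLASS}]+")
_TRAILING_JUNK = re.compile(f"[^{_URI_CHAR_CLASS}]+$")


def is_valid_uri(value: str) -> bool:
    """Validator: accept only values that look like an absolute URI."""
    if not _ONLY_URI_CHARS.fullmatch(value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:  # e.g. malformed IPv6 brackets
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # Hierarchical schemes need a host; a URN only needs a path.
    return bool(parts.path) if parts.scheme == "urn" else bool(parts.netloc)


def try_fix_uri(value: str) -> Optional[str]:
    """Fixer: apply a few heuristics, return None if the value stays invalid."""
    candidate = re.sub(r"\s+", "", value)          # drop all whitespace
    candidate = _TRAILING_JUNK.sub("", candidate)  # strip trailing junk like "^"
    if candidate.startswith("www."):               # assume a missing scheme
        candidate = "http://" + candidate
    return candidate if is_valid_uri(candidate) else None
```

With the validator alone, a rejected value would simply be dropped. The fixer trades that loss for the risk of guessing wrong, for example when removing the space inside the DOI does not yield the DOI the cataloguer meant.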

dr0i commented 8 years ago

Indexing with the current URI enforcer recovers 90% of the lost records. Still, 500 documents are not being indexed, so there is room for improvement (see also the helpful comments from @fsteeg). I will switch the index to production (resulting in +4500 documents) while working on this ticket.

dr0i commented 8 years ago

@acka47 please also have a look at Q2 in https://github.com/hbz/lobid-resources/pull/87#issuecomment-230430184. The "empty strings" mentioned there are produced when URLs are recognized as broken; see e.g. the URN in the seeAlso field of http://lobid.org/resource/HT004426370. Such an entry is ignored, see http://gaia.hbz-nrw.de:9200/resources/resource/HT004426370.

acka47 commented 8 years ago

I agree with @dr0i to drop the triples in question for now and provide the Verbundgruppe with statistics on incorrect URL input. Maybe they will

  1. fix the entries and/or
  2. add a validation to field 655 u

If they react we can provide a list of affected resources.
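One way such statistics and the list of affected resources could be gathered, as a rough Python sketch. The triple layout `(resource_id, field, value)`, the function names, the resource IDs, and the deliberately crude validity check are all assumptions for illustration, not part of the lobid-resources pipeline.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit

Triple = Tuple[str, str, str]  # (resource_id, field, value), hypothetical layout


def has_scheme_and_no_spaces(value: str) -> bool:
    """Crude stand-in for a real URI validator."""
    return " " not in value and bool(urlsplit(value).scheme)


def rejected_stats(
    triples: Iterable[Triple], is_valid: Callable[[str], bool]
) -> Tuple[Counter, Dict[str, Set[str]]]:
    """Count rejected values per field and collect the affected resource IDs."""
    per_field: Counter = Counter()
    affected: Dict[str, Set[str]] = defaultdict(set)
    for resource_id, field, value in triples:
        if not is_valid(value):
            per_field[field] += 1
            affected[field].add(resource_id)
    return per_field, affected


# Usage with made-up resource IDs:
sample = [
    ("R1", "similar", "http://dx.doi.org/10.4128/ 9781606493311"),
    ("R2", "tableOfContents", "www.example.org/toc.pdf"),
    ("R3", "sameAs", "http://lobid.org/resources"),
]
per_field, affected = rejected_stats(sample, has_scheme_and_no_spaces)
```

The per-field counts would serve as the statistics, and the ID sets as the list of affected resources if the Verbundgruppe asks for it.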

dr0i commented 8 years ago

Deployed to production. With this, the API 2.0 data is indexed in Elasticsearch 2.3.3 on quaoar1.