mysociety / popit

DEPRECATED - Development on PopIt has stopped and it is no longer being maintained
https://goo.gl/Vvej4Q

transformation of data before indexing in Elasticsearch needs to be more liberal in what it will deal with #795

Open mhl opened 9 years ago

mhl commented 9 years ago

At the moment you can POST or PUT updates to PopIt entities that then won't be indexed correctly in Elasticsearch. This is really problematic because it means that results from the collections API endpoints won't match those from the search API endpoints.
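To illustrate the mismatch, here is a minimal sketch in Python of how it could be detected. It is not PopIt's code: `missing_from_search` is a made-up helper, and it assumes the IDs from a collection listing and from a search query have already been fetched by some other means.

```python
def missing_from_search(collection_ids, search_ids):
    """Return, sorted, the IDs that the collection listing has but search lacks."""
    # Anything in the database but absent from the index failed to be indexed.
    return sorted(set(collection_ids) - set(search_ids))
```

Any ID this returns is a record that exists in the database but failed to make it into the index.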

This is of wider scope than https://github.com/mysociety/popit-api/issues/97 (which would still be really desirable to fix) since older PopIt instances already have data in them that won't be included in Elasticsearch on a reindex.

For example, some images have ended up with a created field set to the empty string, which gives this error:

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [images.created]
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse date field [], tried both date format [dateOptionalTime], and timestamp number with locale [null]
Caused by: java.lang.IllegalArgumentException: Invalid format: ""

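... presumably because created has been mapped as a date field in Elasticsearch (dynamic mapping detects dates from the first date-like string it sees in a field), so an empty string can't be parsed.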
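One way to be more liberal about this particular case would be to drop empty date values before the document is sent to Elasticsearch. A minimal sketch in Python, not PopIt's actual transformation: the function name and the `DATE_FIELDS` set are assumptions made up for illustration.

```python
from typing import Any

# Hypothetical: names of fields that the index maps as dates.
DATE_FIELDS = frozenset({"created"})


def strip_empty_dates(value: Any, date_fields=DATE_FIELDS) -> Any:
    """Return a copy of value with empty-string date fields removed."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            # Skip date fields holding "" or whitespace; Elasticsearch
            # cannot parse those as dates.
            if key in date_fields and isinstance(item, str) and not item.strip():
                continue
            cleaned[key] = strip_empty_dates(item, date_fields)
        return cleaned
    if isinstance(value, list):
        return [strip_empty_dates(item, date_fields) for item in value]
    return value
```

The original document is left untouched, so the database would keep whatever was stored and only the copy sent for indexing would change.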

Some records just produce the unhelpful error:

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [images.created]

... e.g. on reindexing membership 5501e2f855f04a60241b0951 in yournextmp.popit.mysociety.org.

As another example, reindexing person 5799 gives you:

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [versions.data.last_party.version_id]
Caused by: org.elasticsearch.ElasticSearchIllegalArgumentException: unknown property [information_source]

There are other examples in the Elasticsearch logs around 2015-03-14 15:40.

In most cases these are due to bad data that got into the database in the past through one bug or another (e.g. there should never be a last_party in versions.data), and I have some scripts to clean up those particular cases. However, such cases seem inevitable with a schema as arbitrarily extensible as PopIt's.
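For the last_party case, a cleanup along these lines might look like the following. This is only a sketch in Python under stated assumptions, not one of the scripts mentioned above: `remove_stray_version_keys` is a hypothetical name, and it assumes each document carries a `versions` list whose entries have a `data` object.

```python
import copy


def remove_stray_version_keys(doc, stray_keys=("last_party",)):
    """Return a copy of doc with unwanted keys removed from versions[].data."""
    cleaned = copy.deepcopy(doc)
    versions = cleaned.get("versions")
    if not isinstance(versions, list):
        return cleaned
    for version in versions:
        data = version.get("data") if isinstance(version, dict) else None
        if isinstance(data, dict):
            for key in stray_keys:
                # pop with a default so a missing key is not an error
                data.pop(key, None)
    return cleaned
```

A fix like this only handles keys that are already known to be wrong, which is why it doesn't generalise to an arbitrarily extensible schema.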

I can't see an easy way of solving this in general for existing broken data, so maybe there's nothing to do here - but https://github.com/mysociety/popit-api/issues/97 would be a great help in keeping broken data out in the first place.