At the moment you can POST or PUT updates to PopIt entities that then won't be indexed correctly in Elasticsearch. This is a real problem because it means that results from the collections API endpoints won't match those from the search API endpoints.
This is of wider scope than https://github.com/mysociety/popit-api/issues/97 (which would still be really desirable to fix) since older PopIt instances already have data in them that won't be included in Elasticsearch on a reindex.
For example, some images have ended up with a created field set to the empty string, which gives this error:
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [images.created]
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse date field [], tried both date format [dateOptionalTime], and timestamp number with locale [null]
Caused by: java.lang.IllegalArgumentException: Invalid format: ""
... presumably since Elasticsearch infers from its name that created is expected to be a timestamp.
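As a rough illustration of how this particular case could be cleaned up before a reindex, here is a minimal Python sketch that drops empty-string values from fields assumed to be mapped as dates. The `DATE_FIELD_NAMES` set, the `strip_empty_dates` helper and the sample document are hypothetical and not part of popit-api; only the `images` and `created` names come from the error above.

```python
from typing import Any

# Assumption: the field names that Elasticsearch has mapped as dates.
# Only "created" is known from the error above; extend as needed.
DATE_FIELD_NAMES = {"created"}


def strip_empty_dates(value: Any) -> Any:
    """Return a copy of `value` with empty-string date fields removed."""
    if isinstance(value, dict):
        return {
            key: strip_empty_dates(child)
            for key, child in value.items()
            # Drop only date-like keys whose value is exactly "".
            if not (key in DATE_FIELD_NAMES and child == "")
        }
    if isinstance(value, list):
        return [strip_empty_dates(item) for item in value]
    return value


# Hypothetical document shaped like the failing case: images[].created == "".
person = {
    "name": "Example Person",
    "images": [
        {"url": "http://example.org/a.png", "created": ""},
        {"url": "http://example.org/b.png", "created": "2015-03-14T15:40:00"},
    ],
}
cleaned = strip_empty_dates(person)
```

The helper returns a new structure rather than mutating the input, so the stored document can be compared with the cleaned one before anything is written back.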
Some records just produce the unhelpful error:
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [images.created]
... e.g. on reindexing membership 5501e2f855f04a60241b0951 in yournextmp.popit.mysociety.org.
As another example, reindexing person 5799 gives you:
There are other examples in the Elasticsearch logs around 2015-03-14 15:40.
In most cases these are caused by bad data that got into the database through one bug or another (e.g. there should never be a last_party in versions.data), and I have some scripts to clean up those particular cases. However, such cases seem inevitable with a schema as arbitrarily extensible as PopIt's.
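For the last_party case, a cleanup could look something like the sketch below. It is not the actual script mentioned above: the `remove_path` helper is hypothetical, and it assumes `versions` is a list of objects that each carry a `data` object, which may not match every record.

```python
from typing import Any, List


def remove_path(doc: Any, path: List[str]) -> int:
    """Delete the key named by `path` from `doc` in place.

    Lists are descended into transparently, so "versions.data.last_party"
    reaches every element of a `versions` list. Returns the number of keys
    removed.
    """
    if isinstance(doc, list):
        return sum(remove_path(item, path) for item in doc)
    if not isinstance(doc, dict) or not path:
        return 0
    head, rest = path[0], path[1:]
    if head not in doc:
        return 0
    if not rest:
        del doc[head]
        return 1
    return remove_path(doc[head], rest)


# Hypothetical record; the structure of `versions` is an assumption.
person = {
    "id": "example",
    "versions": [
        {"data": {"name": "A", "last_party": {"id": "party:1"}}},
        {"data": {"name": "A"}},
    ],
}
removed = remove_path(person, "versions.data.last_party".split("."))
```

Returning a count makes it easy to log which records were actually changed, so only those need to be saved and reindexed.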
I can't see an easy general fix for existing broken data, so maybe there's nothing to do here, but https://github.com/mysociety/popit-api/issues/97 would be a great help in stopping broken data from getting in in the first place.