rode / grafeas-elasticsearch

An implementation of the Grafeas storage backend based on Elasticsearch
Apache License 2.0

Support querying occurrences by built artifacts #56

Closed alexashley closed 3 years ago

alexashley commented 3 years ago

This PR adds support for listing occurrences with a filter against the nested builtArtifacts in a build occurrence.

Elasticsearch mapping

Because of how Elasticsearch flattens documents, we updated the mapping on the occurrence index to treat builtArtifacts as a nested type. This is to allow for accurate filtering, otherwise unrelated results could be returned.
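As an illustrative sketch (the field path and index layout are assumed here, not copied from the actual mapping), the relevant part of the occurrence mapping would declare the array as a nested type:

```json
{
  "mappings": {
    "properties": {
      "build": {
        "properties": {
          "provenance": {
            "properties": {
              "builtArtifacts": {
                "type": "nested"
              }
            }
          }
        }
      }
    }
  }
}
```

Without `"type": "nested"`, Elasticsearch flattens the array of objects into parallel arrays of field values, so a filter could match an `id` from one artifact and a value from a different artifact in the same document.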

However, there are some operational issues with this change. Currently, we only configure the Elasticsearch index mappings when creating a project, so an existing instance of grafeas-elasticsearch would not have its mapping updated. We tried to update the mapping against an existing index, but that isn't allowed:

```json
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "cannot change object mapping from non-nested to nested"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "cannot change object mapping from non-nested to nested"
  },
  "status": 400
}
```

Unfortunately, it doesn't seem possible to do this in-place, and perhaps not without some changes to grafeas-elasticsearch. A migration would likely involve creating a new index with the updated mapping and reindexing the existing documents into it.

Potentially we could allow for this with a single reindex by adding a flag or environment variable that configures a suffix for the index name. An operator could then create the new index, reindex into it, and redeploy grafeas-elasticsearch.
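Sketching that operator flow (the index names and the suffix mechanism are hypothetical; this is not a supported procedure yet):

```
# Create the new index with the nested mapping.
PUT /grafeas-myproject-occurrences-v2
{ ... mapping with builtArtifacts as nested ... }

# Copy documents from the old index into the new one.
POST /_reindex
{
  "source": { "index": "grafeas-myproject-occurrences" },
  "dest":   { "index": "grafeas-myproject-occurrences-v2" }
}

# Then redeploy grafeas-elasticsearch configured with the new suffix.
```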

Filtering updates

Originally we wanted to represent filtering against a nested array of objects with the index operator (`[...]`) or a wildcard (like `foo.*.name == "bar"`). While working on that, we realized that the existing filters treat field names as string literals (e.g., `"resource.uri" == "foo"`); this doesn't combine well with applying an operator to a field, because the operator is only parsed if the filter places the brackets outside the string literal: `"myArrayType"[0]`.
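For illustration (the field names are hypothetical), the shapes in question look like:

```
"resource.uri" == "foo"         // existing style: the whole field path is one string literal
"myArrayType"[0].name == "bar"  // brackets outside the literal: the index operator is parsed
myArrayType[0].name == "bar"    // what we wanted to support more naturally
```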

To allow the index operator to work more naturally, we updated the parsing steps in the filtering package, borrowing a fair bit from cel-go's unparser. This let us handle the select expressions that make up a field like resource.uri, and lets the index operator behave closer to a user's expectations. In the final direction we went, we no longer strictly needed these changes; that said, the syntax should now match expectations, and it shouldn't affect continued use of the string-literal form.

Unfortunately, CEL doesn't actually have the wildcard syntax we thought it did, and the index operator is otherwise a poor fit for representing a nested query: in an expression like foo['*'].id == "abc", our current parsing wouldn't give us enough information to construct the query; we'd only have access to the literal '*' string.

So we experimented with using the filter macro (e.g., foo.filter(t, t.name == "abc")), but the expansion of the macro would force us to handle comprehensions and the conditional operator.

Finally, we settled on adding a new function, nestedFilter, to CEL that we could intercept more easily. We're not happy with the name, but we didn't want to clash with the filter macro. As part of the parsing changes, we also turned off macros, since we don't support parts of their expanded representation. To keep parsing simple, we didn't follow the convention of the macro functions, which take an argument representing the object in the array; that argument turned out to be difficult to remove when constructing the query.

tl;dr: Breaking Elasticsearch schema change and large changes to the filtering package.

ChrisSchreiber commented 3 years ago

Another way to handle schema changes with Elasticsearch is to use index aliases and versioned indexes...

  1. Start off by creating an index and an alias that points to it. The application uses the alias name and not the actual index.
     indices: my-index001
     alias: my-index -> my-index001

  2. Create a new index with the new schema, move documents to the new index, and update the alias.
     indices: my-index001, my-index002
     alias: my-index -> my-index002

This way you can update the schema with only one reindex, and you don't need to change the application config.
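For reference, the alias swap in step 2 can be done atomically with the `_aliases` API (index and alias names here are illustrative):

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "my-index001", "alias": "my-index" } },
    { "add":    { "index": "my-index002", "alias": "my-index" } }
  ]
}
```

Because both actions run in a single request, readers never see a moment where the alias points at no index.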

mrparkers commented 3 years ago

Also, we have an apiVersion for indices, so if we need to make changes, we can switch to v1beta2 instead of v1beta1. I'm not sure what the migration process would look like (or if the onus is on grafeas-elasticsearch to do that migration), but that's at least in place to help make an index change like this slightly easier.