vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Missing field issue in schema while ingesting the documents #22263

Closed sunil924 closed 2 years ago

sunil924 commented 2 years ago

Schema fields are missing on some of the nodes.

We have an existing Vespa cluster running. For a new requirement, we added new fields of type bool and Array to the existing schema and redeployed the application. The deployment was successful, and no issues or errors appeared in the console logs.

While trying to ingest documents into Vespa using JMeter, some of the requests (2 out of 10) failed with the following error: {"pathId":"/document/v1/myNamespace/mySchema/docid/67b2c355-04a1-4732-8d06-1bcf6218d54d","message":"No field 'KEY1_Array_B' in the structure of type 'mySchema', which has the fields: ........}

The same payload was successfully persisted in subsequent calls. With the cluster in its current state, this issue appears frequently.

All the requests had same fields in payload.

Issue persisted even after redeployment of application.

Environment (please complete the following information):

Infrastructure: RedShift K8s cluster
Version: 7.559.12

Sequence of events:

  1. Added new fields with type Array and Array to the existing schema and tried to deploy.
  2. Deployment failed because Array fields are not supported for bool.
  3. Corrected the schema by changing the Array fields to bool type and redeployed. Deployment was successful.
  4. Tried to ingest documents with JMeter (1 thread, 10 loops).

Can someone please help me understand what the issue is and how to fix it?

jobergum commented 2 years ago

Application re-deployments (changing the application package) are not atomic across all services. In your case, it could be that your feed containers are no longer in contact with the configuration server(s) and are hence running with the old application generation. The vespa-config-status tool can be used to check the application generation across the services in the cluster.

This sample app can be used to study how to set up a high-availability deployment of Vespa: https://github.com/vespa-engine/sample-apps/tree/master/examples/operations/multinode-HA. See also https://docs.vespa.ai/en/operations/configuration-server.html#troubleshooting
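As a rough illustration of the generation check described above, here is a minimal Python sketch that flags services reporting an older config generation than the newest one seen. The service names and generation numbers are hypothetical; in practice the values would be collected from a tool such as vespa-config-status or from each service directly, and this sketch does not attempt that collection step.

```python
def find_stale_services(generations):
    """Return the services whose config generation is behind the newest one seen.

    `generations` maps a service name (hypothetical labels here) to the
    config generation that service currently reports.
    """
    if not generations:
        return {}
    newest = max(generations.values())
    return {name: gen for name, gen in generations.items() if gen < newest}


# Hypothetical values for illustration only.
reported = {"container-0": 12, "container-1": 11, "searchnode-0": 12}
print(find_stale_services(reported))  # {'container-1': 11}
```

A non-empty result would be consistent with the symptom in this issue: requests routed to a stale feed container are validated against the old schema and rejected, while the same payload succeeds on an up-to-date container.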

sunil924 commented 2 years ago

Thanks for your response, Jo. A couple of questions:

  1. In this case, will redeployment not fix the issue?
  2. Will deployment not return an error status/exception even if it fails to deploy to all the services successfully?

jobergum commented 2 years ago

1) Not necessarily, but if this is a transient network split, then yes.
2) Applying the changes is asynchronous and not atomic.

bratseth commented 2 years ago

Some more details: The deploy call succeeds when the new configuration is loaded and validated on the config server. Convergence to the new config on each node happens asynchronously and may take a long time (minutes). If nodes are down or unable to reach the config server, it may of course take even longer, so having the deploy command wait for this is not feasible. The nodes will keep trying to converge to the new config forever, so when missing connectivity is restored the system will self-heal. It is possible to discover the current config generation active on each node in the state/v1 API, and by reading the metric "generation".
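Since the deploy call returns before the nodes converge, a client that needs the new schema everywhere before feeding could poll for convergence itself. The following Python sketch shows one way to do that under stated assumptions: `fetch_generation` is a hypothetical caller-supplied callback that returns the generation a node currently reports (for example read from the state API or the "generation" metric mentioned above), and the node names, timeout, and poll interval are illustrative, not Vespa defaults.

```python
import time


def wait_for_convergence(nodes, fetch_generation, target,
                         timeout_s=600.0, poll_s=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until every node reports at least `target`, or the timeout expires.

    Returns the list of nodes still behind (empty if all converged).
    `fetch_generation(node)` is a hypothetical callback supplied by the caller;
    a node that cannot be reached (OSError) is treated as not yet converged.
    """
    def is_behind(node):
        try:
            return fetch_generation(node) < target
        except OSError:
            return True

    deadline = clock() + timeout_s
    while True:
        pending = [node for node in nodes if is_behind(node)]
        if not pending:
            return []
        if clock() >= deadline:
            return pending
        sleep(poll_s)


# Illustration with an in-memory stand-in for real nodes.
reported = {"node-a": 12, "node-b": 11}
behind = wait_for_convergence(["node-a", "node-b"], reported.__getitem__,
                              target=12, timeout_s=0)
print(behind)  # ['node-b']
```

Returning the lagging nodes instead of raising keeps the decision with the caller: it can delay feeding, alert, or check connectivity between those nodes and the config server.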

jobergum commented 2 years ago

I'm resolving this, see the explanation above from @bratseth.