vesoft-inc / nebula-go

Nebula client in Golang
Apache License 2.0

`Space was not chosen` when connecting to LB in cloud with HTTP2 #287

Closed wenhaocs closed 12 months ago

wenhaocs commented 1 year ago

Describe the bug (required) In HTTP/2 mode, a small percentage of requests fail with the error "query failed with error code -1009 and error message SemanticError: Space was not chosen."

Your Environments (required) AWS with LB

wenhaocs commented 1 year ago

Based on an experiment by @HarrisChu: this happens because the LB holds multiple connections to the server side. For every session we create on the client side, we call use space. However, if the LB has more connections than there are sessions, some connections may never be associated with the session that called use space. In addition, an HTTP server is usually stateless: a given request may or may not reuse the connection on which use space was executed.

HarrisChu commented 1 year ago

client -> :8080 -> :9119 and :9449

when

  1. a session is created and use space is executed over the connection to :9119,
  2. the session's space info has not yet been synced to meta,
  3. and a statement is then executed over the connection to :9449,

the statement fails with the Space was not chosen error.

wenhaocs commented 1 year ago

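Confirmed with sc: with Envoy in place, different HTTP streams from the same client may hit different graphd instances. So every time a new session is created, it needs to be written to meta immediately. Add a switch to toggle this behavior.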
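
A rough sketch of the idea, assuming a toy in-memory `metad` shared by two `graphd` instances. The types, the `syncOnUse` field, and the lookup fallback are hypothetical illustrations of "write to meta immediately, behind a switch", not the actual graphd implementation.

```go
package main

import "fmt"

// metad is a toy shared store standing in for the meta service.
type metad struct {
	spaces map[int64]string // session ID -> space name
}

// graphd is a toy graph service instance with a local session cache.
type graphd struct {
	meta      *metad
	syncOnUse bool             // hypothetical switch: push to meta right away
	local     map[int64]string // session ID -> space name
}

func newGraphd(meta *metad, syncOnUse bool) *graphd {
	return &graphd{meta: meta, syncOnUse: syncOnUse, local: map[int64]string{}}
}

// useSpace always updates the local cache and, when the switch is on,
// also writes the session's space to meta immediately.
func (g *graphd) useSpace(sessionID int64, space string) {
	g.local[sessionID] = space
	if g.syncOnUse {
		g.meta.spaces[sessionID] = space
	}
}

// lookupSpace checks the local cache first and falls back to meta.
func (g *graphd) lookupSpace(sessionID int64) (string, bool) {
	if s, ok := g.local[sessionID]; ok {
		return s, true
	}
	s, ok := g.meta.spaces[sessionID]
	if ok {
		g.local[sessionID] = s
	}
	return s, ok
}

// run reports whether a space chosen on one instance is visible on another.
func run(syncOnUse bool) bool {
	meta := &metad{spaces: map[int64]string{}}
	a := newGraphd(meta, syncOnUse)
	b := newGraphd(meta, syncOnUse)
	a.useSpace(1, "demo_space")
	_, ok := b.lookupSpace(1)
	return ok
}

func main() {
	fmt.Println("switch off, space visible on the other graphd:", run(false))
	fmt.Println("switch on, space visible on the other graphd:", run(true))
}
```

The extra write on every use space is the cost of this approach, since each such call now involves meta as well.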

HarrisChu commented 1 year ago

Refer to https://github.com/vesoft-inc/nebula/pull/2833/

xiajingchun commented 1 year ago

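It is settled that this will be implemented by changing the core, i.e. reporting directly to meta every time a session is created or use space is called. Whether to put this behind a flag or do it only on the sc branch is for the development team to decide. This approach has some drawbacks, for example: 1) graphd will report more information to meta on every session claim; 2) retry issues caused by session idle timeout, and so on. As for sticky sessions, even if sc's LB supports them, we would need to modify the fbthrift http2 implementation to support cookies. That approach would also have problems under HPA, so it cannot fully meet the requirements.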