mtth / hdfs

API and command line interface for HDFS
https://hdfscli.readthedocs.io
MIT License

Support failover for HA nodes #142

Open skyegecko opened 5 years ago

skyegecko commented 5 years ago

Hi,

It seems that HdfsCli doesn't cope well when the WebHDFS namenode it is connected to shuts down, even when an alternative node is available.

I don't have a minimal reproducer and my logging isn't 100% clear, but if I'm not mistaken the following happened during a run of an application I'm testing (a sketch of the client setup follows the list):

  1. The cluster administrator initiated a restart of the namenodes to apply a settings update. There are two namenodes in the cluster.
  2. When HdfsCli attempts its first connection, the primary node 001 is in the process of restarting and is unavailable. It successfully connects to the alternative node 002.
  3. Node 001 has finished restarting. Node 002 goes offline to restart.
  4. HdfsCli loses its HTTP connection to node 002. It retries its connection to node 002. It does not attempt a new connection to node 001.

Stacktrace from Python (names changed to protect the innocent):

  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python3.4/http/client.py", line 1137, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python3.4/http/client.py", line 1182, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python3.4/http/client.py", line 1133, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python3.4/http/client.py", line 963, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.4/http/client.py", line 898, in send
    self.connect()
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe008d72fd0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=<node 002>, port=50070): Max retries exceeded with url: <webhdfs path>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/grid/3/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000001/<our package>", line 267, in extract_map
  File "/usr/lib64/python3.4/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 686, in read
    buffersize=buffer_size,
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 125, in api_handler
    raise err
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 107, in api_handler
    **self.kwargs
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 214, in _request
    **kwargs
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host=<node 002>, port=50070): Max retries exceeded with url: <webhdfs path>

Looking at issue #18, you rightly state that handling failovers is difficult. Is this something you're willing to look into? I can attempt to create a minimal reproducer or test case, but I'd rather not if you regard this issue as out of scope.

Many thanks!

mtth commented 4 years ago

Hi @skyegecko, apologies for the slow reply. https://github.com/mtth/hdfs/pull/80 introduced HA support, so I'd like to understand why it doesn't work here. Are you emitting a single large request that spans both restarts?

If that's the case, it makes sense that the overall request fails when the second server restarts, since the first server had already been tried for the same request. Supporting such long requests would require care to avoid infinite loops. I'm inclined to think that such sequences of events are rare enough that retrying them transparently is not worth the added complexity.
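
To illustrate the idea (this is just a sketch of the behaviour, not the library's actual code): each configured URL is tried at most once per logical request, so a request that outlives both namenodes exhausts the list instead of looping forever.

    import requests

    def request_with_failover(method, path, urls, **kwargs):
        # Illustrative only: try each namenode URL once for a single logical request.
        last_error = None
        for url in urls:  # e.g. ['http://namenode001:50070', 'http://namenode002:50070']
            try:
                return requests.request(method, url + path, **kwargs)
            except requests.exceptions.ConnectionError as err:
                last_error = err  # remember the failure and move on to the next host
        # Every host has now failed for this request; give up rather than retry forever.
        raise last_error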

skyegecko commented 4 years ago

Yes, in this case the request spanned both restarts (or almost: the request started while NN01 was down, and NN02 went down during the request).

The request is long-running because it is a streaming operation: files are streamed from HDFS, operations are carried out on the stream, and the result is streamed back to a new file on HDFS.
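
Roughly like this (paths and the transform step are placeholders, and `client` is an HdfsCli client configured with both namenode URLs):

    # Sketch of the streaming pattern in use; the connection stays open for the
    # whole transfer, which is why a single request can span both restarts.
    with client.read('/data/input.bin') as reader, \
            client.write('/data/output.bin') as writer:
        for chunk in iter(lambda: reader.read(65536), b''):
            writer.write(transform(chunk))  # hypothetical per-chunk processing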

I respect your decision if you regard this as WONTFIX due to complexity; however, in my opinion the current behaviour is not as robust as the HA mechanisms used by other Hadoop tools (for example, a Spark job running at the same time handled the double failover gracefully). I agree that rebooting both namenodes is a rare occurrence; however, part of the appeal of a system like Hadoop is that it should always be "safe" to do so (put another way, I had to deal with a grumpy sysadmin who wondered why he had to manually restart tasks that shouldn't have failed during the reboot he initiated 😉).