hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[query] In 0.2.126, the Dataproc OOMKiller kills the Hail driver process unexpectedly. #13960

Closed. danking closed this issue 10 months ago.

danking commented 1 year ago

What happened?

No hail log file is available.

- On 0.2.109: 5k samples and 8 interval lists -- WORKED
- On 0.2.109: 5k samples and 1 interval list -- WORKED
- On 0.2.120: 2k samples and 1 interval list -- WORKED
- On 0.2.120: 2k samples and 2 interval lists -- WORKED
- On 0.2.120: 2k samples and 4 interval lists -- ERROR
- On 0.2.120: 2k samples and 8 interval lists -- ERROR

All of these runs used a driver with 96 CPUs / 684 GB RAM and workers with 4 CPUs / 8 GB RAM; the Spark configuration allocated 512 GB to the driver.

I have tried the above in various configurations. Maybe a specific interval list is problematic, but that does not seem to be the case.

The interval lists are the same across runs.

Lastly, the error is the usual Py4J error. I usually address this with more driver RAM, but I can't go any higher, and this used to work fine in Hail 0.2.109.

I tried downgrading from 0.2.120 to 0.2.109, but I don't believe that I can in Terra, due to Spark incompatibilities.

filtered_mt is a MatrixTable that has already been split and filtered (to drop irrelevant variants). By the time the following code blocks are run, filtered_mt = hl.read_matrix_table(filtered_mt_url) has been executed. Some more information: the code after this (not shown in the code blocks below) does additional filtering. If I skip the step variant_data.export(f"{variant_stat_file_path_stem}_FULL.tsv"), the run completes successfully. The issue is that we need the *_FULL.tsv output. So I believe this is likely a RAM issue on the driver, but this used to work.

variant_mt = generate_variant_stats(filtered_mt, interval_names, interval_table_dict)

# Main loop to compute variant stats and save to files

# File path stem to use for saving variant stats over different interval lists
variant_stat_file_path_stem = f"{bucket}/batchE/{workflow_nickname}/variant_stats"

variant_data = variant_mt.cols()
variant_data.describe()
#variant_data.to_pandas().to_csv(f"{variant_stat_file_path_stem}_FULL.tsv", sep='\t', index=False)
variant_data.export(f"{variant_stat_file_path_stem}_FULL.tsv")

def generate_variant_stats(mt, interval_names, interval_table_dict):
    mt = mt.annotate_rows(**interval_table_dict)
    annotate_dict = {interval_name + '_result': hl.agg.filter(mt[interval_name],
    hl.struct(
        ### SNP counts
        SNP_Ti_count_Het=hl.agg.count_where(
            (hl.is_transition(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_het_ref())
        ),
        SNP_Tv_count_Het=hl.agg.count_where(
            (hl.is_transversion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_het_ref())
        ),
        SNP_Ti_count_Hom=hl.agg.count_where(
            (hl.is_transition(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_hom_var())
        ),
        SNP_Tv_count_Hom=hl.agg.count_where(
            (hl.is_transversion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_hom_var())
        ),   
        ### Indel counts
        INDEL_Ins_count_Het=hl.agg.count_where(
            (hl.is_insertion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_het_ref())
        ),
        INDEL_Del_count_Het=hl.agg.count_where(
            (hl.is_deletion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_het_ref())
        ),
        INDEL_Ins_count_Hom=hl.agg.count_where(
            (hl.is_insertion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_hom_var())
        ),
        INDEL_Del_count_Hom=hl.agg.count_where(
            (hl.is_deletion(mt.alleles[0], mt.alleles[1])) & (mt.GT.is_hom_var())
        ),
    )) for interval_name in interval_names}

    mt2 = mt.annotate_cols(**annotate_dict)
    return mt2
interval_table_dict = dict(
    zip(interval_names, [hl.is_defined(interval_table[filtered_mt.locus]) for interval_table in interval_tables])
)
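
The construction of interval_names and interval_tables isn't shown in the post. A purely hypothetical sketch of how such objects might be built upstream (all paths and names below are invented for illustration):

```python
import hail as hl

# Hypothetical upstream construction of the interval lists; the real paths and
# names are not shown in the original post.
interval_files = {
    'exome': 'gs://my-bucket/intervals/exome.interval_list',                # invented path
    'padded_exome': 'gs://my-bucket/intervals/exome_padded.interval_list',  # invented path
}

interval_names = list(interval_files.keys())
# Each Table is keyed by interval, so hl.is_defined(interval_table[filtered_mt.locus])
# is True exactly when a locus falls inside that interval list.
interval_tables = [
    hl.import_locus_intervals(path, reference_genome='GRCh38')
    for path in interval_files.values()
]
```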

Version

0.2.126

Relevant log output

---------------------------------------------------------------------------
RemoteDisconnected                        Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File /opt/conda/lib/python3.10/http/client.py:1375, in HTTPConnection.getresponse(self)
   1374 try:
-> 1375     response.begin()
   1376 except ConnectionError:

File /opt/conda/lib/python3.10/http/client.py:318, in HTTPResponse.begin(self)
    317 while True:
--> 318     version, status, reason = self._read_status()
    319     if status != CONTINUE:

File /opt/conda/lib/python3.10/http/client.py:287, in HTTPResponse._read_status(self)
    284 if not line:
    285     # Presumably, the server closed the connection before
    286     # sending a valid response.
--> 287     raise RemoteDisconnected("Remote end closed connection without"
    288                              " response")
    289 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/requests/adapters.py:487, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    486 try:
--> 487     resp = conn.urlopen(
    488         method=request.method,
    489         url=url,
    490         body=request.body,
    491         headers=request.headers,
    492         redirect=False,
    493         assert_same_host=False,
    494         preload_content=False,
    495         decode_content=False,
    496         retries=self.max_retries,
    497         timeout=timeout,
    498         chunked=chunked,
    499     )
    501 except (ProtocolError, OSError) as err:

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    785     e = ProtocolError("Connection aborted.", e)
--> 787 retries = retries.increment(
    788     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    789 )
    790 retries.sleep()

File /opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File /opt/conda/lib/python3.10/site-packages/urllib3/packages/six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File /opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File /opt/conda/lib/python3.10/http/client.py:1375, in HTTPConnection.getresponse(self)
   1374 try:
-> 1375     response.begin()
   1376 except ConnectionError:

File /opt/conda/lib/python3.10/http/client.py:318, in HTTPResponse.begin(self)
    317 while True:
--> 318     version, status, reason = self._read_status()
    319     if status != CONTINUE:

File /opt/conda/lib/python3.10/http/client.py:287, in HTTPResponse._read_status(self)
    284 if not line:
    285     # Presumably, the server closed the connection before
    286     # sending a valid response.
--> 287     raise RemoteDisconnected("Remote end closed connection without"
    288                              " response")
    289 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File <timed exec>:9

File <decorator-gen-1235>:2, in export(self, output, types_file, header, parallel, delimiter)

File ~/.local/lib/python3.10/site-packages/hail/typecheck/check.py:587, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    584 @decorator
    585 def wrapper(__original_func: Callable[..., T], *args, **kwargs) -> T:
    586     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 587     return __original_func(*args_, **kwargs_)

File ~/.local/lib/python3.10/site-packages/hail/table.py:1153, in Table.export(self, output, types_file, header, parallel, delimiter)
   1150 hl.current_backend().validate_file(output)
   1152 parallel = ir.ExportType.default(parallel)
-> 1153 Env.backend().execute(
   1154     ir.TableWrite(self._tir, ir.TableTextWriter(output, types_file, header, parallel, delimiter)))

File ~/.local/lib/python3.10/site-packages/hail/backend/backend.py:178, in Backend.execute(self, ir, timed)
    176 payload = ExecutePayload(self._render_ir(ir), '{"name":"StreamBufferSpec"}', timed)
    177 try:
--> 178     result, timings = self._rpc(ActionTag.EXECUTE, payload)
    179 except FatalError as e:
    180     raise e.maybe_user_error(ir) from None

File ~/.local/lib/python3.10/site-packages/hail/backend/py4j_backend.py:210, in Py4JBackend._rpc(self, action, payload)
    208 path = action_routes[action]
    209 port = self._backend_server_port
--> 210 resp = self._requests_session.post(f'http://localhost:{port}{path}', data=data)
    211 if resp.status_code >= 400:
    212     error_json = orjson.loads(resp.content)

File /opt/conda/lib/python3.10/site-packages/requests/sessions.py:635, in Session.post(self, url, data, json, **kwargs)
    624 def post(self, url, data=None, json=None, **kwargs):
    625     r"""Sends a POST request. Returns :class:`Response` object.
    626 
    627     :param url: URL for the new :class:`Request` object.
   (...)
    632     :rtype: requests.Response
    633     """
--> 635     return self.request("POST", url, data=data, json=json, **kwargs)

File /opt/conda/lib/python3.10/site-packages/requests/sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    582 send_kwargs = {
    583     "timeout": timeout,
    584     "allow_redirects": allow_redirects,
    585 }
    586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
    589 return resp

File /opt/conda/lib/python3.10/site-packages/requests/sessions.py:701, in Session.send(self, request, **kwargs)
    698 start = preferred_clock()
    700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
    703 # Total elapsed time of the request (approximately)
    704 elapsed = preferred_clock() - start

File /opt/conda/lib/python3.10/site-packages/requests/adapters.py:502, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    487     resp = conn.urlopen(
    488         method=request.method,
    489         url=url,
   (...)
    498         chunked=chunked,
    499     )
    501 except (ProtocolError, OSError) as err:
--> 502     raise ConnectionError(err, request=request)
    504 except MaxRetryError as e:
    505     if isinstance(e.reason, ConnectTimeoutError):
    506         # TODO: Remove this in 3.0.0: see #2811

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
danking commented 12 months ago

The script in question is located at: gs://danking/1_Generate_Variant_Stats_NVXvC_v1.py . Ping me if you need access.

danking commented 12 months ago

Whatever is failing here is likely different from the interval pipeline failures seen in https://github.com/hail-is/hail/issues/13748 and related tickets, because the GVS team has confirmed that 0.2.126 reduces peak RAM usage from >50GB to 11GB.

yonghaoy commented 11 months ago

Watching - AoU is seeing this error with 0.2.126

danking commented 11 months ago

@daniel-goldstein, here's a much smaller test-case, though the VDS is quite large. You might try this on gs://neale-bge/bge-wave-1.vds to see if we can replicate.

import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)
vds = hl.vds.read_vds("gs://...")
test_intervals = ['chr13:32355250-32355251']
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x,)
     for x in test_intervals])

Log statements like

java.net.UnknownHostException: Invalid host name: local host is: "...-w-1/10.128.0.50"; destination host is: "...-m.c.terra-vpc-sc-...internal.":8051

suggest to me that the master node is dying, then it's removed from DNS, and then workers are unable to communicate with it.

daniel-goldstein commented 11 months ago

This is a lot simpler; I'll try to run it tomorrow morning.

danking commented 11 months ago

Cannot reproduce in Dataproc or locally. @daniel-goldstein will contact Yonghao to set up a debug Zoom where we can sort this out. Use the Slack thread to make contact.

danking commented 11 months ago

An update.

I'm working with debugging info from the AoU VDS creation cluster. A VDS creation was run using an n1-highmem-8 driver. The cluster was created by hailctl with no custom driver settings.

Template for `hailctl dataproc start` ([Source](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/wdl/extract/run_in_hail_cluster.py#L36C1-L48C1)):

```
hailctl dataproc start \
    --autoscaling-policy={autoscaling_policy} \
    --worker-machine-type {worker_machine_type} \
    --region {region} \
    --project {gcs_project} \
    --service-account {account} \
    --num-master-local-ssds 1 \
    --num-worker-local-ssds 1 \
    --max-idle=60m \
    --max-age=1440m \
    --subnet=projects/{gcs_project}/regions/{region}/subnetworks/subnetwork \
    {cluster_name}
```

I have the driver node syslogs as well as the Hail log file. For some reason all logs other than the Hail logs are missing from this file. We separately need to determine why all the Spark logs etc. are missing.

Based on the syslog, after system start-up and just before the Jupyter notebook starts, the system is already using ~8,500 MiB:

Nov 22 14:29:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 43808 of 52223 MiB (83.89%), swap free:    0 of    0 MiB ( 0.00%)

So, the effective maximum memory that Hail could possibly use is around 43,808 MiB.

After the notebook and Spark initialize, we're down to 42,760 MiB (about 1,000 MiB more in use).

Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free:    0 of    0 MiB ( 0.00%)

hailctl caps the JVM heap at 80% of the instance type's advertised memory, so 80% of 52 GiB = 42,598 MiB. This means the JVM is permitted to use effectively all of the remaining memory.

At the time of the SIGKILL, the total memory allocated by the JVM was about 2,000 MiB below the max heap size. Note that the heap is contained within the total memory allocated by the JVM.

Nov 22 15:31:05 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail:    43 of 52223 MiB ( 0.08%), swap free:    0 of    0 MiB ( 0.00%)
Nov 22 15:31:09 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: low memory! at or below SIGTERM limits: mem  0.12%, swap  1.00%
Nov 22 15:31:09 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM to process 8421 uid 0 "java": badness 1852, VmRSS 40578 MiB

Indeed, VmRSS is the memory in use from the kernel's perspective, so it includes any off-heap memory created by Hail. The Hail log indicates the region pools are tiny, ~10s of MiB. Not a concern.

After the JVM is killed, available memory jumps back up to 40,683 MiB (which checks out: that's roughly what the killed process was using).

Nov 22 15:31:10 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 40683 of 52223 MiB (77.90%), swap free:    0 of    0 MiB ( 0.00%)
danking commented 11 months ago

We ran the same pipeline with an n1-highmem-16 driver node and it made it through 50 sample groups (each sample group has ~4000 samples) before crashing. Unfortunately, we do not have the syslogs from this run. We also do not have the Hail log from this run. We do have the stdout/stderr from the Python process. There's not much of value there. The Python process exited with code 256. That doesn't make a lot of sense to me because exit codes should be an unsigned 8-bit integer.

On a highmem-16, total RAM is 106,496 MiB. Hail's JVM will use 85,197 MiB. We established above that the system uses about 9,500 MiB (unclear if it would use more on a larger VM). This all leaves 11,799 MiB for the Python process. That seems extremely generous, but apparently not?
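
As a sanity check, the arithmetic behind that figure, written out as a small sketch (the ~9,500 MiB system figure is carried over from the n1-highmem-8 measurements above):

```python
# Memory budget for the n1-highmem-16 driver, using the figures quoted above.
total_ram = 106_496                    # MiB advertised for n1-highmem-16 (104 GiB)
jvm_max_heap = round(0.8 * total_ram)  # 85,197 MiB for Hail's JVM
system_daemons = 9_500                 # MiB, measured on the n1-highmem-8 driver
python_headroom = total_ram - jvm_max_heap - system_daemons
print(python_headroom)                 # 11,799 MiB left for the Python process
```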

danking commented 11 months ago

crossposting from a message I sent to the variants team.


executive summary

Excess JVM memory use is almost certainly not the issue. I've taken a close look at the import_gvs.py loop and the related Hail Python code. No obvious accumulation of RAM use. AFAICT, the oomkiller keeps killing the pipelines. We need to stop this because the oomkiller (a) acts before the JVM GC can free things and (b) prevents us from getting JVM diagnostics on failure. We control the JVM's max heap with hailctl's --master-memory-fraction (default is 0.8 for 80% of the master machine type's advertised RAM). I suggest we set this down to 0.6 and continue using an n1-highmem-16 driver. If Hail is (incorrectly) accumulating garbage memory per-group, we'll have a better chance diagnosing that with a running JVM instead of one that's been SIGKILL'ed.
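
For a rough sense of what lowering the fraction to 0.6 buys on an n1-highmem-16, here is a small sketch, assuming the system daemons use the same ~9.5 GiB measured on the n1-highmem-8 driver:

```python
# Effect of lowering hailctl's --master-memory-fraction from 0.8 to 0.6 on an
# n1-highmem-16 driver (104 GiB advertised; ~9.5 GiB assumed for system daemons).
advertised_mib = 106_496
daemons_mib = 9_500  # assumption carried over from the n1-highmem-8 syslog

for fraction in (0.8, 0.6):
    heap_mib = int(fraction * advertised_mib)
    headroom_mib = advertised_mib - daemons_mib - heap_mib
    print(f"fraction={fraction}: heap {heap_mib} MiB, non-JVM headroom {headroom_mib} MiB")
# fraction=0.8: heap 85196 MiB, non-JVM headroom 11800 MiB
# fraction=0.6: heap 63897 MiB, non-JVM headroom 33099 MiB
```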

To understand what's going on, we gotta see what is using RAM in the n1-highmem-16 case. If I could SSH to the cluster, a simple solution would be to run top -s 300 -n 100 >memory.log in a screen session (I'd guess no more than 500 KiB per hour of logs) and retrieve that file if the cluster fails. If we could get Google Monitoring set up to retrieve process-level memory statistics from the driver node, that should also work.

Just to be clear, I don't anticipate any changes to Hail in the next week that would change the memory use of this pipeline. There could be a memory leak, but I have no clews that lead to it. I realize this is an unsatisfying answer. I'm pretty perplexed as to what could be the issue here.

technical details

We'll call the second-most-recent run Run A and the most recent run Run B. Run A (like all runs before it) only managed two sample groups before failing. Run B made it through 50 groups before failing on 51.

Why did they fail? The syslog for Run A is clear: the oomkiller killed Run A. We lack syslogs for Run B, so we cannot be certain, but the lack of a JVM stack trace suggests to me that (a) the driver failed and (b) the driver was killed by the system.

Let's focus on the driver machines. In Run A, we used an n1-highmem-8, which is advertised to have 52 GiB (53,248 MiB). In Run B, we used an n1-highmem-16, which is advertised to have 104 GiB (106,496 MiB). hailctl sets the JVM max heap size to 80% of the advertised RAM, so 42,598 MiB in Run A (see hailctl's --master-memory-fraction).

In Run A (the only run for which we have syslogs), based on the driver's syslog, before Spark starts the system has already allocated 8,500 MiB to Linux/Google/Dataproc daemons. Moreover, the actual RAM of the system (as reported by the earlyoom daemon) is 52,223 MiB (51 GiB, 1 GiB less than Google advertises for n1-highmem-8). Assuming these daemons never release their memory, all our user code must fit in 43,723 MiB. Since the JVM's max heap is 42,598 MiB, Python (and indeed, anything else on the system) is limited to allocating 1,125 MiB.
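
The Run A budget above, laid out as a small calculation:

```python
# Run A: n1-highmem-8 driver memory budget, using the syslog figures above.
actual_ram = 52_223      # MiB reported by earlyoom (vs. 53,248 MiB advertised)
daemons = 8_500          # MiB already allocated before Spark starts
jvm_max_heap = 42_598    # MiB: 80% of the advertised 52 GiB

user_budget = actual_ram - daemons           # 43,723 MiB for all user code
python_budget = user_budget - jvm_max_heap   # 1,125 MiB for Python and everything else
print(user_budget, python_budget)
```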

I assume that an n1-highmem-16 uses the same amount of memory for system daemons, so I'd expect just over ten GiB that is used neither by system daemons nor the JVM. Assuming that's right, I can't explain why the oomkiller killed the JVM in Run B.

danking commented 11 months ago

I'm fairly certain I now understand this and the AoU VDS creation issue.

In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced "memory protection", which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.

From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, hailctl dataproc, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the spark:spark.driver.memory property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory.
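
A minimal sketch of that heap-size calculation, assuming the round-down-to-whole-GiB behavior described above (not checked against the actual hailctl source):

```python
import math

def driver_memory_setting(advertised_gib: int, fraction: float = 0.8) -> str:
    # Round down to whole GiB, per "we always round down" above.
    return f"{math.floor(advertised_gib * fraction)}g"

print(driver_memory_setting(52))  # n1-highmem-8 -> "41g"
```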

Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2, which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster:

Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total:    0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <=  0.12% and swap <=  1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]:         SIGKILL when mem <=  0.06% and swap <=  0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free:    0 of    0 MiB ( 0.00%)

Notice:

  1. The total memory available on the machine is less than 52 GiB (= 53,248 MiB); indeed, it is a full 1,025 MiB below the advertised amount.
  2. Once all the components of the Dataproc cluster have started (but before any Hail Query jobs are submitted), the total memory available is already depleted to 42,760 MiB. Recall that Hail allocates 41 GiB (= 41,984 MiB) to its JVM. This leaves the Python process and all other daemons on the system only 776 MiB of excess RAM. For reference, python3 -c 'import hail' needs 206 MiB.

We must address this situation. It seems safe to assume that the system daemons will use a constant 9.5 GiB of RAM. Moreover, the advertised RAM amount is at least 1 GiB larger than reality. I propose:

  1. The driver memory calculation in hailctl dataproc should take the advertised RAM amount, subtract 10.5 GiB, and then use 90% of the remaining value. For an n1-highmem-8, that reduces our allocation from 41 GiB to 37 GiB, yielding an additional 4 GiB for Python and daemon memory fluctuations (see the sketch after this list).
  2. AoU RWB needs to review its memory settings for Spark driver nodes to ensure that the JVM is set to an appropriate maximum heap size.
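
A sketch of the calculation proposed in item 1 (the function name below is illustrative, not actual hailctl code):

```python
import math

def proposed_driver_heap_gib(advertised_gib: float) -> int:
    # Proposed: subtract 10.5 GiB for system daemons and advertising slack,
    # then give the JVM 90% of what remains.
    return math.floor((advertised_gib - 10.5) * 0.9)

print(proposed_driver_heap_gib(52))   # n1-highmem-8:  37 GiB (down from 41 GiB)
print(proposed_driver_heap_gib(104))  # n1-highmem-16: 84 GiB
```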

For what it's worth, I think the reason we didn't get an outcry from our local scientific community is that many of them have transitioned to Query-on-Batch, where we have exact and total control over the memory available to the driver and the workers.

danking commented 11 months ago

This ticket is complete when:

danking commented 11 months ago

I'll leave this assigned to you, Daniel, to finalize the conversation with Yonghao.

danking commented 10 months ago

The AoU Researcher Workbench issue was, in the end, only superficially related to this issue. There is more explanation at https://github.com/DataBiosphere/leonardo/pull/4034 and in these Slack threads:

  1. https://broadinstitute.slack.com/archives/CNK2A1QP3/p1699455847901999
  2. https://broadinstitute.slack.com/archives/CNK2A1QP3/p1701978402635369