The script in question is located at: gs://danking/1_Generate_Variant_Stats_NVXvC_v1.py. Ping me if you need access.
Whatever is failing here is likely different from the interval pipeline failures seen in https://github.com/hail-is/hail/issues/13748 and related tickets, because the GVS team has confirmed that 0.2.126 reduces peak RAM usage from >50 GB to 11 GB.
Watching - AoU is seeing this error with 0.2.126
@daniel-goldstein, here's a much smaller test case, though the VDS is quite large. You might try this on gs://neale-bge/bge-wave-1.vds to see if we can replicate.
import hail as hl

hl.init(default_reference='GRCh38', idempotent=True)

vds = hl.vds.read_vds("gs://...")
test_intervals = ['chr13:32355250-32355251']
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x) for x in test_intervals])
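For what it's worth, the snippet above is lazy as written, so a repro presumably needs some action at the end to force execution. A hedged sketch of one possible trigger (the actual failing action in the full pipeline is assumed, not known):

# Hypothetical final step to force execution of the filtered VDS above;
# any cheap action over the variant data should do.
print(vds.variant_data.count_rows())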
Log statements like
java.net.UnknownHostException: Invalid host name: local host is: "...-w-1/10.128.0.50"; destination host is: "...-m.c.terra-vpc-sc-...internal.":8051
suggest to me that the master node is dying, then it's removed from DNS, and then workers are unable to communicate with it.
This is a lot simpler, I'll try to run this tomorrow morning.
Cannot reproduce in Dataproc or locally. @daniel-goldstein will contact Yonghao to set up a debugging Zoom call where we can sort this out. Use the Slack thread to coordinate.
An update.
I'm working with debugging info from the AoU VDS creation cluster. A VDS creation was run using an n1-highmem-8 driver. The cluster was created by hailctl with no custom driver settings.
I have the driver node syslogs as well as the Hail log file. For some reason, all logs other than the Hail log are missing from this file; we separately need to determine why the Spark logs and the rest are missing.
Based on the syslog, after system start-up and just before the Jupyter notebook starts, the system is already using ~8,500 MiB:
Nov 22 14:29:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 43808 of 52223 MiB (83.89%), swap free: 0 of 0 MiB ( 0.00%)
So, the effective maximum memory that Hail could possibly use is around 43,808 MiB.
After the Notebook and Spark initialize, we're down to 42,760 MiB available (about 1,000 MiB more in use):
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%)
hailctl sets the JVM's maximum heap to 80% of the instance type's advertised memory, so 80% * 52 GiB = 42,598 MiB. This means the JVM alone is permitted to use effectively all of the remaining memory.
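To spell out how tight that is, here is a small arithmetic sketch using the figures above (the 0.8 fraction is hailctl's default --master-memory-fraction; the "mem avail" value is from the earlyoom line after Spark and Jupyter start):

# How little slack the default settings leave on this n1-highmem-8 driver.
advertised_mib = 52 * 1024                      # 53,248 MiB advertised
jvm_heap_cap_mib = int(advertised_mib * 0.8)    # 42,598 MiB
avail_after_startup_mib = 42_760                # earlyoom "mem avail" once Spark/Jupyter are up
print(avail_after_startup_mib - jvm_heap_cap_mib)
# 162 MiB left for Python and everything else if the heap grows to its cap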
At the time of the SIGKILL, the total memory allocated by the JVM was about 2,000 MiB below the max heap size. Note that the heap is contained within the total memory allocated by the JVM.
Nov 22 15:31:05 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 43 of 52223 MiB ( 0.08%), swap free: 0 of 0 MiB ( 0.00%)
Nov 22 15:31:09 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: low memory! at or below SIGTERM limits: mem 0.12%, swap 1.00%
Nov 22 15:31:09 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM to process 8421 uid 0 "java": badness 1852, VmRSS 40578 MiB
Indeed, VmRSS is the memory in use from the kernel's perspective, so it includes any off-heap memory allocated by Hail. The Hail log indicates the region pools are tiny (tens of MiB), so they are not a concern.
After the JVM is killed, available memory jumps back up to 40,683 MiB (which checks out: that's roughly what the killed process was using).
Nov 22 15:31:10 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 40683 of 52223 MiB (77.90%), swap free: 0 of 0 MiB ( 0.00%)
We ran the same pipeline with an n1-highmem-16 driver node and it made it through 50 sample groups (each sample group has ~4000 samples) before crashing. Unfortunately, we do not have the syslogs from this run. We also do not have the Hail log from this run. We do have the stdout/stderr from the Python process. There's not much of value there. The Python process exited with code 256. That doesn't make a lot of sense to me because exit codes should be an unsigned 8-bit integer.
On a highmem-16, total advertised RAM is 106,496 MiB. Hail's JVM is permitted to use up to 85,197 MiB. We established above that the system uses about 9,500 MiB (unclear whether it would use more on a larger VM). That leaves 11,799 MiB for the Python process. That seems extremely generous, but apparently not?
Crossposting from a message I sent to the variants team.
Excess JVM memory use is almost certainly not the issue. I've taken a close look at the import_gvs.py loop and the related Hail Python code; there is no obvious accumulation of RAM use. AFAICT, the oomkiller keeps killing the pipelines. We need to stop this because the oomkiller (a) acts before the JVM GC can free things and (b) prevents us from getting JVM diagnostics on failure. We control the JVM's max heap with hailctl's --master-memory-fraction (the default is 0.8, i.e. 80% of the master machine type's advertised RAM). I suggest we set this down to 0.6 and continue using an n1-highmem-16 driver. If Hail is (incorrectly) accumulating garbage memory per group, we'll have a better chance of diagnosing that with a running JVM instead of one that's been SIGKILL'ed.
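To put rough numbers on the suggestion, here is the corresponding budget arithmetic for an n1-highmem-16, assuming the system daemons use about the same ~9.5 GiB we measured on the highmem-8:

# Rough non-JVM headroom on an n1-highmem-16 driver at two master-memory-fractions.
advertised_mib = 104 * 1024       # 106,496 MiB advertised
daemons_mib = 9_500               # assumed: same system-daemon usage as the highmem-8
for fraction in (0.8, 0.6):
    heap_mib = int(advertised_mib * fraction)
    headroom_mib = advertised_mib - heap_mib - daemons_mib
    print(f"fraction={fraction}: heap={heap_mib} MiB, headroom={headroom_mib} MiB")
# fraction=0.8: heap=85196 MiB, headroom=11800 MiB  (rounded to 85,197 / 11,799 above)
# fraction=0.6: heap=63897 MiB, headroom=33099 MiB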
To understand what's going on, we need to see what is using RAM in the n1-highmem-16 case. If I could SSH to the cluster, a simple solution would be to run top -s 300 -n 100 >memory.log in a screen session (I'd guess no more than 500 KiB of logs per hour) and retrieve that file if the cluster fails. If we could get Google Cloud Monitoring set up to retrieve process-level memory statistics from the driver node, that should also work.
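If SSH plus screen turns out to be awkward, a rough stdlib-only alternative is sketched below (untested; assumes a Linux driver node and that someone launches it alongside the pipeline). It just logs the largest resident set sizes every five minutes so we can see what was using RAM right before any OOM kill:

# Periodically log the ten largest processes by resident set size on the driver node.
import os, time

def rss_by_process():
    rows = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/status') as f:
                fields = dict(line.split(':', 1) for line in f if ':' in line)
            rss_mib = int(fields.get('VmRSS', '0 kB').split()[0]) // 1024
            rows.append((rss_mib, fields['Name'].strip(), pid))
        except (OSError, KeyError, ValueError):
            pass  # process exited (or was unreadable) between listing and reading
    return sorted(rows, reverse=True)

with open('memory.log', 'a') as log:
    while True:
        log.write(time.strftime('%F %T') + ' ' + repr(rss_by_process()[:10]) + '\n')
        log.flush()
        time.sleep(300)  # every five minutes, roughly like `top -s 300`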
Just to be clear, I don't anticipate any changes to Hail in the next week that would change the memory use of this pipeline. There could be a memory leak, but I have no clues that lead to one. I realize this is an unsatisfying answer; I'm pretty perplexed as to what the issue could be here.
We'll call the second-most-recent run Run A and the most recent run Run B. Run A (like all runs before it) only made it through two sample groups before failing. Run B made it through 50 groups before failing on group 51.
Why did they fail? The syslog for Run A is clear: the oomkiller killed Run A. We lack syslogs for Run B, so we cannot be certain, but the lack of a JVM stack trace suggests to me that (a) the driver failed and (b) the driver was killed by the system. Let's focus on the driver machines. In Run A, we used an n1-highmem-8, which is advertised to have 52 GiB (53,248 MiB). In Run B, we used an n1-highmem-16, which is advertised to have 104 GiB (106,496 MiB). hailctl sets the JVM max heap size to 80% of the advertised RAM, so 42,598 MiB for Run A (see hailctl's --master-memory-fraction). In Run A (the only run for which we have syslogs), the driver's syslog shows that, before Spark starts, the system has already allocated 8,500 MiB to Linux/Google/Dataproc daemons. Moreover, the actual RAM of the system (as reported by the earlyoom daemon) is 52,223 MiB (51 GiB, 1 GiB less than Google advertises for an n1-highmem-8). Assuming these daemons never release their memory, all our user code must fit in 43,723 MiB. Since the JVM's max heap is 42,598 MiB, Python (and indeed anything else on the system) is limited to allocating 1,125 MiB.
I assume that an n1-highmem-16 uses the same amount of memory for system daemons, so I'd expect just over 10 GiB that is used by neither the system daemons nor the JVM. Assuming that's right, I can't explain why the oomkiller killed the JVM in Run B.
I'm fairly certain I now understand this and the AoU VDS creation issue.
In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced "memory protection" which is a euphemism for a newly aggressive OOMKiller. When the OOMKiller kills the JVM driver process, there is no hs_err_pid...log file, no exceptional log statements, and no clean shutdown of any sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.
From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, hailctl dataproc, by default, reserves 80% of the advertised memory of the driver node for the use of the Hail Query Driver JVM process. For example, Google advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the spark:spark.driver.memory property to 41g (we always round down). Before aggressive memory protection, this setting was sufficient to protect the driver from starving itself of memory.
Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled "memory protection". Moreover, in the years since Hail 0.2.15, the memory in use by system processes on Dataproc driver nodes appears to have increased. Due to these two circumstances, the driver VM's memory usage can grow high enough to trigger the OOMKiller before the JVM triggers a GC. Consider, for example, these slices of the syslog of the n1-highmem-8 driver VM of a Dataproc cluster:
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total: 0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <= 0.12% and swap <= 1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: SIGKILL when mem <= 0.06% and swap <= 0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free: 0 of 0 MiB ( 0.00%)
Notice: python3 -c 'import hail' needs 206 MiB.

We must address this situation. It seems safe to assume that the system daemons will use a constant 9.5 GiB of RAM. Moreover, the advertised RAM amount is at least 1 GiB larger than reality. I propose that hailctl dataproc should take the advertised RAM amount, subtract 10.5 GiB, and then use 90% of the remaining value. For an n1-highmem-8, that reduces our allocation from 41 GiB to 37 GiB, yielding an additional 4 GiB for Python and for daemon memory fluctuations.

For what it's worth, I think the reason we didn't get an outcry from our local scientific community is that many of them have transitioned to Query-on-Batch, where we have exact and total control over the memory available to the driver and the workers.
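A quick sketch of the proposed sizing rule (numbers taken from this comment; nothing here is implemented yet):

# Proposed driver heap rule: (advertised RAM - 10.5 GiB) * 0.9, rounded down.
advertised_gib = 52                                      # n1-highmem-8
current_heap_gib = int(advertised_gib * 0.8)             # 41, today's default
proposed_heap_gib = int((advertised_gib - 10.5) * 0.9)   # (52 - 10.5) * 0.9 = 37.35 -> 37
print(current_heap_gib, proposed_heap_gib, current_heap_gib - proposed_heap_gib)
# 41 37 4 -> roughly 4 GiB returned to Python and daemon fluctuations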
This ticket is complete when:
I'll leave this assigned to you, Daniel, to finalize the conversation with Yonghao.
The AoU Researcher Workbench issue was, in the end, only superficially related to this issue. There is more explanation at https://github.com/DataBiosphere/leonardo/pull/4034 and in these Slack threads:
What happened?
No Hail log file is available.
Version
0.2.126
Relevant log output