skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Storage: Goofys OOM-killed when mounting/using a big bucket #706

Closed concretevitamin closed 9 months ago

concretevitamin commented 2 years ago

User report:

I was trying out the s3 file system mount. With some config like:

file_mounts:
  /data/xxx:
    source: s3://some-giant-bucket
    mode: MOUNT

This worked OK initially and I was able to start a training job and run initialization of the dataloader. However, halfway through I started getting errors like

OSError: [Errno 107] Transport endpoint is not connected: '/data/xxx/....json'

And if I check the mounted folder it’s gone (also gone in df -h). When I first started the instance it was OK though.

From syslog, it looks like goofys ran out of memory:

Apr  4 12:03:28 ip-172-16-29-213 kernel: [ 2497.250427] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-11.scope,task=goofys,pid=26769,uid=1000
Apr  4 12:03:28 ip-172-16-29-213 kernel: [ 2497.250468] Out of memory: Killed process 26769 (goofys) total-vm:9484132kB, anon-rss:9350624kB, file-rss:0kB, shmem-rss:0kB, UI

Currently using a g4dn.xlarge.

Usage pattern

TODO: reproduce, investigate, mitigate?

romilbhardwaj commented 2 years ago

EDIT - see comments below.

Initial investigation:

TL;DR

☠☠☠ Recursive ls operations on big buckets kill goofys ☠☠☠

✅✅✅ goofys is fine when ls operations are done on small dirs. Using generators to recursively go through files, even in big buckets, is OK. ✅✅✅

Setup

Test bucket

s3://fah-public-data-covid19-cryptic-pockets seems like a good candidate - ~I don't know how big it is because ls takes a very long time (good sign)~ The website says it's 7 TB. It has a 4-5 level deep hierarchy of folders with many 5 MB files.

Cluster YAML

name: jupyter

resources:
  cloud: aws

file_mounts:
  /covid:
    source: s3://fah-public-data-covid19-cryptic-pockets
    mode: MOUNT

setup: |
  pip install --upgrade pip
  conda init bash
  conda create -n jupyter python=3.9 -y
  conda activate jupyter
  pip install jupyter

run: |
  conda activate jupyter
  jupyter notebook --port 8888

Investigation

Wrote some code to parallel read files and print size:

# Read all files in a directory recursively in parallel

import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

CAUSE_CRASH = False

def sizeof_fmt(num, suffix="B"):
    for unit in ["", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"]:
        if abs(num) < 1024.0:
            return f"{num:3.1f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f}Yi{suffix}"

def read_file(file):
    print(file)
    with open(file, 'rb') as f:
        x=f.read()
        #print("File: " + file)
        #print("Size of file is :", sizeof_fmt(int(f.tell())))

def read_parallel(path):
    files = []
    print(f"Reading files in parallel {path}")
    if CAUSE_CRASH:
        ls_done = 0
        for root, dirs, files in os.walk(path):
            files = [os.path.join(root, file) for file in files]
            ls_done += 1
            if ls_done % 10 == 0:
                print("ls done: ", ls_done)
        print("Got files: ", len(files))
    done = 0
    for root, dirs, files in os.walk(path): # Uses a generator
        with ThreadPoolExecutor(max_workers=64) as executor:
            futures = [executor.submit(read_file, os.path.join(root, file)) for file in files]
            for future in as_completed(futures):
                done += 1
                if done % 10 == 0:
                    print("Reads done: ", done)
                print(future.result())

if __name__ == '__main__':
    read_parallel('/covid')

Turns out this works fine. Memory usage of goofys was stable around 37 MB for about 15 min of reading files (after which I stopped the process).

Afterwards, I added these lines before starting the os.walk loop:

    for root, dirs, files in os.walk(path):
        files = [os.path.join(root, file) for file in files]
    print("Got files: ", len(files))

This kills goofys because it's no longer a generator (there's something going on under the hood which translates this into a recursive stat syscall). The memory requirement keeps climbing in a bursty pattern (it increases by ~1 GB in a few seconds, then stays there for a couple of minutes before increasing again). The command kept running for about 20 min and goofys was using about 6 GB of memory, after which I ran out of patience. It never OOM'd, but I can see how it may happen for other workloads/buckets.

So what do we do?

Recursive ls is a fundamental challenge on these object stores (aws s3 ls also gets stuck for a long time, though it doesn't leak memory). Goofys probably inherits this problem from the fact that it's backed by an object store.

We could mention in our docs that users should avoid recursive ls operations, and maybe encourage the use of generators to do small ls ops. Any other ideas?
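
For the docs, a rough sketch of the generator-style access pattern we could recommend (a minimal example, not SkyPilot code; the helper names and the /covid mount path are just for illustration):

# Lazily iterate over files under a mounted bucket without materializing the
# whole listing. os.scandir lists one directory at a time, so the FUSE layer
# only has to serve small ls operations.
import os

def iter_files(path):
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from iter_files(entry.path)
        else:
            yield entry.path

def process(file_path):
    with open(file_path, 'rb') as f:
        f.read()

if __name__ == '__main__':
    for file_path in iter_files('/covid'):
        process(file_path)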

TODO

Michaelvll commented 2 years ago

I think one common workload for an ML Dataset class is something like the following:

from pathlib import Path
from typing import List

class Dataset:
  def __init__(self):
    self.train_files: List[Path]
    self.valid_files: List[Path]
    self.test_files: List[Path]

For training (especially for ImageNet), the dataloader wraps the dataset, randomly indexes self.train_files to form each batch, and reads the files in that batch. I suspect the reason Rocky hit the OOM is that before training starts, they traverse the whole directory to find all the file paths under the train/ folder and form the self.train_files list.

I think maybe one way is to ask the user to keep a metadata file train.meta in their s3 bucket that lists all the paths of the training samples.
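
A minimal sketch of what that could look like (train.meta and its one-relative-path-per-line format are assumptions for illustration; paths are resolved against the mount point):

# Build the file list from a metadata file instead of walking the mounted
# bucket, so goofys only serves one small read plus per-sample point reads.
from pathlib import Path
from typing import List

class MetaFileDataset:
    def __init__(self, mount_root: str, meta_file: str = 'train.meta'):
        meta_path = Path(mount_root) / meta_file
        with open(meta_path) as f:
            self.train_files: List[Path] = [
                Path(mount_root) / line.strip() for line in f if line.strip()
            ]

    def __len__(self) -> int:
        return len(self.train_files)

    def __getitem__(self, idx: int) -> bytes:
        # Each sample is read on demand against the mount.
        return self.train_files[idx].read_bytes()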

concretevitamin commented 2 years ago

@romilbhardwaj It seems like the two versions don't differ a lot? Why would one trigger the use of a generator but not the other? OK:

    for root, dirs, files in os.walk(path): # Uses a generator
        with ThreadPoolExecutor(max_workers=64) as executor:
            futures = [executor.submit(read_file, os.path.join(root, file)) for file in files]

Not-OK:

    for root, dirs, files in os.walk(path):
        files = [os.path.join(root, file) for file in files]

@Michaelvll Great point. We should follow up with him to check.

romilbhardwaj commented 2 years ago

Great point @concretevitamin - my crude benchmark above was a bit unfair. I was running both scripts for the same amount of time, but the script that was reading files (the one I labelled as OK) was processing at about a 10x slower rate, so effectively its memory utilization was also going to grow over time.

I updated the script above to also print the number of files ls'd and the number of files read. After reading 60K files (CAUSE_CRASH=False in the script above), the memory utilization of goofys was 371 MB (the baseline is about 40 MB at init). After ls'ing 60K files (CAUSE_CRASH=True), the memory utilization was ~500 MB. So I guess the number of ls calls is problematic.

Todo

romilbhardwaj commented 2 years ago

This is partially alleviated by #726, which reduces goofys' memory utilization by roughly 30-40%, but it still OOMs. Here's an awesome benchmark from @Michaelvll:

For reference, I tried this branch with the following code, adapted from the code provided by @romilbhardwaj in the #622 doc.

It seems read-only mode has no effect on memory usage.

Benchmarks on an m4.2xlarge:

(wandb memory profile plot)

| wandb panel | File listing speed | Mem increase (GB) / M files |
|---|---|---|
| bbe678be (romil’s branch) | 90 K/s | ~3 |
| Read-only option (-o ro) | 90 K/s | ~3 |

import os
import psutil
import wandb

wandb.init(project='goofys-profile')

path = '/data_ro'

file_cnt = 0
for root, dirs, files in os.walk(path):
    for file in files:
        fp = os.path.join(root, file)
        if file_cnt % 100000 == 0:
            wandb.log({'num_files': file_cnt, 'mem_used_gb': psutil.virtual_memory().used / 2**30})
        file_cnt += 1

Since we still OOM after some time, I'll leave this issue open for now.

TODO:

  • Investigate if increasing the stat and type cache TTLs helps (see the sketch after this list)
  • Investigate rclone as an alternative mounting tool
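
A rough sketch of trying larger metadata caches (assuming goofys' --stat-cache-ttl/--type-cache-ttl flags; the bucket and mount point are the ones from the setup above, and extra options such as region/credentials are omitted):

# Remount with longer stat/type cache TTLs to see whether it reduces the
# list/stat pressure. This is a sketch, not a tested configuration.
import subprocess

subprocess.run([
    'goofys',
    '--stat-cache-ttl', '60m',   # keep stat results cached longer
    '--type-cache-ttl', '60m',   # keep file/dir type lookups cached longer
    '-o', 'ro',                  # read-only, matching the benchmark above
    'fah-public-data-covid19-cryptic-pockets',
    '/covid',
], check=True)
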
mraheja commented 2 years ago

I tried benchmarking mounting with rclone, and it seems like the os.walk speed is much, much lower and the memory usage still increases as we read more files:

(wandb benchmark plot)

https://wandb.ai/mraheja/rclone-benchmark

The benchmark code I used:

import os
import psutil
import sys
import wandb

wandb.init(project='rclone-benchmark', name = sys.argv[1])

path = ...

file_cnt = 0
for root, dirs, files in os.walk(path):
    for file in files:
        fp = os.path.join(root, file)
        if file_cnt % 1000 == 0:
            wandb.log({'num_files': file_cnt, 'mem_used_gb': psutil.virtual_memory().used / 2**30})
        file_cnt += 1

The three options for mounting I tried:

rclone mount remote:fah-public-data-covid19-cryptic-pockets ./data
rclone mount remote:fah-public-data-covid19-cryptic-pockets --read-only ./data_readonly
rclone mount remote:fah-public-data-covid19-cryptic-pockets ./data_cached --vfs-cache-mode full 

These were run on a t2.small instance in us-east-2, although I got similar results when running locally too.

I don't know whether this is worth considering further, given the speed limitation and given that it seems like it will also run into an OOM, but I'm also not entirely sure why it runs into these issues. Running rclone ls directly lists files pretty quickly, so I'm unsure why os.walk on the mounted version is so much slower.

concretevitamin commented 2 years ago

Nice investigation, @mraheja.

Running rclone ls directly lists files pretty quickly, so I'm unsure why os.walk on the mounted version is so much slower.

Is there a GitHub issue with rclone on this?

romilbhardwaj commented 2 years ago

Thanks @mraheja! These are very useful insights.

  1. Did you actually encounter an rclone OOM? If not, can you run this for longer (say, a day) on a larger instance (m5 or p3) so that we have a fair performance comparison? t2 instances are severely compute- and bandwidth-constrained.

  2. Have you tried setting --checkers 1 as mentioned here: https://forum.rclone.org/t/oom-with-big-buckets/19140/10

Even if it's slower, I would be interested in using rclone instead of goofys if it is more reliable. The intention of MOUNT mode is to support writes and provide easy access to larger buckets. Performance, while nice to have, is not as hard a requirement as stability is (as discussed in the proposal: https://docs.google.com/document/u/1/d/1uVNTWk44lvzPFTr9flfPZrs-sT6fiKlqnAPROprwEmU)

mraheja commented 2 years ago

Is there a GitHub issue with rclone on this?

Not that I can see, but I'm also not sure if we should expect them to run at the same speed.

@romilbhardwaj I did not encounter an OOM or try --checkers 1. I'll try running a short job with --checkers and then run the best settings for a day to test stability.

Also, given that the purpose is to support writes, do you think the benchmark should be modified to write something to these files?
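
For reference, a minimal sketch of what a write variant of the benchmark above might look like (the target directory, file count, and payload size are placeholders; it writes small files into the mount and logs memory the same way as the read benchmark):

# Write benchmark sketch: create small files under the mount point and log
# memory usage, mirroring the os.walk benchmarks above.
import os
import psutil
import wandb

wandb.init(project='rclone-benchmark', name='write-test')

path = '/data/write_bench'  # placeholder directory inside the mount
os.makedirs(path, exist_ok=True)

payload = os.urandom(1024 * 1024)  # 1 MiB per file (placeholder size)
for i in range(100000):            # placeholder count
    with open(os.path.join(path, f'file_{i}.bin'), 'wb') as f:
        f.write(payload)
    if i % 1000 == 0:
        wandb.log({'num_files_written': i,
                   'mem_used_gb': psutil.virtual_memory().used / 2**30})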

mraheja commented 2 years ago

Turns out rclone doesn't run much faster with the --checkers 1 flag and also runs into an OOM eventually: https://wandb.ai/mraheja/rclone-benchmark. I was using an m4.2xlarge, and it did seem to allow significantly more files than goofys before hitting that OOM, though.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

romilbhardwaj commented 1 year ago

Marking as not stale. Though we haven't heard many OOM complaints from users, we should close this issue only after verifying this problem no longer exists.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 9 months ago

This issue was closed because it has been stalled for 10 days with no activity.