Closed: concretevitamin closed this issue 9 months ago.
EDIT - see comments below.
Initial investigation:
☠☠☠ Recursive `ls` operations on big buckets kill goofys ☠☠☠
✅✅✅ goofys is fine when `ls` operations are done on small dirs. Using generators to recursively go through files, even in big buckets, is OK. ✅✅✅
s3://fah-public-data-covid19-cryptic-pockets seems like a good candidate - ~I don't know how big it is because `ls` takes a very long time (good sign)~ The website says it's 7 TB. It has a 4-5 level deep hierarchy of folders with many ~5 MB files.
```yaml
name: jupyter

resources:
  cloud: aws

file_mounts:
  /covid:
    source: s3://fah-public-data-covid19-cryptic-pockets
    mode: MOUNT

setup: |
  pip install --upgrade pip
  conda init bash
  conda activate jupyter
  conda create -n jupyter python=3.9 -y
  conda activate jupyter
  pip install jupyter

run: |
  conda activate jupyter
  jupyter notebook --port 8888
```
Wrote some code to read files in parallel and print their sizes:
```python
# Read all files in a directory recursively in parallel
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

CAUSE_CRASH = False

def sizeof_fmt(num, suffix="B"):
    for unit in ["", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"]:
        if abs(num) < 1024.0:
            return f"{num:3.1f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f}Yi{suffix}"

def read_file(file):
    print(file)
    with open(file, 'rb') as f:
        x = f.read()
        # print("File: " + file)
        # print("Size of file is :", sizeof_fmt(int(f.tell())))

def read_parallel(path):
    files = []
    print(f"Reading files in parallel {path}")
    if CAUSE_CRASH:
        ls_done = 0
        for root, dirs, files in os.walk(path):
            files = [os.path.join(root, file) for file in files]
            ls_done += 1
            if ls_done % 10 == 0:
                print("ls done: ", ls_done)
        print("Got files: ", len(files))
    done = 0
    for root, dirs, files in os.walk(path):  # Uses a generator
        with ThreadPoolExecutor(max_workers=64) as executor:
            futures = [executor.submit(read_file, os.path.join(root, file)) for file in files]
            for future in as_completed(futures):
                done += 1
                if done % 10 == 0:
                    print("Reads done: ", done)
                print(future.result())

if __name__ == '__main__':
    read_parallel('/covid')
```
Turns out this works fine. Memory usage of goofys was stable around 37 MB for about 15 min of reading files (after which I stopped the process).
Afterwards, I added these lines before starting the `os.walk` loop:

```python
for root, dirs, files in os.walk(path):
    files = [os.path.join(root, file) for file in files]
print("Got files: ", len(files))
```
This kills goofys because it's no longer used as a generator (something under the hood translates this into recursive stat syscalls). Memory usage keeps climbing in a bursty pattern (it increases by ~1 GB in a few seconds, then stays there for a couple of minutes before increasing again). The command kept running for about 20 minutes and goofys was using about 6 GB of memory, after which I ran out of patience. It never OOM'd, but I can see how it might for other workloads/buckets.
Recursive `ls` is a fundamental challenge on these object stores (`aws s3 ls` also gets stuck for a long time, though it doesn't leak memory). goofys probably inherits this problem from being backed by an object store.
We could mention in our docs that users should avoid recursive `ls` operations, and maybe encourage the use of generators to do small `ls` ops (see the sketch below). Any other ideas?
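For the docs, a minimal sketch of what that generator-style guidance could look like (the `/covid` path is the mount from the example above; the 1,000-file cutoff is just a placeholder, not a real limit):

```python
import os

def iter_files(path):
    """Lazily yield file paths one directory at a time, so the mount only
    has to serve one small listing at any point."""
    for root, dirs, files in os.walk(path):  # os.walk itself is a generator
        for name in files:
            yield os.path.join(root, name)

# Consume lazily; avoid materializing every path in the bucket at once.
for i, fp in enumerate(iter_files('/covid')):
    if i >= 1000:  # process in small chunks / stop early
        break
    # ... read or process fp here ...
```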
I think one common workload for an ML `Dataset` class is something like the following:
```python
from pathlib import Path
from typing import List

class Dataset():
    def __init__(self):
        self.train_files: List[Path]
        self.valid_files: List[Path]
        self.test_files: List[Path]
```
For training (especially for ImageNet), the dataloader will wrap the dataset and randomly index `self.train_files` to form each batch, then read the files in that batch. I suspect the reason Rocky hit the OOM is that, before training starts, the whole directory is traversed to find all the file paths under the `train/` folder and build the `self.train_files` list.
I think maybe one way is to ask the user to keep a metadata file `train.meta` in their S3 bucket that lists all the paths of the training samples.
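A rough sketch of what that could look like, assuming `train.meta` simply contains one sample path per line (the class name, file format, and `/covid` mount path here are illustrative, not an existing API):

```python
from pathlib import Path
from typing import List


class MetaFileDataset:
    def __init__(self, meta_file: str = '/covid/train.meta'):
        # Read the precomputed list of sample paths from the metadata file
        # instead of os.walk'ing the mounted bucket, so no recursive listing
        # ever hits the FUSE layer.
        with open(meta_file) as f:
            self.train_files: List[Path] = [
                Path(line.strip()) for line in f if line.strip()
            ]
```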
@romilbhardwaj It seems like the two versions don't differ a lot? Why would one trigger the use of a generator but not the other?

OK:

```python
for root, dirs, files in os.walk(path):  # Uses a generator
    with ThreadPoolExecutor(max_workers=64) as executor:
        futures = [executor.submit(read_file, os.path.join(root, file)) for file in files]
```

Not-OK:

```python
for root, dirs, files in os.walk(path):
    files = [os.path.join(root, file) for file in files]
```
@Michaelvll Great point. We should follow up with him to check.
Great point @concretevitamin - my crude benchmark above was a bit unfair. I was running both scripts for the same amount of time, but the script that was reading files (the one I labelled as OK) was processing at about a 10x slower rate, so effectively its memory utilization was also going to grow over time.
I updated the script above to also print the number of files ls'd and the number of files read. After reading 60K files (`CAUSE_CRASH=False` in the script above), the memory utilization of goofys was 371 MB (baseline is about 40 MB at init). After ls'ing 60K files (`CAUSE_CRASH=True`), the memory utilization was ~500 MB. So I guess the number of ls calls is the problematic part.
Reads on many files (not just `ls`'ing them) also cause this bug. This is partially alleviated with #726, which reduces goofys' memory utilization by roughly 30-40%, but it still OOMs. Here's an awesome benchmark from @Michaelvll:
For reference, I tried this branch with the following code, adapted from the code provided by @romilbhardwaj in the #622 doc. It seems read-only mode has no effect on memory usage.
Benchmarks on an `m4.2xlarge`:

| wandb panel | File listing speed | Mem increase (GB) / M files |
| --- | --- | --- |
| bbe678be (romil's branch) | 90 K/s | ~3 |
| Read-only option (`-o ro`) | 90 K/s | ~3 |

```python
import os
import psutil
import wandb

wandb.init(project='goofys-profile')
path = '/data_ro'
file_cnt = 0
for root, dirs, files in os.walk(path):
    for file in files:
        fp = os.path.join(root, file)
        if file_cnt % 100000 == 0:
            wandb.log({'num_files': file_cnt, 'mem_used_gb': psutil.virtual_memory().used / 2**30})
        file_cnt += 1
```
Since we still OOM after some time, I'll leave this issue open for now.
I tried benchmarking by mounting with rclone, and it seems like the `os.walk` speed is much, much lower, while memory usage still increases as we read more files: https://wandb.ai/mraheja/rclone-benchmark
The benchmark code I used:
```python
import os
import psutil
import sys
import wandb

wandb.init(project='rclone-benchmark', name=sys.argv[1])
path = ...
file_cnt = 0
for root, dirs, files in os.walk(path):
    for file in files:
        fp = os.path.join(root, file)
        if file_cnt % 1000 == 0:
            wandb.log({'num_files': file_cnt, 'mem_used_gb': psutil.virtual_memory().used / 2**30})
        file_cnt += 1
```
The three options for mounting I tried:

```
rclone mount remote:fah-public-data-covid19-cryptic-pockets ./data
rclone mount remote:fah-public-data-covid19-cryptic-pockets --read-only ./data_readonly
rclone mount remote:fah-public-data-covid19-cryptic-pockets ./data_cached --vfs-cache-mode full
```
These were done on a `t2.small` instance in `us-east-2`, although I did get similar results when running locally too.
I don't know whether this is worth considering further, given the speed limitation and given that it seems like it will also run into an OOM, but I'm also not entirely sure why it runs into these issues. Running `rclone ls` directly lists files pretty quickly, so I'm unsure why `os.walk` on the mounted version is so much slower.
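If it helps narrow this down, a rough sketch for timing the two side by side on the same prefix (the `prefix` value, `remote` name, and `./data` mount path are placeholders from the commands above, not something I've run):

```python
import os
import subprocess
import time

prefix = 'some/subdir'  # placeholder: pick a prefix small enough to finish

# List the prefix through rclone's backend directly.
start = time.time()
subprocess.run(
    ['rclone', 'ls', f'remote:fah-public-data-covid19-cryptic-pockets/{prefix}'],
    stdout=subprocess.DEVNULL, check=True)
print(f'rclone ls:        {time.time() - start:.1f}s')

# List the same prefix through the FUSE mount with os.walk.
start = time.time()
n = sum(len(files) for _, _, files in os.walk(os.path.join('./data', prefix)))
print(f'os.walk on mount: {time.time() - start:.1f}s ({n} files)')
```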
Nice investigation, @mraheja.
> Running `rclone ls` directly lists files pretty quickly so unsure why `os.walk` on the mounted version is so much slower.
Is there a GitHub issue with rclone on this?
Thanks @mraheja! These are very useful insights.
Did you actually encounter an rclone OOM? If not, can you run this for longer (say, a day) on a larger instance (m5 or p3) so that we have a fair performance comparison? t2 instances are severely compute- and bandwidth-constrained.
Have you tried setting `--checkers 1`, as mentioned here? https://forum.rclone.org/t/oom-with-big-buckets/19140/10
Even if it's slower, I would be interested in using rclone instead of goofys if it is more reliable. The intention of `MOUNT` mode is to support writes and provide easy access to larger buckets. Performance, while nice to have, is not as hard a requirement as stability (as discussed in the proposal: https://docs.google.com/document/u/1/d/1uVNTWk44lvzPFTr9flfPZrs-sT6fiKlqnAPROprwEmU).
> Is there a GitHub issue with rclone on this?
Not that I see, but I'm also not sure we should expect them to run at the same speed.
@romilbhardwaj I did not encounter an OOM or try `--checkers 1`. I'll try running a short job with checkers and then run the best settings for a day to test stability.
Also, given that the purpose is to support writes, do you think the benchmark should be modified to write something to these files?
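Something like this could slot next to the read benchmark (the `/data_rw` mount point, file count, and ~5 MB payload are assumptions for illustration, not settings from the runs above):

```python
import os
import psutil
import wandb

wandb.init(project='rclone-benchmark', name='write-test')
path = '/data_rw'                # hypothetical writable mount point
payload = os.urandom(5 * 2**20)  # ~5 MB, roughly matching the files in this bucket

# Write files through the mount and log memory usage as we go.
for i in range(10000):
    with open(os.path.join(path, f'bench_{i}.bin'), 'wb') as f:
        f.write(payload)
    if i % 100 == 0:
        wandb.log({'num_writes': i, 'mem_used_gb': psutil.virtual_memory().used / 2**30})
```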
Turns out rclone doesn't run much faster with the `--checkers 1` flag and also runs into an OOM eventually: https://wandb.ai/mraheja/rclone-benchmark. I was using an `m4.2xlarge`, and it did seem to allow significantly more files than goofys before running into that OOM, though.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Marking as not stale. Though we haven't heard many OOM complaints from users, we should close this issue only after verifying this problem no longer exists.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
User report:
I was trying out the s3 file system mount. With some config like:
This worked OK initially and I was able to start a training job and run initialization of the dataloader. However, halfway through I started getting errors like
And if I check the mounted folder it’s gone (also gone in df -h). When I first started the instance it was OK though.
from syslog, looks like goofys ran out of memory lol
using a g4dn.xlarge currently.
Usage pattern:
there might be some concurrent operations after the directory listing steps too, for data loading.
TODO: reproduce, investigate, mitigate?