googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

read_from on a large (~2 GB) file => OOM crash #785

Open VelizarVESSELINOV opened 8 years ago

VelizarVESSELINOV commented 8 years ago

When we try to read_from() a bucket storage file whose size is close to 2 GB, the system goes down even though the VM has 30 GB of RAM.

from io import StringIO
from time import time

# Inside a loop over the files in the bucket:
f_size = data_bucket.item(file).metadata().size * 1e-9  # object size in GB
print("File size (GB): {}".format(f_size))

if f_size > 1.9:
    continue  # OOM crash a few seconds after read_from(), even on a 30 GB RAM VM

n1 = time()
buffer = StringIO(data_bucket.item(file).read_from())  # whole object read into memory
n2 = time()

From "Serial console output":

Mar  8 21:28:50 gae-datalab-main-blabla kernel: [69866.294706] Out of memory: Kill process 14737 (python) score 956 or sacrifice child
Mar  8 21:28:50 gae-datalab-main-blabla kernel: [69866.302664] Killed process 14737 (python) total-vm:31896168kB, anon-rss:29541368kB, file-rss:0kB

With this message dialog:

Kernel restarting
The kernel appears to have died. It will restart automatically.
OK
VelizarVESSELINOV commented 7 years ago

I stopped using the datalab API for Storage and BigQuery; I now use only the google-cloud packages, for more portable code.
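
For reference, a minimal sketch (not from this thread) of what the google-cloud-storage equivalent might look like, streaming the object to local disk instead of holding ~2 GB in a Python string; the bucket and object names below are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-bucket")    # hypothetical bucket name
blob = bucket.blob("exports/big_file.dat")  # hypothetical object path

blob.reload()  # fetch object metadata, including size
print("File size (GB): {}".format(blob.size * 1e-9))

# download_to_filename streams the object to disk in chunks, so peak memory
# stays bounded instead of growing with the object size.
blob.download_to_filename("/tmp/big_file.dat")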

pgrosu commented 7 years ago

@VelizarVESSELINOV That's interesting! Are you custom-building the VM instances and creating key-value stores from scratch, and applying your own relational algebra on top of that? I'm just curious about the level of granularity and which cloud packages you use to approach portability.

yahyamortassim commented 7 years ago

Any updates on this issue? I'm facing the same problem.

VelizarVESSELINOV commented 7 years ago

@pgrosu there are a lot of custom file formats that need to be parsed, and for that a readable StringIO is required. After parsing, the output is a pandas DataFrame, which is optimal for data engineering/analytics/AI. The VM was custom-defined on the server side.
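
As an illustration of that workflow, a hedged sketch assuming a CSV-like format and placeholder names (the actual custom formats in the thread would need their own parsers): downloading to disk and parsing in chunks avoids building a ~2 GB StringIO in memory.

import pandas as pd
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-data-bucket").blob("exports/big_file.csv")  # placeholders

local_path = "/tmp/big_file.csv"
blob.download_to_filename(local_path)  # streamed to disk, not into a str

# Parse in chunks so the parser never materializes the whole text at once;
# the concatenated DataFrame still lives in memory, but the extra full-text
# copies (the read_from() string plus the StringIO buffer) are avoided.
chunks = pd.read_csv(local_path, chunksize=1000000)
df = pd.concat(chunks, ignore_index=True)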