pythonspeed / filprofiler

A Python memory profiler for data processing and scientific computing applications
https://pythonspeed.com/products/filmemoryprofiler/
Apache License 2.0

Fil profiler fails for fairly large dataset #492

Closed · mattgerg12 closed this issue 1 year ago

mattgerg12 commented 1 year ago

I am doing some benchmarking and thought about using fil.

I am loading a 3 GB CSV file on my Mac with 16 GB of RAM. I can load the file completely fine in Python 3 with the Fil kernel, but when I use %%filprofile the kernel dies, and upon checking the logs I found an out-of-memory.svg file in a temp folder showing around 6000 MB of memory. This shouldn't be the case, as I can load the data completely fine without Fil, and other profilers work with no issues (like below).
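
Roughly what the failing cell looks like (the file name here is just a placeholder):

%%filprofile
import pandas as pd
df = pd.read_csv("my_3gb_file.csv")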

I am wondering why this happens when I use %%filprofile. All other profilers work fine; for example, I tried memory_profiler and it shows the following.

[Screenshot 2023-02-25 at 6:52 PM: memory_profiler output]

I didn't get any issue when I tested Fil by profiling 1+1.

[Screenshot 2023-02-25 at 6:43 PM: %%filprofile output for 1+1]

Is Fil designed only for small-dataset cases? What can I try from my end to make Fil work?

Just as a side note, how different is Fil compared to memory_profiler, which also gives the peak memory and increment? Are there any strong reasons Fil is better, other than the nice graphical view? I do understand from your article why it is better than sys.getsizeof() and memory_usage().

filprofiler==2023.1.0 Python 3.11.0

itamarst commented 1 year ago
  1. Fil's out-of-memory heuristic can sometimes go wrong. Try running it with --disable-oom-detection and see if that helps. The out-of-memory SVG should probably suggest that.
  2. For the 1+1 case, it's probably just so little memory that it doesn't show up (0 bytes!). 1 and 2 are pre-allocated in Python by default. So that's a bug I should fix; it should at least indicate that.
  3. In theory memory_profiler can give you the same-ish info, it just takes a lot more time because it only gives you line-by-line info.

Try with --disable-oom-detection and see if that helps.
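
For reference, outside of Jupyter the flag is passed on the command line, something like the following (the script name is a placeholder, and the exact flag placement may differ; check fil-profile --help):

fil-profile run --disable-oom-detection yourscript.py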

itamarst commented 1 year ago

Oh, except the OOM detection disabling thing is not currently available in Jupyter. I'll try to fix that (or just disable it by default, we'll see).

mattgerg12 commented 1 year ago

Thanks for getting back. I feel the 1+1 situation is not urgent, as there might not be a need to profile something like that (but yes, it's worth adding some warning).

I feel Fil support in Jupyter is good to have, especially for data scientists, and the graph Fil produces is easy for them to interpret; so I think having this option disabled in Jupyter will help us out.

memory_profiler can give the peak and increment memory for an entire cell in Jupyter when used with the %%memit magic. Our data scientists use it a lot, but I thought of moving to Fil because of the extra information and the intuitive features.
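
For context, the pattern looks roughly like this (the file name is just a placeholder); %load_ext goes in one cell, the measured code in another:

%load_ext memory_profiler

%%memit
import pandas as pd
df = pd.read_csv("data.csv")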

mattgerg12 commented 1 year ago

I am now a bit confused about OOM, considering my laptop has 16 GB of RAM.

If my dataset uses around 23 GB according to memory_usage(deep=True), then why am I not getting OOM when loading the dataset, and why does memory_profiler give a peak of around 6 GB?

Is it the case that 23 GB of virtual (allocated) memory is used, with around 17 GB swapped to disk and around 6 GB in RAM? If so, what is the maximum amount of data that can be handled on my machine using pandas? Is it limited by the free hard disk space?

itamarst commented 1 year ago

So, yes, one thing to keep in mind is that memory_profiler and Fil measure two different things: the former measures peak resident memory (i.e. what is actually in RAM), the latter measures how much memory you requested. https://pythonspeed.com/articles/measuring-memory-python/ has a write-up with more details.
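
As a rough illustration of the difference, you can compare the two numbers yourself (the path is a placeholder; note that ru_maxrss is reported in bytes on macOS but in kilobytes on Linux):

>>> import resource
>>> import pandas as pd
>>> df = pd.read_csv("data.csv")
>>> df.memory_usage(deep=True).sum() / 1e9              # logical size of the DataFrame's data, in GB
>>> resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak resident set size of the process so far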

I'm told that macOS will keep swapping to disk until it has used about twice as much disk as you have RAM, and only then start failing memory allocations or killing your process. So in your case, 32 GB on disk. But that would be pretty slow.

The out-of-memory detection Fil does, and which I guess I should consider just disabling on macOS, uses heuristics to guess when OOM is approaching, and sometimes it triggers too soon.

You can try Fil with most other applications closed (or with just a couple of browser tabs open) and see if that gets you further.

itamarst commented 1 year ago

Hopefully I'll have a release with OOM detection disabled in an hour or two, or tomorrow if tests fail.

As another option, I also work on a commercial Python profiler for data science: https://sciagraph.com

Pros compared to Fil:

Cons compared to Fil:

itamarst commented 1 year ago

Oh, and re 1+1, it's quite possible it literally allocated no memory: numbers below a certain size are pre-allocated objects in Python. For example, here you can see that if you create two large numbers they have different addresses in memory, but 1 is always the same address (this is an implementation detail, don't rely on it in code...):

>>> x = 1_000_000
>>> y = 1_000_000
>>> x == y
True
>>> x is y
False
>>> x = 1
>>> y = 1
>>> x is y
True
mattgerg12 commented 1 year ago

@itamarst Thanks for fixing both the OOM and the 1+1 issue. I am able to get Fil working in Jupyter now.

[Screenshot 2023-03-02 at 11:15 AM: Fil working in Jupyter]

It works fine and the graph shows up in Jupyter when profiling small tasks. But when working with big datasets (for example, in my case loading a 10 GB CSV file), the graph is not showing even though it is being generated, like below.

[Screenshot 2023-03-02 at 11:06 AM: Fil output for the 10 GB load, graph not displayed]

I checked those folders and the file exists, and it shows a peak of 53745 MB. Is there any reason why the graph is not showing in my Jupyter notebook?

[Screenshot 2023-03-02 at 11:09 AM: generated Fil report showing a 53745 MB peak]

Also, the above number (53745 MB) is allocated memory, right? Where can I find peak resident memory in Fil? (At least it is not showing in the above graph; it only shows peak tracked memory usage.)

https://sciagraph.com/ looks interesting. Can I track any tickets to see when Mac support becomes available?

mattgerg12 commented 1 year ago

I was reading https://pythonspeed.com/articles/python-out-of-memory/. Does the Fil profiler also report how it failed, as mentioned in the article?

"Of course, segfaults happen for other reasons as well, so to figure out the cause you'll need to inspect the core file with a debugger like gdb, or run the program under the Fil memory profiler."

In the above case, loading the 10 GB file succeeds with no issues. If we load that same 10 GB file again into a different variable, that also works with no memory issues. I was curious why it is not failing, and checked whether the second 10 GB load uses the same address space as the first one (similar to a simple copy case), but it is using a different address space.

But when we load a 20 GB file (a CSV file made by row-binding the 10 GB file twice), the Jupyter kernel crashes. It would be great if we could get the reason for the failure, as mentioned in the article.

itamarst commented 1 year ago
  1. Fil can give the reason for out-of-memory, yes... but that's the feature I just disabled on macOS because it's too trigger-happy :) I opened #494 as an issue to re-enable it. And, as you saw, the Jupyter integration isn't great either. Sciagraph will eventually grow something similar, and hopefully more robust, insofar as it's also supposed to be good enough to run in production; probably as a first pass it'll generate multiple reports over the job's runtime.
  2. The operating system will eventually kill processes for using too much memory, or just fail an allocation (which can have a variety of symptoms), but when and how it does this is a policy question that varies by OS and even OS version. E.g. newer Ubuntu added a userspace OOM killer in addition to the kernel OOM killer. So it's hard to answer why one situation causes a failure when another does not; instead of chasing the exact failure mode, you want to focus on the memory usage itself:
  3. Given you can see memory usage for the 10 GB file, I would just focus on reducing memory usage there; that'll solve the OOM issue for the 20 GB file as well (see the sketch after this list).
  4. I just got the first end-to-end tests working for Sciagraph on macOS, so it's making good progress. If you give me your email (or email itamar@pythonspeed.com if you don't want to post it here), I can let you know when it's released.
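
As a sketch of what "reducing memory usage" can look like in pandas (the column names, dtypes, and path here are hypothetical; adapt them to your data):

import pandas as pd

# Read only the columns you need, with narrower dtypes where possible.
df = pd.read_csv(
    "big.csv",
    usecols=["id", "value", "category"],
    dtype={"id": "int32", "value": "float32", "category": "category"},
)

# Or process the file in chunks instead of loading it all at once.
total = 0.0
for chunk in pd.read_csv("big.csv", usecols=["value"], chunksize=1_000_000):
    total += chunk["value"].sum()
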
itamarst commented 1 year ago

Hi, just FYI: I just released the macOS version of Sciagraph.