Add memory benchmarks - Githubissues

braceletboy commented 4 years ago

Hi @zoq and @rcurtin , pardon me for contributing to this repository after a long break. So, the last time we talked about benchmarking memory usage of the machine learning algorithms on the IRC. So, I have made some changes to the repository to do this, which I am presenting here. I made some minor fixes in some of the initial commits. The main commits corresponding to memory benchmarking are c33047a and db9df17. Let me know what you think of these changes.

The f1ee3aa commit was made because running make setup doesn't install the packages directly and can throw errors if dependencies are not satisfied. Also, I don't think that these dependencies apply to all kinds of Linux systems. Mine is Ubuntu 18.04, and I have limited knowledge of other systems to talk about these dependencies confidently.

PS: This is not the final pull request. Feedback is needed.

zoq commented 4 years ago

Thanks for looking into this, excited to see some first results.

braceletboy commented 4 years ago

@zoq So, I have some questions regarding the memory benchmarking system. 1) The memory_usage function returns the max. memory usage (when the max_usage argument is True) of the current process we are benchmarking and also it's children (when the include_children argument is True). In this, we are taking into account, the memory assigned to the process as a whole. This might include the memory assigned by other parts of the program (the benchmarking support structure in run.py), the memory used for data. So, I have two sub-questions in this regard:

a) Should the memory usage for the data be included? b) Should the memory usage for the support structure be included? (which seems like an obvious no. During some initial analysis, I used the psutil's memory_info function to the memory underuse just before the loading of the benchmark script in run.py. It seems to vary from ml algorithm to another but seems to be considerable. Around 30-60MB .)

2) After some careful analysis of the memory_profiler code, I came to understand that the memory_usage function benchmarks the memory by spawning a child process that tracks the memory of the parent process by using the psutil's memory_info function at regular intervals. This might mean that when we use the memory_usage function with include_children=True, we are also including the memory used for this child process that does the tracking. Will this be an issue?

zoq commented 4 years ago

a) Should the memory usage for the data be included?

For the runtime benchmarks, we do not include data loading as the focus is on the method runtime; I think for the memory benchmarks we should do the same.

For the data loading part, some libs support different formats like loading binary, which are often faster or some libs do some data encoding which isn't part of the method itself, so to do a fair comparison it makes sense to exclude the data loading part.

b) Should the memory usage for the support structure be included?

If we can avoid that, which I think we can if we put the memory benchmark inside the method script we should do that.

After some careful analysis of the memory_profiler code, I came to understand that the memory_usage function benchmarks the memory by spawning a child process that tracks the memory of the parent process by using the psutil's memory_info function at regular intervals. This might mean that when we use the memory_usage function with include_children=True, we are also including the memory used for this child process that does the tracking. Will this be an issue?

Actually, I think we want to track child processes as well, to track methods that split the process into multiple children.

braceletboy commented 4 years ago

@zoq

For the runtime benchmarks, we do not include data loading as the focus is on the method runtime; I think for the memory benchmarks we should do the same.

I felt the same initially, but then I realized that when the benchmark scripts use subprocess, I had no way to separate the memory used for data loading from memory used for the actual algorithm. See line 47 and line 66 of the mlpack's allkfn benchmark; the data only gets loaded inside the subprocess that runs the mlpack_kfn binary with the given configured options. The same holds for MATLAB, R, ann, dibml, flann - whose scripts use subprocess.

So, a) Do you think there is a way to separate the memory used for data loading and the memory used for the algorithm when we use subprocess? b) Also, isn't it unfair for these libraries when the data loading is getting included in their runtime benchmark?

Actually, I think we want to track child processes as well, to track methods that split the process into multiple children.

Cool.

zoq commented 4 years ago

I felt the same initially, but then I realized that when the benchmark scripts use subprocess, I had no way to separate the memory used for data loading from memory used for the actual algorithm. See line 47 and line 66 of the mlpack's allkfn benchmark; the data only gets loaded inside the subprocess that runs the mlpack_kfn binary with the given configured options. The same holds for MATLAB, R, ann, dibml, flann - whose scripts use subprocess.

I see that is tricky, in the past I used valgrind massif to track the memory consumption, so maybe we can do something similar? That would include the data loading part as well, but a user could check the results and manually filter out the data loading part? On a second thought, I think I even like to include the data loading part, as we want to show what amount of memory is used by a specifc method, each lib has to hold the data somehow, the format shouldn't matter and if some lib uses e.g. a sparse matrix we should account for that.

Also, isn't it unfair for these libraries when the data loading is getting included in their runtime benchmark?

We time the data loading/saving part and subtract it from the overall runtime; see:

https://github.com/mlpack/benchmarks/blob/c11f08d6fad54ebf2c95f6b73e0c7af2c801fb4e/methods/mlpack/allkfn.py#L76

for an example. We do the same for MATLAB , R, etc.

braceletboy commented 4 years ago

I see that is tricky, in the past I used valgrind massif to track the memory consumption, so maybe we can do something similar? That would include the data loading part as well, but a user could check the results and manually filter out the data loading part? On a second thought, I think I even like to include the data loading part, as we want to show what amount of memory is used by a specific method, each lib has to hold the data somehow, the format shouldn't matter and if some lib uses e.g. a sparse matrix we should account for that.

So we have the following options: 1) Track the memory consumption in such a way that we can separate out the data loading part and give the option of including it to the user. I like this idea but it needs to be seen how this can be done. I will look into valgrind massif. 2) Include all the memory consumption into one. This means that the above code doesn't need any changes? Let me know :)

We time the data loading/saving part and subtract it from the overall runtime; see:

https://github.com/mlpack/benchmarks/blob/c11f08d6fad54ebf2c95f6b73e0c7af2c801fb4e/methods/mlpack/allkfn.py#L76

for an example. We do the same for MATLAB , R, etc.

That's great. I didn't see that line. :)

zoq commented 4 years ago

Personally I would go with option 2, as holding the data is part of how a method performances memory wise. We might want to take a look into massif anyways, as it might return provide some more useful informations. Let me know what you think.

mlpack / benchmarks

Add memory benchmarks #142