lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0
317 stars · 105 forks

Add System Monitor Class for Enhanced Debugging (CPU and net i/o) #1242

Closed · Usamasource closed 8 months ago

Usamasource commented 8 months ago

Description:

This update introduces the CPUMonitor class, designed to periodically monitor CPU usage when Lithops is operating in debug mode (log_level=debug). The addition of this class aims to facilitate more in-depth performance analysis and troubleshooting.

Key Features:

Periodic Monitoring: The CPUMonitor class periodically checks CPU usage, aggregating data to provide insights into CPU performance during Lithops execution.

Debug Mode Exclusive: To minimize resource consumption, the CPU monitoring functionality is enabled only in debug mode. This ensures that the additional overhead of running a monitoring thread does not impact regular Lithops operations.
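
For reference, a minimal sketch of what such a debug-only periodic monitor could look like, assuming psutil and a background thread (the class below is illustrative, not the exact implementation in this PR):

import threading
import psutil


class CPUMonitor:
    # Illustrative periodic CPU sampler; intended to run only when debug logging is enabled.

    def __init__(self, interval=1.0):
        self.interval = interval
        self.samples = []          # per-sample list of per-CPU usage percentages
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # cpu_percent() reports usage since the previous call, so the very
        # first sample only establishes a baseline and can be discarded.
        while not self._stop_event.wait(self.interval):
            self.samples.append(psutil.cpu_percent(percpu=True))

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        return self.samples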

Apologies for the previous pull request; I ran into an issue with Git while using WSL when opening it. I have removed the unused imports, so the code should now be clean. Thank you.

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.
Usamasource commented 8 months ago

In my latest commit, I renamed the CpuMonitor class to SystemMonitor to better reflect its enhanced functionality. It now monitors data sent and received on the user's machine using psutil, covering all I/O operations, even those that do not go through Lithops storage, such as boto3 usage. The measurements include the user's data plus a small amount of container-related overhead, approximately 0.01 to 0.025 MB, as verified in testing. Additionally, users can now monitor the object storage reads and writes performed by their runtime. What are your thoughts on these changes?
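
As a rough illustration of the delta-based network measurement described above (the class name and interface below are hypothetical, not the exact SystemMonitor code in this PR):

import psutil


class NetIODelta:
    # Capture bytes sent/received between start() and stop() using psutil's cumulative counters.

    def start(self):
        self._start = psutil.net_io_counters(pernic=False)

    def stop(self):
        end = psutil.net_io_counters(pernic=False)
        return {
            'bytes_sent': end.bytes_sent - self._start.bytes_sent,
            'bytes_recv': end.bytes_recv - self._start.bytes_recv,
        }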

JosepSampe commented 8 months ago

I have a few comments:

  1. Not sure if this is true, but based on this comment it seems all the values in the psutil lib are already cumulative. This means you don't need the CPUMonitor thread running every second. If this is true, you only need to add something like this at the end of handler.py (a sketch of the helper referenced here is shown after this list):
    
    cpu_usage, cpu_system_time, cpu_user_time = calculate_cpus_values()
    call_status.add('worker_func_cpu_usage', cpu_usage)
    call_status.add('worker_func_cpu_system_time', round(cpu_system_time, 8))
    call_status.add('worker_func_cpu_user_time', round(cpu_user_time, 8))
    call_status.add('worker_func_cpu_total_time', round(cpu_system_time + cpu_user_time, 8))

    net_io = psutil.net_io_counters(pernic=False)
    call_status.add('worker_func_sent_net_io', net_io.bytes_sent)
    call_status.add('worker_func_recv_net_io', net_io.bytes_recv)



2. In this PR I can see 4 commits that are not relevant to this PR (some made 3 years ago), please remove them

3. You have to update the [api_stats.rst](https://github.com/lithops-cloud/lithops/blob/master/docs/source/api_stats.rst) file and include the new stats you add in the PR
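
Regarding point 1, calculate_cpus_values() is referenced above but not defined; a hypothetical sketch of what such a helper could look like, assuming it simply wraps psutil, is:

import psutil

def calculate_cpus_values():
    # Hypothetical helper (not from the PR): per-CPU usage since the previous
    # cpu_percent() call, plus the machine-wide cumulative system/user CPU times.
    cpu_usage = psutil.cpu_percent(percpu=True)
    cpu_times = psutil.cpu_times()
    return cpu_usage, cpu_times.system, cpu_times.user
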
Usamasource commented 8 months ago

I added start and end methods to SystemMonitor to avoid the use of threads, so no interval needs to be specified. According to the psutil documentation, cpu_times should be called once and its result ignored; the second call then returns the CPU usage since the previous call.
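
A minimal sketch of that start/end pattern, assuming psutil.cpu_percent(interval=None), whose first call only sets a baseline and should be ignored (the actual PR code may use a different psutil call):

import psutil


class SystemMonitor:
    # Illustrative start/stop sampling without a background thread.

    def start(self):
        # The first cpu_percent() call establishes the baseline;
        # its return value is meaningless and is discarded.
        psutil.cpu_percent(percpu=True)

    def stop(self):
        # The second call returns per-CPU usage since start() was called.
        return psutil.cpu_percent(percpu=True)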

abourramouss commented 8 months ago

Hello,

Since we're considering both network and CPU usage, could memory usage also be added? Currently I wrap the Lithops workers in my own pipeline code to extract that information.

Thanks.

Usamasource commented 8 months ago

Hi @JosepSampe, the added monitoring introduces an overhead of approximately 0.00516 seconds to the execution time.

Usamasource commented 8 months ago

Hi @abourramouss,

I've added the key memory metrics to our stats for a better understanding of resource use. Could you check if these metrics look good to you?

JosepSampe commented 8 months ago

> Hi @JosepSampe, the added monitoring introduces an overhead of approximately 0.00516 seconds to the execution time.

Does this include the memory stats? If so (or similar), the overhead is negligible, so you can remove the if task.log_level == logging.DEBUG: check and we can always return the statistics.

Usamasource commented 8 months ago

> Hi @JosepSampe, the added monitoring introduces an overhead of approximately 0.00516 seconds to the execution time.

> Does this include the memory stats? If so (or similar), the overhead is negligible, so you can remove the if task.log_level == logging.DEBUG: check and we can always return the statistics.

Yes, the time measurements we're capturing include memory, CPU stats, and everything else. The maximum overhead I've experienced so far has been 0.01 seconds.

abourramouss commented 8 months ago

@JosepSampe @Usamasource Can you give me your opinion?

I think the current approach for non-debugging purposes is excellent, as it provides an overview of what's happening in each worker with a really small overhead.

I want to propose an approach for debugging purposes; it's something I have been using in a pipeline of mine and it might be useful for Lithops.

In my pipeline (currently about radio interferometry), I wrapped the Lithops futures with my "custom" profiling framework, which consists of a monitoring process spawned alongside the Lithops process. The two processes communicate through a pipe, and the monitoring process constantly polls statistics (every second) from the Lithops process. When the Lithops process finishes, it sends a signal to the monitoring process to stop collecting statistics; those are then sent back to the Lithops process, which forwards them to the client. (Now that I think about it, we could also send them in a streaming fashion, where data is constantly pushed to the client via a socket.)
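
A rough sketch of that pattern, assuming Python's multiprocessing.Pipe and psutil (the names and message protocol below are illustrative, not the actual pipeline code):

import os
import time
import psutil
from multiprocessing import Pipe, Process


def monitor(conn, target_pid, interval=1.0):
    # Poll CPU and memory of the target process until a stop message arrives on the pipe.
    proc = psutil.Process(target_pid)
    samples = []
    while not conn.poll(interval):      # wait up to `interval` seconds for a stop signal
        samples.append({
            'timestamp': time.time(),
            'cpu_percent': proc.cpu_percent(interval=None),
            'rss_bytes': proc.memory_info().rss,
        })
    conn.recv()                         # drain the stop message
    conn.send(samples)                  # ship the collected samples back


if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=monitor, args=(child_conn, os.getpid()))
    p.start()
    # ... run the workload here ...
    parent_conn.send('stop')            # tell the monitor to stop polling
    stats = parent_conn.recv()          # receive the collected samples
    p.join()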

Apart from that, I added a decorator to check the execution times of several functions within a step, which are returned to the client. This results in two types of plots: a Gantt plot showing the execution time of each function within a step, and the SystemMonitor stats. Sadly, there's no automatic way (that I know of) to get the time each function took within each worker, so I had to rely on decorators.
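
For illustration, a hypothetical version of such a timing decorator could be as simple as:

import time
import functools


def timed(func):
    # Hypothetical decorator: records the wall-clock duration of each call.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        wrapper.timings.append((func.__name__, time.time() - start))
        return result
    wrapper.timings = []
    return wrapper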

The plots show the average resource usage and times for each worker; obviously the user would be able to plot whatever they want, such as the aggregated resource usage.

An issue with this approach is that if the polling interval is small and the run is long enough, the logs can get huge.

(image: rebinning_average_runtime_7076_chunksize_512)

JosepSampe commented 8 months ago

@Usamasource Have you checked that the new stats are correct? For example, running a call_async() function that reads 100MB of data from S3, does the worker_func_recv_net_io stat show a value of ~100MB? The same for running a call_async() function that writes 100MB to S3, and so on. And the same for the memory usage: if you read 100MB from S3 and load the object in memory, the memory stats should show slightly more than 100MB.

Why does worker_func_cpu_usage in stats.md show 16 cores? Did you test it on a machine with 16 cores, or did you test it in AWS Lambda and it shows all the cores of the underlying machine? Why are they all 0? If this value is collected properly, can you put a simple real example in stats.md with fewer cores and everything on the same line, like 'worker_func_cpu_usage': [3.0, 2.5, 60.0]?

You are capturing the memory stats using self.process = psutil.Process(process_id) where the process id is the handler.py process id. The jobrunner.py runs in a different child process. Did you verify that the memory usage is correctly captured?

JosepSampe commented 8 months ago

@abourramouss I understand the first Gantt diagram is more related to your app and any user can generate their own, but the metrics you show in the 2nd diagram look nice and general purpose.

Are you using the same library as here (psutil)? How does this differ from what is being added here? Does it make sense to include both functionalities? If so, can you submit a separate PR? It would probably make sense to extend the plots.py file with a new method that can generate those 6 plots.

abourramouss commented 8 months ago

> @abourramouss I understand the first Gantt diagram is more related to your app and any user can generate their own, but the metrics you show in the 2nd diagram look nice and general purpose.

Yes, the metrics in the 2nd diagram are generic and do not need any decorators.

> Are you using the same library as here (psutil)? How does this differ from what is being added here? Does it make sense to include both functionalities? If so, can you submit a separate PR? It would probably make sense to extend the plots.py file with a new method that can generate those 6 plots.

So the difference is that the current PR is more of a global overview of what's happening inside each worker; the great thing about it is that it has very little overhead, so it's a good idea to always have these stats.

Yes, everything is psutil.

I think it does make sense to include both functionalities, but since intuition tells me the detailed one adds a lot of overhead, it might be better to use it only in debugging mode, i.e., when the user doesn't know what's happening inside the worker, or when a process doesn't scale well enough. Then you have lightweight monitoring for everyday use (this PR), and a heavy one (that probes the Lithops process every second) to get detailed statistics when debugging.

GEizaguirre commented 8 months ago

Hi there,

I have been checking the CPU monitoring utility lately, and found an observation worth mentioning (at least in its documentation).

I checked its behaviour with the following CPU-bound function (with s=5000):

import numpy as np

def calculation(s):
    # Create two large matrices
    matrix1 = np.random.rand(s, s)
    matrix2 = np.random.rand(s, s)

    # Perform dot product
    result = np.dot(matrix1, matrix2)
    return result

Executed locally on a single CPU, this gives a CPU usage of almost 99%.

When executed on AWS Lambda (with 1769MB of runtime memory, which should provide the equivalent of a single vCPU), I got the following result (essentially the same as the local run, split in two):

[41.1, 49.6]

This makes sense, as Lambda at 1769MB of runtime memory seems to give you the capacity of a single CPU, spread across two non-concurrent vCPUs.

Maybe the CPU monitoring results should be pairwise-summed, or at least the documentation should include a note on how to interpret them.
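
For instance, for the Lambda result above, the two aggregation options give:

per_vcpu = [41.1, 49.6]                    # value reported on Lambda at 1769MB

total = sum(per_vcpu)                      # ~90.7: usage of the single CPU's worth of capacity
average = sum(per_vcpu) / len(per_vcpu)    # ~45.4: per-vCPU average, which understates actual utilization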

Usamasource commented 8 months ago

> @Usamasource Have you checked that the new stats are correct? For example, running a call_async() function that reads 100MB of data from S3, does the worker_func_recv_net_io stat show a value of ~100MB? The same for running a call_async() function that writes 100MB to S3, and so on. And the same for the memory usage: if you read 100MB from S3 and load the object in memory, the memory stats should show slightly more than 100MB.

Yes, it has been confirmed that the network traffic statistics match the volume of data handled exactly, with a minimal overhead of 0.01 to 0.025 MB. This was verified using various functions such as call_async and map.

> Why does worker_func_cpu_usage in stats.md show 16 cores? Did you test it on a machine with 16 cores, or did you test it in AWS Lambda and it shows all the cores of the underlying machine? Why are they all 0? If this value is collected properly, can you put a simple real example in stats.md with fewer cores and everything on the same line, like 'worker_func_cpu_usage': [3.0, 2.5, 60.0]?

The display of 16 cores in stats.md for worker_func_cpu_usage is due to a test conducted on a local machine with 16 cores. I will update stats.md with a test on a machine with fewer cores to simplify the example.

> You are capturing the memory stats using self.process = psutil.Process(process_id) where the process id is the handler.py process id. The jobrunner.py runs in a different child process. Did you verify that the memory usage is correctly captured?

Using the get_total_memory_usage function below, I measured the total memory usage, including both the main process and its child processes. This method reported 59.87109375 MB. Comparing this with the worker_func_rss value, which was 59.875 MB and represents the physical memory used, shows a close match, which confirms the accuracy of the memory measurement.

import psutil

def get_total_memory_usage(process_id):
    parent = psutil.Process(process_id)
    children = parent.children(recursive=True)
    total_memory_usage = parent.memory_info().rss  # rss is Resident Set Size, the portion of memory held in RAM
    for child in children:
        total_memory_usage += child.memory_info().rss
    return total_memory_usage

JosepSampe commented 8 months ago

@Usamasource Thanks for the verification. Is the PR ready now? Any comments on German's post above?

Usamasource commented 8 months ago

Hi @JosepSampe,

Now it's ready, but I'll continue working on the monitoring part to monitor the nodes.

As for @GEizaguirre's post, in this Lambda case summing the CPU usages is the best option. However, in a local environment with 8 cores, summing in pairs would produce 4 values and hide the fact that there are 8 cores. Would it be better to explain this difference in the documentation, or is it preferable to compute the average CPU usage and consolidate it into a single value? I reported per-CPU values to give more detail on how the load is distributed among the CPUs and to verify whether the machine's resources are being fully utilized.