plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Table shows misleading results for Yappi #17

Closed: sumerc closed this issue 4 years ago

sumerc commented 4 years ago

Hi Emery,

I am the author of Yappi and came across your work by chance. It looks like a nice project.

I have a simple request, if possible: the main page shows Yappi with an ~18x slowdown whereas cProfile shows ~2x, but the real issue is that Yappi profiles CPU time by default while cProfile profiles wall time by default. That alone makes roughly a 10x difference, since reading hardware clocks is an expensive operation. To make the comparison fair, I would kindly request that you profile the same application (julia_nopil.py) with the clock type set to wall. Here is the command I used to do that; the results indicate only a 2x-2.5x slowdown, roughly in line with cProfile (which is what should be the case). Currently the table makes Yappi look even slower than line_profiler :)

> yappi -c wall julia_nopil.py
playground » python julia_nopil.py                                      
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 19.6705131531 seconds
Total sum of elements (for validation): 33219980
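
The same switch can also be made from Yappi's Python API instead of the command line; a minimal sketch (run_workload() is just a placeholder for the benchmark, not a real function in the script):

import yappi

yappi.set_clock_type("wall")   # default is "cpu"
yappi.start()
run_workload()                 # placeholder: e.g. compute the Julia set as julia_nopil.py does
yappi.stop()
yappi.get_func_stats().print_all()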

Thanks!

emeryberger commented 4 years ago

Hi - thanks for your note!

I guess I am not sure what you would like to appear here - I do think it's reasonable to use the default settings for tools, given that that is the expected common use case. That said, I can include both results. I observe in passing that Scalene uses CPU time.

I do wonder why reading the high-performance timer (which I assume is nothing but a shim over something like RDTSC) is expensive: it should in principle be far cheaper than a system call to get elapsed virtual time. FWIW, Scalene uses virtual (scheduled) elapsed CPU time; wall-clock time is pretty noisy.
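
A rough way to compare the per-call cost of these clocks from Python itself, as a minimal sketch using only the standard library (results will vary by platform and kernel):

import time

def per_call_cost(clock, n=1_000_000):
    # Time n back-to-back reads of the given clock and return the approximate
    # average cost per call (loop overhead included; this is a rough comparison).
    start = time.perf_counter()
    for _ in range(n):
        clock()
    return (time.perf_counter() - start) / n

for name, clock in [("perf_counter", time.perf_counter),   # wall clock
                    ("process_time", time.process_time),   # process CPU time
                    ("thread_time", time.thread_time)]:    # per-thread CPU time
    print(f"{name:13s} {per_call_cost(clock) * 1e9:8.0f} ns/call")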

sumerc commented 4 years ago

Hi,

> I guess I am not sure what you would like to appear here - I do think it's reasonable to use the default settings for tools, given that that is the expected common use case. That said, I can include both results.

Yes, I agree it is reasonable to use default settings, but in this case it gives the false impression that Yappi has ~20x(!) overhead compared to cProfile while measuring the same thing (when in fact it is not measuring the same thing) :) That is why I kindly asked whether it would be possible to add wall-time results. It would be great if you could include both results, too. :)

> I do wonder why reading the high-performance timer (which I assume is nothing but a shim over something like RDTSC) is expensive: it should in principle be far cheaper than a system call to get elapsed virtual time. FWIW, Scalene uses virtual (scheduled) elapsed CPU time.

I remember looking into this problem years ago, digging through the Linux kernel for clues, but I do not remember the details very well. I will also take a look at your solution. Thanks!

UPDATE: I can see that you use process_time(), which is (AFAIR) equivalent to clock_gettime(CLOCK_PROCESS_CPUTIME_ID). Yappi uses clock_gettime(CLOCK_THREAD_CPUTIME_ID), since it is trying to get per-thread CPU time. Both of these APIs seem to add some overhead.
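
For what it's worth, that mapping can be checked from Python directly on Linux, since the time module exposes the same clock IDs; a minimal sketch (illustration only, not code from either profiler):

import time

# Process-wide CPU time: process_time() should track this clock.
print(time.process_time())
print(time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID))

# Per-thread CPU time, the clock Yappi reads by default.
print(time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID))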

I have also written a simple benchmark application; see:

#include "time.h"
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

struct timeval curtime;
struct timespec tp;

int main(int argc, char **argv)
{
    int iterations = atoi(argv[1]);
    int clk_type = atoi(argv[2]);

    for (int i = 0; i < iterations; i++)
    {
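        // Read the selected clock once per iteration so that its per-call cost
        // dominates the measured runtime.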
        switch(clk_type) {
            case 1:
                gettimeofday(&curtime, NULL);
                break;
            case 2:
                clock_gettime(CLOCK_MONOTONIC, &tp);
                break;
            case 3:
                clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp);
                break;
            case 4:
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
                break;
        }
    }
    return 0;
}

/*
playground » time ./c1 10000000 1
./c1 10000000 1  0,16s user 0,00s system 99% cpu 0,156 total
playground » time ./c1 10000000 2
./c1 10000000 2  0,19s user 0,00s system 99% cpu 0,189 total
playground » time ./c1 10000000 3
./c1 10000000 3  2,46s user 1,79s system 99% cpu 4,255 total
playground » time ./c1 10000000 4
./c1 10000000 4  2,50s user 1,76s system 99% cpu 4,260 total
*/

You can see that the CPU-time clocks add 15-20x overhead per call. I am assuming the answer lies in the kernel; I will be looking into that.

emeryberger commented 4 years ago

Updated results now up via https://github.com/emeryberger/scalene/commit/a7afa197e5e3183e2ff9ba1e8ed36cd68bb8dce1.