se2p / pynguin

The PYthoN General UnIt Test geNerator is a test-generation tool for Python
https://www.pynguin.eu
MIT License

Add support for running the test cases in subprocesses #74

Open BergLucas opened 3 months ago

BergLucas commented 3 months ago

Is your feature request related to a problem? Please describe.

Over the last four months, I've had the opportunity to write a master's thesis on improving automated test-case generation for machine-learning libraries. During this project, I discovered several limitations and bugs in Pynguin that I've already reported. Some bugs, however, could not be fixed easily: segmentation faults, memory leaks, floating-point exceptions, and deadlocks on Python's GIL, which come not from Pynguin but from the module under test. Because Pynguin's current architecture executes test cases in threads rather than subprocesses, these bugs crash the main process and therefore Pynguin itself. I have observed such crashes on very popular libraries such as numpy, pandas, polars, scipy and sklearn, and they could also occur in other modules, as I've only focused on these few.

Describe the solution you'd like

To solve the problem, I propose changing some aspects of Pynguin's architecture so that test cases can be executed in subprocesses, controlled by a Pynguin parameter. I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread. I therefore think the changes in my prototype first need to be rethought to limit this data transfer and regain speed.

Describe alternatives you've considered

To the best of my knowledge, the only way to detect segmentation faults, memory leaks, and similar crashes is to use subprocesses, so I don't see any alternative for dealing with them.

Additional context

With this new architecture, it would also become possible to create error-revealing test cases, as Randoop does: by checking the exit code of a subprocess, a crash can be detected and a test case created to reproduce it. This is already implemented in my prototype and has already helped me find a few bugs in some libraries.
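The exit-code check can be sketched with the standard multiprocessing module. This is a minimal illustration, not Pynguin's actual executor: `run_case` and `classify_execution` are hypothetical names, and the `fork` start method is chosen only to keep the sketch self-contained.

```python
import multiprocessing as mp
import signal


def run_case(case):
    # Hypothetical stand-in for executing one generated test case.
    case()


def classify_execution(case, timeout=30.0):
    """Run a test case in its own process and report how it ended."""
    ctx = mp.get_context("fork")  # keeps the sketch self-contained
    proc = ctx.Process(target=run_case, args=(case,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():            # e.g. a GIL deadlock: kill and report
        proc.kill()
        proc.join()
        return "timeout"
    if proc.exitcode == 0:
        return "ok"
    if proc.exitcode < 0:          # killed by a signal, e.g. -11 is SIGSEGV
        return "crashed: " + signal.Signals(-proc.exitcode).name
    return f"exited with code {proc.exitcode}"
```

A case that "crashed" can then be turned into an error-revealing test that reproduces the signal, while the main process survives.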

nickodell commented 3 months ago

> I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread, so I think the changes in my prototype first need to be rethought to limit this data transfer and regain speed.

I see you're using the spawn start method. I wonder if you could improve performance here by using the forkserver start method and calling set_forkserver_preload() to preload the module under test.

I know that in SciPy, some modules can take quite a while to import. For example, on my computer it can take 0.4 seconds just to run import scipy.signal and nothing else. It should be possible to import the module once and reuse that work, since the computation that happens during import is unlikely to be interesting from a testing perspective.

More info: https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
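The suggested approach can be sketched as follows. This is an assumption about how Pynguin could use the API, not existing Pynguin code; `scipy.signal` is the motivating example, but the sketch preloads the stdlib `decimal` module so it runs anywhere.

```python
import multiprocessing as mp


def worker_has_preload(_):
    # Workers are forked from the fork server, so they inherit its
    # sys.modules: the preloaded module is already imported here.
    import sys
    return "decimal" in sys.modules


def main():
    ctx = mp.get_context("forkserver")
    # Import the slow module once in the fork server; in Pynguin's case
    # this would be the module under test, e.g. "scipy.signal".
    ctx.set_forkserver_preload(["decimal"])
    with ctx.Pool(processes=2) as pool:
        return pool.map(worker_has_preload, range(4))


if __name__ == "__main__":
    print(main())
```

Since set_forkserver_preload() must be called before the server starts, it has to run before the first Pool or Process is created from that context.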

stephanlukasczyk commented 3 months ago

Wow, first of all, this is impressive!

I agree with all of your points; the way the current execution is built is probably not the best it could be. I am very much willing to integrate this into Pynguin: first, because I believe it could overcome the limitations and make Pynguin more flexible, and second, because the track record (five found bugs) is already quite nice. I am pretty sure there is more to follow, and I need to maintain a list of found bugs at some point.

Regarding the slow-down: do you have some average numbers? 40x is a massive slowdown, I agree, but if this is only a rare worst case, the picture would probably look different. Also, could you perhaps try @nickodell's suggestion (thanks @nickodell for suggesting the forkserver) to see whether it brings an improvement?

A random additional thought: even if execution in a subprocess is slower, do you see any potential to parallelise these executions? Currently, Pynguin uses threads to isolate executions but runs the test cases only sequentially. If subprocesses made it easy to parallelise test-case executions, the overhead might no longer be that critical.

BergLucas commented 3 months ago

Hi @stephanlukasczyk,

I do have some numbers regarding the slow-down, but an average is a bit hard to interpret: the 40x speed decrease was calculated from the few cases where Pynguin didn't crash, so I felt that averaging over all cases wasn't very representative of the true speed decrease. If you're interested, here's my master's thesis; chapters 5 and 7 contain averages of the number of iterations achieved per module. The thesis also tried to implement a plugin system that allows testers' knowledge to be incorporated easily into the test-generation algorithm. Initially, the architectural change was just a necessary improvement to be able to run the plugin system on machine-learning libraries, but I thought it was the most interesting change to add to Pynguin at the moment.

Regarding the forkserver, it might be interesting to check whether it has an impact. However, I noticed that even with just a spawn start method, what took the most time was not sending data to the subprocess but transferring data from the subprocess back to the main process. I haven't checked this in detail yet, but I think it's because the ExecutionResult class holds a lot of references and, in the end, the whole test cluster is transferred back to the main process.
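The cost of dragging a large object graph across the process boundary can be illustrated with plain pickle, which is what multiprocessing uses under the hood. `FatResult` and `SlimResult` below are hypothetical stand-ins, not Pynguin's ExecutionResult; the point is only that a flat summary serialises orders of magnitude smaller than a result that references the whole cluster.

```python
import pickle


class FatResult:
    """Hypothetical result object that keeps a reference into a large
    object graph (a big list standing in for the test cluster), all of
    which gets pickled when sent back to the main process."""
    def __init__(self, cluster):
        self.cluster = cluster
        self.covered_lines = {1, 2, 5}


class SlimResult:
    """Flat summary carrying only what the main process needs."""
    def __init__(self, fat):
        self.covered_lines = set(fat.covered_lines)


cluster = list(range(100_000))      # stand-in for the test cluster
fat_size = len(pickle.dumps(FatResult(cluster)))
slim_size = len(pickle.dumps(SlimResult(FatResult(cluster))))
print(fat_size, slim_size)          # the fat payload dwarfs the slim one
```

Returning only such a summary from the subprocess would avoid re-serialising the cluster on every execution.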

Regarding the parallelisation, I did try to implement it at one point and noticed that, most of the time, it was faster to start a single subprocess, run every test case in it, and fall back to running each test case in a separate subprocess only when a crash was detected. That's what I did to improve the speed of the "TestSuiteChromosomeComputation" class. It could still be interesting to parallelise the "TestCaseChromosomeComputation" class, but I think that would require a lot of changes, and I didn't do it because I didn't have much time.
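The fallback strategy described above can be sketched like this. It is a minimal illustration assuming simple callables as test cases; `_run_batch` and `execute_suite` are hypothetical names, not the actual TestSuiteChromosomeComputation code, and the `fork` start method keeps the sketch self-contained.

```python
import multiprocessing as mp


def _run_batch(cases, conn):
    # Hypothetical stand-in for executing a list of test cases; the
    # results are sent back over a pipe (fine for small payloads).
    conn.send([case() for case in cases])


def execute_suite(cases):
    """Fast path: one subprocess for the whole suite. On a crash,
    fall back to one subprocess per test case to isolate the culprit."""
    ctx = mp.get_context("fork")
    parent, child = ctx.Pipe()
    proc = ctx.Process(target=_run_batch, args=(cases, child))
    proc.start()
    proc.join()
    if proc.exitcode == 0:
        return parent.recv()
    results = []
    for case in cases:             # slow path: isolate each case
        p_end, c_end = ctx.Pipe()
        single = ctx.Process(target=_run_batch, args=([case], c_end))
        single.start()
        single.join()
        results.append(p_end.recv()[0] if single.exitcode == 0
                       else "crashed")
    return results
```

In the common case where nothing crashes, only one subprocess is started per suite, which is what keeps the overhead acceptable.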

stephanlukasczyk commented 3 months ago

Hi @BergLucas ,

What I could do is run Pynguin from your branch with both the current executor and your subprocess-based executor on a benchmark that I have. I'll set this up and run it, and will report back numbers as soon as they arrive.

Also, thank you for your thesis; I'll have a look. Your comments regarding the data that has to be transferred between processes are quite helpful. Avoiding the transfer of large amounts of data might be achievable, but first I want to see how the executors behave on my benchmark. I'll add results to this issue as soon as I have any.

BergLucas commented 3 months ago

In case you missed it, the subprocess-based executor is only used when Pynguin is run with the --subprocess parameter, so don't forget to pass it, @stephanlukasczyk. Also, you should probably use simple assertions instead of mutation-analysis ones, or generating the assertions could take hours on some modules at the moment.

stephanlukasczyk commented 3 months ago

Finally found the time to run Pynguin on the latest version of your branch, @BergLucas. The configurations use DynaMOSA in its default settings, with assertion generation deactivated and a generation timeout of 600s. DEFAULT_EXECUTION refers to the current default execution; SUBPROCESS_EXECUTION is your proposed approach. First, some statistics (basically the describe function of a pandas DataFrame) from the raw data:

Coverage Overview
                       count      mean       std  min       25%       50%       75%  max
ConfigurationId                                                                         
DEFAULT_EXECUTION     9514.0  0.685906  0.308685  0.0  0.469880  0.750000  1.000000  1.0
SUBPROCESS_EXECUTION  9332.0  0.638031  0.327191  0.0  0.333333  0.666667  0.972973  1.0

Iterations Overview
                       count         mean          std  min    25%     50%     75%     max
ConfigurationId                                                                           
DEFAULT_EXECUTION     9514.0  1588.290309  1397.305685  0.0  57.25  1446.0  2665.0  7071.0
SUBPROCESS_EXECUTION  9332.0    72.648093    77.326008  0.0  11.00    52.0   106.0   467.0

Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     9514.0  2.772500  2.289745  0.0  0.422727  2.617743  4.512063  11.765391
SUBPROCESS_EXECUTION  9332.0  0.124591  0.128730  0.0  0.022979  0.091211  0.181255   0.775748

Because I know that there are always some failures, I filter the raw data and remove all modules that did not produce a result in all repetitions (I did 15 repetitions per configuration):

Remove modules that did not yield 15 iterations
Before: 645
After:  574

Coverage Overview
                       count      mean       std  min       25%       50%  75%  max
ConfigurationId                                                                    
DEFAULT_EXECUTION     8610.0  0.704771  0.302880  0.0  0.500000  0.774194  1.0  1.0
SUBPROCESS_EXECUTION  8610.0  0.658459  0.321583  0.0  0.383065  0.722222  1.0  1.0

Iterations Overview
                       count         mean          std  min  25%     50%     75%     max
ConfigurationId                                                                         
DEFAULT_EXECUTION     8610.0  1640.112544  1429.615295  0.0  9.0  1561.0  2759.0  7071.0
SUBPROCESS_EXECUTION  8610.0    74.888850    79.414983  0.0  7.0    53.0   111.0   467.0

Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     8610.0  2.871112  2.335756  0.0  0.343301  2.819468  4.673877  11.765391
SUBPROCESS_EXECUTION  8610.0  0.128680  0.132070  0.0  0.019231  0.094059  0.188954   0.775748

Relative Coverage
                      BranchCoverage  RelativeCoverage
ConfigurationId                                       
DEFAULT_EXECUTION           0.704771          0.916433
SUBPROCESS_EXECUTION        0.658459          0.730026

Finally, I've also plotted coverage over the 600s generation time:

[Figure: coverage over generation time]

The plot and the results show what you've already noted: there is a large slow-down, which also reduces coverage significantly.

BergLucas commented 3 months ago

Hi @stephanlukasczyk ,

That's very interesting. I would have expected the number of modules that did not yield 15 iterations to be higher for the default execution because of the types of crashes I mentioned before, but that doesn't seem to be the case. In theory, the execution using subprocesses can't crash unless there's a bug in the implementation, so I guess the branch isn't super stable yet, as I suspected.

stephanlukasczyk commented 3 months ago

If you are interested and have time for debugging, I can provide you the full raw results, including log files etc.

nickodell commented 3 months ago

Hello, I've fixed two of the crash bugs you identified in SciPy. I think this is very valuable in terms of identifying surprising corner-cases in the library.

By the way, I would be interested in looking into the performance issue with subprocess-based execution. Would you mind showing me how to set up your branch to test SciPy? I took a look at the docs, but I'm not sure how to apply that in the case where I'm testing a package that needs to be built before I can test it.

stephanlukasczyk commented 3 months ago

Thank you, @nickodell, for offering to investigate the performance.

What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into it, and also install the respective library (scipy in this case). I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

If @BergLucas uses a different approach, he probably could elaborate, too.

nickodell commented 3 months ago

> What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into it, and also install the respective library (scipy in this case). I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

Thanks - when you do this, what do you set the --project-path parameter to?

stephanlukasczyk commented 3 months ago

I usually also have a source-code checkout of the subject under test (here, scipy) lying around, and I point the parameter to it.

BergLucas commented 3 weeks ago

Hi @stephanlukasczyk, I've been busy over the holidays, so I haven't had time to work on Pynguin, but I should now be able to carry on with my research, and I intend to keep looking for ways to improve Pynguin. I would therefore be very interested in the raw data you mentioned in your previous comment, if you still have it.