graph-algorithms / edge-addition-planarity-suite


Update `test_table_generation_with_numInvalidOK.py` to save EDA progress with option to resume analysis #108

Open wbkboyer opened 3 days ago

wbkboyer commented 3 days ago

Background

The Edge-Deletion Analyzer originally developed in #77 was generalized in #79 to check the results of K_{2, 3} search, K_{3, 3} search, or K_4 search to see if, for each .g6-encoded graph in a .g6 input file, the homeomorph search erroneously returned OK when in fact the graph contains a subgraph homeomorphic to the target homeomorph. Then, #80 introduced test_table_generation_with_numInvalidOK.py, which runs edge-deletion analysis on all .g6 files for each edge-count of each graph order for which we wish to generate the Test Tables (#65).

Problem statement

Initially, I kicked off test_table_generation_with_numInvalidOK.py for N=5 through 11, but discovered after a few days that the program had terminated, having only produced the updated test tables for N=5 through 9. I thought this was due to Windows updating and then rebooting, since I had to log back into my workstation and observed that the terminal window had closed, so I restarted the process only for N=10 and 11.

test_table_generation_with_numInvalidOK.py successfully produced the test table for N=10, indicating the numInvalidOKs had been driven down to 0 (see #107). Sometime during the processing for N=10, I noted the application was using roughly 6GB of RAM. However, during the processing of N=11, the memory footprint continued to increase at an even greater pace, ballooning up to 18GB while working through the graphs with M=19, which alone took roughly 47 hours to process:

2024-10-19 10:48:51,288 - [INFO] - test_table_generation_with_numInvalidOK._call_edge_deletion_analysis - NUMBER OF INVALID OKs in 'n11.m19.g6' is 0; took 42263.218750s process time and 170562.578561s total time

I decided to terminate the program while it was processing N=11, M=20 and had reached 22GB of RAM: even though it likely could have continued (with Windows handling paging, up to an additional 64GB for the max pagefile size), there was no guarantee we'd make it through processing all graphs in each of the .g6 infiles for the "fat middle" of edge-counts.

This means that if we want to determine the numInvalidOKs for N=11 using edge-deletion analysis, we have to start test_table_generation_with_numInvalidOK.py over from scratch and hope that we don't run out of memory plus pagefile, since there's currently no way to resume after execution has been halted.

Possible causes

When examining test_table_generation_with_numInvalidOK.py, I noted the following:

  1. I am using the with open() Python context manager to create a new file containing the adjacency list representation for each graph in the input file; however, this is still being done for a very large number of graphs, e.g. 8,640,134 graphs for N=11, M=19, as verified by geng with -u (generate + count without output):
    % ./geng 11 19:19 -u
    >A ./geng -d0D10 n=11 e=19
    >Z 8640134 graphs generated in 2.19 sec

    So even though context managers are meant to safeguard against memory leaks, there might be a bug that's only apparent with such a large number of file writes.

  2. For each graph, if the graph is determined to be free of the target homeomorph but is non(outer)planar, I'm calling subprocess.run() for each graph-minus-one-edge (for however many edges are in that graph) to see if we can find the target homeomorph; see the sketch after this list. This means we're spawning an INSANE number of processes, which, according to this Reddit post, might be the source of a memory leak:

    I have determined that the memory leak is due to a Page Table Entry(PTE) leak of processes that have been killed. While the processes spawned by Popen() have correctly executed and terminated, Windows, for some reason, keeps their PTEs in RAM, resulting in 32K~36K loss per instance that adds up to tens of GBs over a few hours. C++ showed the same symptom. This is a Windows problem.
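
For concreteness, the per-graph pattern described in item 2 looks roughly like the following sketch. The executable path, its arguments, and the helper name are placeholders for illustration only, not the actual invocation used by test_table_generation_with_numInvalidOK.py:

```python
import subprocess
from pathlib import Path


def eda_check_graph(adj_list_path: Path, num_edges: int) -> bool:
    """Illustrative only: one child process is spawned per graph-minus-one-edge.

    With 8,640,134 graphs for N=11, M=19 alone, and one subprocess per edge of
    every qualifying graph, the number of short-lived child processes per
    edge-count easily reaches the tens of millions.
    """
    for edge_index in range(num_edges):
        # Hypothetical command line; the real EDA arguments differ.
        result = subprocess.run(
            ["path/to/planarity", "--eda-args-here", str(adj_list_path), str(edge_index)],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            # Target homeomorph found in this graph-minus-one-edge.
            return True
    return False
```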

Proposed solution

Since the cause is likely an OS-level issue rather than one with the Python subprocess module itself, it would be fruitful to have some way to save state after finishing processing an edge-count for a given graph order, and then to have an option to restore from that saved state and resume processing.
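
As a minimal sketch of the save/restore piece, assuming self._numInvalidOKs maps each command to a dict of edge-count -> numInvalidOK (the helper names save_progress()/load_progress() are made up for illustration):

```python
import json
from pathlib import Path


def save_progress(resume_path: Path, numInvalidOKs: dict) -> None:
    """Write the progress snapshot atomically: dump to a temporary file,
    then rename it over the resume file, so a crash mid-write cannot
    corrupt the saved state."""
    tmp_path = resume_path.with_name(resume_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as outfile:
        json.dump(numInvalidOKs, outfile, indent=2)
    tmp_path.replace(resume_path)


def load_progress(resume_path: Path) -> dict:
    """Restore the snapshot, converting edge-count keys back to int
    (JSON object keys are always strings)."""
    with open(resume_path, "r", encoding="utf-8") as infile:
        raw = json.load(infile)
    return {
        command: {int(edge_count): count for edge_count, count in counts.items()}
        for command, counts in raw.items()
    }
```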

wbkboyer commented 2 days ago

Presently, I allow one to specify a list of -n/--orders of graphs for which one wishes to perform the edge-deletion analysis on each graph in each file (per edge-count per order) and to integrate the numInvalidOKs for each edge-count into the final Test Table for that graph order. If I add another command-line parameter (-r/--resume) indicating that we wish to resume EDA from a save-state, then an error must be thrown if more than one graph order is specified.
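
A rough sketch of that argument handling with argparse; the -n/--orders and -r/--resume option names follow the description above, while everything else (types, help text, defaults) is illustrative:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Generate Test Tables with numInvalidOK via edge-deletion analysis."
    )
    parser.add_argument(
        "-n", "--orders",
        type=int, nargs="+", required=True,
        help="Graph order(s) for which to run edge-deletion analysis.",
    )
    parser.add_argument(
        "-r", "--resume",
        metavar="RESUMEFILENAME",
        help="Resume edge-deletion analysis from a previously saved state.",
    )
    args = parser.parse_args(argv)
    # Resuming only makes sense for a single graph order.
    if args.resume and len(args.orders) > 1:
        parser.error("-r/--resume may only be used with a single graph order (-n/--orders).")
    return args
```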

If the -r flag is used, validate the <RESUMEFILENAME> and either error out or restore self._numInvalidOKs from the <RESUMEFILENAME> in TestTableGenerationWithInvalidOKs's __init__(). Currently, _determine_numInvalidOKs()'s outer loop iterates over the commands in self.perform_eda_for_commands, and the inner loop iterates over all edge-counts for the given graph order, calling _call_edge_deletion_analysis() for each. If self._numInvalidOKs is not {}, we need to determine the maximum edge-count already processed and start the inner loop at the next edge-count (rather than from 0). Then, once we have processed a given edge-count, we should serialize the self._numInvalidOKs dict using the json module (the pickle module is another option, but it has drawbacks, including vulnerability to arbitrary code execution) and refresh the <RESUMEFILENAME> before proceeding with the next iteration.
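
Putting the pieces together, the restore-and-skip logic inside _determine_numInvalidOKs() might look roughly like the sketch below; self._max_num_edges, self._resume_path, and the _call_edge_deletion_analysis() signature are assumptions for illustration, and save_progress() is the JSON helper sketched earlier in this issue:

```python
def _determine_numInvalidOKs(self) -> None:
    for command in self.perform_eda_for_commands:
        # Resume point: the highest edge-count already recorded for this
        # command; an empty dict means we start from edge-count 0.
        already_processed = self._numInvalidOKs.get(command, {})
        start_edge_count = (max(already_processed) + 1) if already_processed else 0

        for num_edges in range(start_edge_count, self._max_num_edges + 1):
            num_invalid_OK = self._call_edge_deletion_analysis(command, num_edges)
            self._numInvalidOKs.setdefault(command, {})[num_edges] = num_invalid_OK

            # Checkpoint after every completed edge-count so that halting
            # execution loses at most one edge-count of work.
            if self._resume_path is not None:
                save_progress(self._resume_path, self._numInvalidOKs)
```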