mantidproject / mantidimaging

Graphical toolkit for neutron imaging.
https://mantidproject.github.io/mantidimaging
GNU General Public License v3.0

Make MI Windows compatible #1331

Closed rbauststfc closed 2 years ago

rbauststfc commented 2 years ago

Currently MI is only compatible with Linux. It would be good for it to run on Windows and potentially Mac too.

Following some initial investigation, the below is a suggested plan for the issue (see notes in the issue comments for more detail on the steps given):

1) Address the issues under Setting up the mantidimaging-dev environment - change the versions of the packages we're using so that our mantidimaging-dev conda environment is Windows compatible.

2) Set the style of the application to 'Fusion', as detailed under GUI features - this is a very small change that shouldn't make any visible difference on Linux, but would mean the application had a consistent look and feel when run on Windows.

3) Address the issues with checking the CUDA installation detailed in Running the application - update the code so it works on both Windows and Linux.

These first steps would mean developers could get Mantid Imaging up and running on both Windows and Linux without having to tweak settings for Windows. The application won't yet be fully functional on Windows.

4) Test the reconstruction speed on some other Windows machines to get a better sense of how slow it really is. This will help us judge how viable running the system on Windows is, or whether we need to talk to the CIL developers about algorithm speeds on Windows.

5) Work through the steps given in the suggested approach for resolving the multiprocessing issues. Each of the three steps should be possible to do as a separate PR. Step 1 will update the way we're using shared memory to lay the foundations for Windows compatibility, but the application will still only work with forked processes at this point. Step 2 will change the application so that it works with both spawn and fork. At that point the multiprocessing should work on both Windows and Linux; however, performance on Windows will be very slow. Step 3 will address the performance issues on Windows, but will require a design decision about whether we want to start using spawn on Linux and whether we want to create processes up front on both Linux and Windows.

6) Fix the remaining issues listed under GUI features and any others that arise from testing on Windows. After this step the system should be fully functional on Windows.

rbauststfc commented 2 years ago

Outcomes of investigation into Windows compatibility, Feb 2022:

Summary

The major obstacle to Windows compatibility is that our multiprocessing functionality is written to work with fork but not spawn. When processes are spawned on Windows, everything is started from scratch within the new process, including running the main method that was responsible for the call that created the new process (i.e. starting up Mantid Imaging). This takes time and means performance is poor. In addition to this, data needs to be passed into the new process, and this takes a significant amount of time with the large arrays MI works with. We don't have these problems on Linux because processes are created using fork (not available on Windows), which takes a lightweight copy of the process (including the data) and proceeds from there.
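As a small illustration (not MI code) of why spawn is more expensive: a spawned child re-imports the parent's main module from scratch, so all module-level work, including MI's large set of imports, runs again in every worker; only code under the __main__ guard is skipped:

import multiprocessing

def work(x):
    return x * x

if __name__ == '__main__':  # not re-executed in the spawned child processes
    with multiprocessing.get_context('spawn').Pool(4) as pool:
        print(pool.map(work, range(8)))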

We would need to look into the following to resolve these key issues:

1) Sharing data between processes efficiently. This is the most significant cause of the performance issues, so it would be good to look at this first before proceeding further. We can use shared memory to allow the processes to access the same data and, from testing, this appears to resolve the problem. We would need to look at how best to implement this in our code; however, the existing create_array method in core/parallel/utility.py already creates shared memory in the way we need. The only issue is that we then read the shared memory and return it as a numpy array, which is not shared. We would need to separate out the creation of the shared memory from its retrieval as a numpy array, so that we can "pass" only the shared memory into the process and then read it into a numpy array from there (a sketch of this split follows the list), i.e. the following code would be executed from within the process:

numpy_array = np.frombuffer(shared_memory_array.get_obj(), dtype).reshape(shape)

2) Even without passing data, creating new processes is still too slow on Windows because of the time required to start up Mantid Imaging in each one (this seems to be predominantly due to the very large number of imports). We could look into creating a pool of processes up front and see if this improves things by reducing the number of times we create pools and by doing it as part of start-up rather than at the point of use. Another possibility would be to try to self-contain the multiprocessing parts of the application so that they can be called as a script from MI. By starting the multiprocessing from its own main method, we could import only the things we really need for the parallel processing and cut down the start-up time. This would require research into the best approach and how much refactoring would be needed.
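Returning to point 1, a minimal sketch of the suggested split, assuming a multiprocessing.Array-backed buffer of float32 data (the function names are illustrative, not the existing MI API):

import multiprocessing
import numpy as np

def create_shared(shape):
    # Parent process: allocate only the raw shared buffer, no numpy view yet
    return multiprocessing.Array('f', int(np.prod(shape)))  # 'f' == float32

def as_numpy(shared_memory_array, shape, dtype=np.float32):
    # Worker process: wrap the shared buffer in a numpy array without copying
    return np.frombuffer(shared_memory_array.get_obj(), dtype).reshape(shape)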

Detailed findings on Windows compatibility:

Setting up the mantidimaging-dev environment

Running the application

GUI features

import sys
from PyQt5.QtWidgets import QApplication

def setup_application():
    q_application = QApplication(sys.argv)
    q_application.setStyle('Fusion')

I haven't noticed a difference when setting this style on Linux, but we could set it only on Windows if preferred by using platform.system() to check the system. Text on Windows may be a little small, so it might be good to look into options to increase its size slightly, but this isn't a priority.
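For example, something along these lines (a sketch, not existing MI code):

import platform

if platform.system() == 'Windows':
    q_application.setStyle('Fusion')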

Multiprocessing

The following are some bits of code that were useful in investigating the multiprocessing:

To set the multiprocessing pool to use spawn for testing purposes on Linux:

In parallel/utility.py, execute_impl:

with Pool(cores, context=multiprocessing.get_context('spawn')) as pool:
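An equivalent way to force spawn for a quick test is to build the pool from a spawn context directly, e.g.:

import multiprocessing

cores = multiprocessing.cpu_count()
pool = multiprocessing.get_context('spawn').Pool(cores)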

The following can be used to share data between spawned processes. The data passed in to initargs must be shared memory (i.e. an array created from multiprocessing.Array()) to avoid pickling of the data. This can then be read as a numpy array from within the process:

# Create an initializer function that can be passed into the pool to set the shared data for the spawned process:
def pool_init_method(shared_data):
    global data_for_process  # store in the worker's module globals so tasks can read it later
    data_for_process = shared_data
...
with Pool(8, initializer=pool_init_method, initargs=(shared_data,)) as pool:
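A worker function submitted to that pool can then read the module-level data_for_process set by the initializer. A hypothetical worker (not existing MI code), with shape and dtype assumed to be known to the worker:

def process_chunk(i):
    # Runs inside a spawned worker; data_for_process was set by pool_init_method
    arr = np.frombuffer(data_for_process.get_obj(), dtype).reshape(shape)
    return arr[i].sum()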
rbauststfc commented 2 years ago

Suggested approach to resolve multiprocessing problems:

After further investigation, I think the cleanest approach would be to create the multiprocessing pool when the application starts up. To support this, we would need to switch to use multiprocessing.shared_memory instead of multiprocessing.Array. This would allow us to access array data from within a spawned process without the performance hit from serialization and without needing the array to exist at process creation.

Below are some specific details on what I think we would need to change:

1) Refactor the create_array method in core/parallel/utility.py so that it uses shared_memory and returns a tuple containing both the array and the memory that was created (a rough sketch of this and step 2 follows the list). It will be necessary to move away from using multiprocessing.Array because it cannot be passed into an already running process. I believe that continuing to return the array should help reduce the amount of refactoring required elsewhere, while returning the memory is necessary so that we can assign it to a variable and prevent premature garbage collection. In most cases the shared memory array will be associated with an Images object, so I would recommend adding a field to the Images class that can hold the corresponding memory location.

2) Refactor core/parallel/shared.py so that we pass in a list of tuples giving the shared memory name, array shape and dtype. This will allow the methods to look up the shared array from shared memory instead of using the global variable shared_list, which is empty when processes are created via spawn. The operations and other locations that use multiprocessing will need to be updated to work with these changes.

3) A pool_manager module should be added that provides a global variable for referencing the application's process pool, making it easily accessible elsewhere in the application. The new module should provide methods for creating and starting up the pool and for closing it when the application ends. These methods would be called around the call to gui.execute() in the application's main() method so that the pool is created at application start. The execute_impl method in core/parallel/utility.py would then be updated to use the pool from the pool_manager instead of creating its own.
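A rough sketch of what steps 1 and 2 could look like with multiprocessing.shared_memory (names and signatures are assumptions, not the final implementation):

import numpy as np
from multiprocessing import shared_memory

def create_array(shape, dtype=np.float32):
    # Step 1: allocate a shared memory block (auto-generated name) and return both
    # the numpy view and the SharedMemory handle; the caller keeps the handle alive
    # (e.g. on the Images object) and passes mem.name, shape and dtype to workers.
    size = int(np.prod(shape)) * np.dtype(dtype).itemsize
    mem = shared_memory.SharedMemory(create=True, size=size)
    return np.ndarray(shape, dtype=dtype, buffer=mem.buf), mem

def lookup_array(name, shape, dtype):
    # Step 2: inside a worker, attach to the existing block by name and wrap it
    # without copying; the handle is returned so it isn't garbage collected early.
    mem = shared_memory.SharedMemory(name=name)
    return np.ndarray(shape, dtype=dtype, buffer=mem.buf), mem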

We would need to be careful about memory leaks and check that processes are closed as expected when the application terminates. We would also need to check the effect of these changes on Linux and decide if we're going to use spawn everywhere (it can be specified on Linux) or continue using fork on Linux.

rbauststfc commented 2 years ago

Here are some code snippets that could be useful in implementing the above:

1) Create the pool and start the pool processes up front. Something like the following as a pool manager module, which can define methods for starting and closing pools:

import multiprocessing
from logging import getLogger
from multiprocessing import Pool

LOG = getLogger(__name__)

pool = None

def create_and_start_pool():
    LOG.info('Creating process pool')
    cores = multiprocessing.cpu_count()
    global pool
    pool = Pool(cores)
    LOG.info('Starting up processes in pool')
    # We need a function to call to start the processes, but the function itself doesn't need to do anything.
    # If we don't do this then the processes only start when the pool is first used later in the application,
    # which affects performance.
    pool.map_async(_do_nothing, range(cores))

def end_pool():
    pool.close()
    pool.join()

def _do_nothing(i):
    pass

2) Use shared memory to allow image data to be accessed from the already started processes:

import numpy as np
from multiprocessing import shared_memory

mem = shared_memory.SharedMemory(name='test', create=True, size=size)  # parent: create a named block
sm = shared_memory.SharedMemory(name='test')                           # worker: attach to it by name
shared_array = np.ndarray(shape, dtype=dtype, buffer=sm.buf)           # wrap it in a numpy array without copying
rbauststfc commented 2 years ago

On the reconstruction side of Mantid Imaging, I can't see anywhere in our code where we're using multiprocessing - I think any multiprocessing/multithreading is happening within the calls to the libraries we're using. Running reconstructions on Windows seems to work OK, but is significantly slower than on Linux; reconstructing the volume for the flowers dataset using FBP takes approximately 2 minutes 30 seconds on Windows (on my laptop) vs 40 seconds to do the same on a Linux instance. It would be good to check the reconstruction performance on other Windows machines.

rbauststfc commented 2 years ago

Report on step 4 of the issue plan (reconstruction speeds on Windows):

Machines need a sufficient amount of RAM, a high enough spec GPU and NVIDIA drivers to successfully run reconstructions. Once we're ready to provide Windows support it might be good to add some more details on device requirements to our documentation.

On a Windows machine with a higher spec GPU, driver 460.89 and CUDA v11.2, FBP reconstruction was about two to three times slower than on Linux (depending on the size of the dataset), and Astra and CIL were marginally slower. A more up-to-date NVIDIA driver (v511.79) made FBP slower and had no significant effect on the other two methods.

We concluded that this was still acceptable performance on Windows for the software to be usable. At some point it might be helpful to look into performance as a task in itself to see if there are any areas where we can speed things up, which would likely benefit all platforms.