Outcomes of investigation into Windows compatibility, Feb 2022:
Summary
The major obstacle to Windows compatibility is that our multiprocessing functionality is written to work with fork but not spawn. When processes are spawned on Windows, everything is started from scratch within the new process, including running the `main` method that was responsible for the call that created the new process (i.e. starting up Mantid Imaging). This takes time and means performance is poor. In addition, data needs to be passed into the new process, and this takes a significant amount of time with the large arrays MI works with. We don't have these problems on Linux because processes are created using fork (not available on Windows), which takes a lightweight copy of the process (including the data) and proceeds from there.
We would need to look into the following to resolve these key issues:
1) Sharing data between processes efficiently. This is the most significant cause of the performance issues, so it would be good to look at this first before proceeding further. We can use shared memory to allow the processes to access the same data and, from testing, this appears to resolve the problem. We would need to look at how best to implement this in our code; however, the existing `create_array` method in core/parallel/utility.py creates shared memory in the way we need. The only issue is that it then reads the shared memory and returns it as a numpy array, which is not shared. We would need to separate out the creation of the shared memory from its retrieval as a numpy array, so that we can "pass" only the shared memory into the process and then read it into a numpy array from there, i.e. the following code would be executed from within the process (a fuller sketch of this split is given after this list):

```python
numpy_array = np.frombuffer(shared_memory_array.get_obj(), dtype).reshape(shape)
```
2) Even without passing data, the time taken to create new processes is still too slow on Windows due to the time required to start up Mantid Imaging (this seems to be predominantly because of the very large number of imports). We could look into creating a pool of processes up front and see if this improves things, both by reducing the number of times we create pools and by doing the creation as part of start-up rather than at the point of use. Another possibility would be to try to self-contain the multiprocessing parts of the application so that they can be called as a script from MI. By starting the multiprocessing from its own `main` method, we could import only the things that are really needed for the parallel processing and cut down the start-up time. This would require research into the best approach and into how much refactoring would be required in the code.
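To make the data-sharing idea in point 1 concrete, here is a rough sketch of separating creation from retrieval (the helper names here are hypothetical, not the existing `create_array` API):

```python
import multiprocessing

import numpy as np


def create_shared_buffer(shape, dtype=np.float32):
    # Create the raw shared memory; this object, not a numpy array, is what
    # gets "passed" into the new process
    size = int(np.prod(shape))
    return multiprocessing.Array(np.ctypeslib.as_ctypes_type(dtype), size)


def read_as_numpy(shared_memory_array, shape, dtype=np.float32):
    # Executed from within the process: wrap the shared buffer as a numpy
    # array without copying it
    return np.frombuffer(shared_memory_array.get_obj(), dtype).reshape(shape)
```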
Detailed findings on Windows compatibility:
Setting up the mantidimaging-dev environment
`astra-toolbox=1.9.9.dev4`, as specified in conda/meta.yaml, doesn't have a Windows equivalent. The closest is `1.9.9.dev6`, which can be used instead.
Creating the environment fails with the solver error `nothing provides openssl >=1.1.1m,<1.1.2a needed by python-3.9.10-h6cc024a_1`.
The compatible version of cudatoolkit is 10.2.*, which is pulled in automatically by cupy if omitted from the conda/meta.yaml file, but can also be specified in the yaml if preferred.
Running the application
`python` must be used in commands instead of `python3`. It would be good to check that the pre-commit commands still work OK on Windows.
The code that checks for a working CUDA installation in the `_cuda_is_present` method of core/utility/cuda_check.py needs updating to work on Windows. While testing I got around this by changing the method to return True on Windows, which allowed me to run all the reconstructions that require CUDA. Note that nvidia-smi.exe isn't added to the path during CUDA installation on Windows. It was located at C:\Program Files\NVIDIA Corporation\NVSMI on my machine, so I was able to add it manually. I'm not sure if this is required for anything other than the initial CUDA check, but I suspect not.
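One possible cross-platform approach for the check (a sketch only, not the current `_cuda_is_present` implementation, and assuming that locating nvidia-smi on the path is an acceptable test):

```python
import shutil


def _cuda_is_present() -> bool:
    # shutil.which resolves 'nvidia-smi' to 'nvidia-smi.exe' on Windows, so
    # the same lookup works on both platforms, provided the executable's
    # directory has been added to the PATH
    return shutil.which('nvidia-smi') is not None
```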
GUI features
The style of the application should be set to 'Fusion' so that it has a consistent look and feel on Windows. This can be done in the `setup_application` method of gui/gui.py as follows:

```python
def setup_application():
    q_application = QApplication(sys.argv)
    q_application.setStyle('Fusion')
```
I haven't noticed a difference when setting this style on Linux, but we could set it for Windows only if preferred, by using `platform.system()` to check the OS (see the sketch at the end of this section). Text on Windows may be a little small, so it might be good to look into options to increase its size slightly, but this isn't a priority.
The background image in the welcome/about dialog doesn't appear.
There is a question mark button in the top right of dialog title bars that does nothing.
The Cancel and OK buttons are the other way around in dialogs. This is a Windows default, so is something to note but no action needed.
When another window is open, clicking on the Main window doesn't bring it to the front. Also minimising the Main window causes all open windows to minimise.
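If we did want to restrict the style change to Windows, a minimal sketch of the `platform.system()` check (assuming the PyQt5 QApplication used by the GUI):

```python
import platform
import sys

from PyQt5.QtWidgets import QApplication


def setup_application():
    q_application = QApplication(sys.argv)
    # Only apply the Fusion style on Windows; leave the Linux default alone
    if platform.system() == 'Windows':
        q_application.setStyle('Fusion')
    return q_application
```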
Multiprocessing
Several operations fail because of the spawn/fork issue: flat fielding, remove outliers, monitor normalisation, ROI normalisation, arithmetic, Gaussian, median without using the GPU, rebin, rotate stack, remove all stripes, and most of the other stripe removal filters (I assume NaN removal doesn't work either). These throw a `list index out of range` error because they try to access the `shared_list` variable from core/parallel/shared.py, which has no value when it has been created in the new process. The Correlate 0 and 180 functionality on the recon window has the same issue.
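The mechanism can be reproduced in isolation: under spawn the child process re-imports the module, so a module-level global populated in the parent is empty in the worker. A minimal standalone sketch (not our actual code):

```python
import multiprocessing

shared_list = []  # populated in the parent, but re-initialised empty in spawned children


def worker(i):
    # Under spawn this module is re-imported in the child process, so
    # shared_list is empty here and indexing it raises an IndexError
    return shared_list[i]


if __name__ == '__main__':
    shared_list.append('data')
    with multiprocessing.get_context('spawn').Pool(1) as pool:
        pool.map(worker, [0])  # IndexError: list index out of range
```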
All of the reconstructions seem to run through without error and with correct results, but a closer investigation would be good to confirm that they're not being affected by any issues.
A number of unit tests are failing with `index out of range` errors, presumably also due to the multiprocessing problems.
The following are some bits of code that were useful in investigating the multiprocessing:
To set the multiprocessing pool to use spawn for testing purposes on Linux, in parallel/utility.py, `execute_impl`:

```python
# Note: the context kwarg is accepted by the multiprocessing.pool.Pool class
# (the top-level multiprocessing.Pool factory function doesn't take it)
with Pool(cores, context=multiprocessing.get_context('spawn')) as pool:
    ...
```
The following can be used to share data between spawned processes. The data passed in to `initargs` must be shared memory (i.e. an array created from `multiprocessing.Array()`) to avoid pickling of the data. It can then be read as a numpy array from within the process:
```python
# An initializer function that can be passed into the pool to set the shared
# data for the spawned process. It must assign to a module-level global so
# that the worker functions can see the data:
def pool_init_method(shared_data):
    global data_for_process
    data_for_process = shared_data

...

with Pool(8, initializer=pool_init_method, initargs=(shared_data,)) as pool:
    ...
```
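Putting those pieces together, a self-contained example of the pattern (the names, dtype and sizes here are illustrative, not from our code):

```python
import multiprocessing
from multiprocessing import Pool

import numpy as np

data_for_process = None  # set in each spawned process by the pool initializer


def pool_init_method(shared_data):
    global data_for_process
    data_for_process = shared_data


def double_value(i):
    # View the shared buffer as a numpy array and modify it in place
    array = np.frombuffer(data_for_process.get_obj(), dtype=np.float32)
    array[i] *= 2


if __name__ == '__main__':
    shared_data = multiprocessing.Array('f', 8)  # 'f' = C float, i.e. np.float32
    np.frombuffer(shared_data.get_obj(), dtype=np.float32)[:] = np.arange(8)
    with Pool(8, initializer=pool_init_method, initargs=(shared_data,)) as pool:
        pool.map(double_value, range(8))
    print(np.frombuffer(shared_data.get_obj(), dtype=np.float32))  # values doubled in place
```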
Suggested approach to resolve multiprocessing problems:
After further investigation, I think the cleanest approach would be to create the multiprocessing pool when the application starts up. To support this, we would need to switch to using `multiprocessing.shared_memory` instead of `multiprocessing.Array`. This would allow us to access array data from within a spawned process without the performance hit from serialization and without needing the array to exist at process creation.
Below are some specific details on what I think we would need to change:
1) Refactor the `create_array` method in core/parallel/utility.py so that it uses `shared_memory` and returns a tuple containing both the array and the memory that was created (a sketch covering this step and the next is given after this list). It will be necessary to move away from using `multiprocessing.Array` because it cannot be passed in to an already running process. Continuing to return the array should help reduce the amount of refactoring required elsewhere, while returning the memory is necessary so that we can assign it to a variable to prevent premature garbage collection. In most cases the shared memory array will be associated with an `Images` object, so I would recommend adding a field to the `Images` class that can hold the corresponding memory location.
2) Refactor core/parallel/shared.py so that we pass in a list of tuples giving the memory name, array shape, and dtype. This will allow the methods to look up the shared array from shared memory instead of using the global variable `shared_list`, which is empty when processes are created via spawn. The operations and other locations that use multiprocessing will need to be updated to work with these changes.
3) Add a `pool_manager` module that provides a global variable for referencing the application's process pool, making the pool easily accessible elsewhere in the application. The new module should provide methods for creating and starting up the pool and for closing it when the application ends. These methods would be called around the call to `gui.execute()` in the application's `main()` method so that the pool is created at application start. The `execute_impl` method in core/parallel/utility.py would then be updated to use the pool from the `pool_manager` instead of creating its own.
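To make steps 1 and 2 concrete, a rough sketch of a shared_memory-based create and lookup (the names and signatures here are hypothetical, not the current API):

```python
import numpy as np
from multiprocessing import shared_memory


def create_array(shape, dtype=np.float32):
    # Returns both the numpy view and the SharedMemory object; the caller
    # (e.g. the Images object) must keep the memory object alive to prevent
    # the shared memory being garbage collected prematurely
    size = int(np.prod(shape)) * np.dtype(dtype).itemsize
    mem = shared_memory.SharedMemory(create=True, size=size)
    array = np.ndarray(shape, dtype=dtype, buffer=mem.buf)
    return array, mem


def lookup_array(handle):
    # Executed inside a worker process: reconstruct the array from a
    # (memory name, shape, dtype) tuple instead of the shared_list global
    name, shape, dtype = handle
    mem = shared_memory.SharedMemory(name=name)
    return np.ndarray(shape, dtype=dtype, buffer=mem.buf), mem
```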
We would need to be careful about memory leaks and check that processes are closed as expected when the application terminates. We would also need to check the effect of these changes on Linux and decide if we're going to use spawn everywhere (it can be specified on Linux) or continue using fork on Linux.
Here are some code snippets that could be useful in implementing the above:
1) Create the pool and start the pool processes up front. Something like the following as a pool manager module, which can define methods for starting and closing pools:
```python
import multiprocessing
from logging import getLogger
from multiprocessing import Pool

LOG = getLogger(__name__)

pool = None


def create_and_start_pool():
    LOG.info('Creating process pool')
    cores = multiprocessing.cpu_count()
    global pool
    pool = Pool(cores)
    LOG.info('Starting up processes in pool')
    # We need a function to call to start the processes, but the function itself
    # doesn't need to do anything. If we don't do this then the processes start
    # when the pool is first used later in the application, which affects
    # performance.
    pool.map_async(_do_nothing, range(cores))


def end_pool():
    pool.close()
    pool.join()


def _do_nothing(i):
    pass
```
2) Use shared memory to allow image data to be accessed from the already started processes. We would add a `memory_location` field to the `Images` class that would hold this information through the life of the stack. The shared memory is created like this:

```python
mem = shared_memory.SharedMemory(name='test', create=True, size=size)
```

It can then be attached to from within a process and read as a numpy array:

```python
sm = shared_memory.SharedMemory(name='test')
shared_array = np.ndarray(shape, dtype=dtype, buffer=sm.buf)
```
On the reconstruction side of Mantid Imaging, I can't see anywhere in our code where we're using multiprocessing - I think any multiprocessing/multithreading is happening within the calls to the libraries we're using. Running reconstructions on Windows seems to work OK, but is significantly slower than on Linux; reconstructing the volume for the flowers dataset using FBP takes approximately 2 minutes 30 seconds on Windows (on my laptop) vs 40 seconds to do the same on a Linux instance. It would be good to check the reconstruction performance on other Windows machines.
Report on step 4 of the issue plan (reconstruction speeds on Windows):
Machines need to have the appropriate amount of RAM and a high enough spec GPU plus NVIDIA drivers to successfully run reconstructions. Once we're ready to provide Windows support it might be good to add some more details on device requirements to our documentation.
On a Windows machine with a higher spec GPU, driver 460.89 and CUDA v11.2, FBP reconstruction was about two to three times slower than Linux (depending on the size of the dataset) and Astra and CIL were marginally slower. A more up to date NVIDIA driver (v511.79) made FBP slower and had no significant effect on the other two methods.
We concluded that this was still acceptable performance on Windows for the software to be usable. At some point it might be helpful to look into performance as a task in itself to see if there are any areas where we can speed things up, which would likely benefit all platforms.
Currently MI is only compatible with Linux. It would be good for it to run on Windows and potentially Mac too.
Following some initial investigation, the below is a suggested plan for the issue (see notes in the issue comments for more detail on the steps given):
1) Address the issues under Setting up the mantidimaging-dev environment - change the versions of the packages we're using so that our `mantidimaging-dev` conda environment is Windows compatible.
2) Set the style of the application to 'Fusion', as detailed under GUI features - this is a very small change that shouldn't make any visible difference on Linux, but would mean the application had a consistent look and feel when run on Windows.
3) Address the issues with checking the CUDA installation detailed in Running the application - update the code so it works on both Windows and Linux.
These first steps would mean developers could get Mantid Imaging up and running on both Windows and Linux without having to tweak settings for Windows. The application won't yet be fully functional on Windows.
4) Test the reconstruction speed on some other Windows machines to get a better sense of how slow it really is. This will help figure out how viable it is to use the system on Windows, or if we potentially need to chat with the CIL developers about algorithm speeds on Windows.
5) Work through the steps given in the suggested approach for resolving the multiprocessing issues. Each of the three steps should be possible to do as a separate PR. Step 1 will update the way we're using shared memory to lay the foundations for Windows compatibility, but the application will still only work with forked processes at this point. Step 2 will change the application so that it should work with both spawn and fork. At this point the multiprocessing should work on both Windows and Linux, however performance on Windows will be very slow. Step 3 will address the performance issues on Windows, but will require a design decision about whether we want to start using spawn on Linux and also whether we want to create processes up front on both Linux and Windows.
6) Fix the remaining issues listed under GUI features and any others that arise from testing on Windows. After this step the system should be fully functional on Windows.