pnnl / deimos

partitions.map(): RuntimeError when run in a Python script #20

Open jessieolough opened 7 months ago

jessieolough commented 7 months ago

Hi,

I am running the DEIMoS commands from a Python script in the DEIMoS environment terminal and saving the relevant outputs to the working directory. This has worked for every function in the Peak Detection tutorial up to partitions.map(). When the script reaches that call, the following RuntimeError message is repeated indefinitely:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 131, in _main
    prepare(preparation_data)
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "F:\JessicaOLoughlin\ExampleDataUserGuide\MSV000091746\raw\20240401_PeakDetection_script_simplified.py", line 73, in <module>
    ms1_peaks_partitioned = partitions.map(deimos.peakpick.local_maxima,
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\jessieolough\deimos\deimos\subset.py", line 395, in map
    with mp.Pool(processes=processes) as p:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\pool.py", line 215, in __init__
    self._repopulate_pool()
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\popen_spawn_win32.py", line 46, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\jessieolough\AppData\Local\anaconda3\envs\deimos\Lib\multiprocessing\spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

The same call works as expected when entered interactively in a Python session in the terminal, and all(ms1_peaks_partitioned == ms1_peaks) then returns the expected True.

I am unsure what causes this when the code is run as a script, and any help would be much appreciated.

Many thanks, Jess

Here is the entire script I am running for reference:

import deimos
import numpy as np
import matplotlib.pyplot as plt

##Persistent Homology

# Load data, excluding scanid column
ms1 = deimos.load('example_data.h5', key='ms1', columns=['mz', 'drift_time', 'retention_time', 'intensity'])

# Build factors from raw data
factors = deimos.build_factors(ms1, dims='detect')

# Nominal threshold
ms1 = deimos.threshold(ms1, threshold=500)

# Build index
index = deimos.build_index(ms1, factors)

# Smooth data
ms1 = deimos.filters.smooth(ms1, index=index, dims=['mz', 'drift_time', 'retention_time'],
                            radius=[0, 1, 0], iterations=7)

# Perform peak detection
ms1_peaks = deimos.peakpick.persistent_homology(ms1, index=index,
                                                dims=['mz', 'drift_time', 'retention_time'],
                                                radius=[2, 10, 0])

# Sort by persistence
ms1_peaks = ms1_peaks.sort_values(by='persistence', ascending=False).reset_index(drop=True)

ms1_peaks.to_csv('run_PeakDectection_ms1_peaks_PersistentHomology.csv', index=False)

##Maximum Filtration

# Load data, excluding scanid column
ms1 = deimos.load('example_data.h5', key='ms1', columns=['mz', 'drift_time', 'retention_time', 'intensity'])

# Sum over retention time
ms1_2d = deimos.collapse(ms1, keep=['mz', 'drift_time'])

# Perform peak detection
ms1_peaks = deimos.peakpick.local_maxima(ms1_2d, dims=['mz', 'drift_time'], bins=[37, 9])

ms1_peaks.to_csv('run_PeakDectection_ms1_peaks_MaximumFiltration.csv', index=False)

##Selecting Kernel Size

# Subset to lower mass range
ms1_ss = deimos.slice(ms1, by='mz', low=150, high=250)

# Get maximal data point
mz_i, dt_i, rt_i, intensity_i = ms1_ss.loc[ms1_ss['intensity'] == ms1_ss['intensity'].max(), :].round(1).values[0]

# Subset the raw data
feature = deimos.slice(ms1,
                       by=['mz', 'drift_time', 'retention_time'],
                       low=[mz_i - 0.01, dt_i - 0.6, rt_i - 0.3],
                       high=[mz_i + 0.15, dt_i + 0.6, rt_i + 0.3])

feature.to_csv('run_PeakDectection_feature.csv', index=False)

# Visualize
deimos.plot.multipanel(feature, dpi=150)
plt.tight_layout()

plt.savefig('run_PeakDetection_SelectingKernelSize_GraphicalOutput.png')

# Partition the data
partitions = deimos.partition(ms1_2d, split_on='mz', size=10000, overlap=0.5)

# Map peak detection over partitions
ms1_peaks_partitioned = partitions.map(deimos.peakpick.local_maxima,
                                       dims=['mz', 'drift_time'],
                                       bins=[37, 9],
                                       processes=8)
# If the partitions.map() call is removed, the rest of the script runs without error

smcolby commented 6 months ago

I'll take a look at what might be going on, but I'll note that persistent homology is better in every sense than the local maxima-based peak detection. We've only kept that functionality around since it was used in the 2022 paper, but realistically we should just deprecate it.
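
For reference, the idiom named in the RuntimeError can be sketched with the standard library alone. The traceback shows multiprocessing's spawn start method re-importing the script in each worker (`_fixup_main_from_path`), reaching line 73 again, and trying to create another pool from there. The sketch below is not DEIMoS code: `largest_value` and `split` are hypothetical stand-ins for `deimos.peakpick.local_maxima` and `deimos.partition`, used only to show where the pool has to be created.

```python
import multiprocessing as mp


def largest_value(chunk):
    # Hypothetical stand-in for the per-partition work
    # (the role deimos.peakpick.local_maxima plays in the script above).
    return max(chunk)


def split(values, size):
    # Hypothetical stand-in for partitioning: cut a list into
    # consecutive chunks of at most `size` items.
    return [values[i:i + size] for i in range(0, len(values), size)]


def main():
    values = [3, 1, 4, 1, 5, 9, 2, 6]
    chunks = split(values, size=3)
    # Worker processes are started only from inside main(),
    # never as a side effect of importing this file.
    with mp.Pool(processes=2) as pool:
        results = pool.map(largest_value, chunks)
    print(results)


if __name__ == '__main__':
    # Under the "spawn" start method (the default on Windows), every worker
    # re-imports this file. The guard stops workers from calling main() again.
    main()
```

Applied to the script in this issue, the same pattern would mean moving the top-level statements, or at least the partitions.map(...) call, into a function that is only called under `if __name__ == '__main__':`. It would also likely explain why the call succeeds interactively: there is no script file for the workers to re-run in that case.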