error in Multiprocessing.py with singularity

ssnn-airr commented 4 years ago

Original report by Ido Tamir (Bitbucket: ido, GitHub: ido).

Hello,

I run pRESTO in a Singularity 3.4.1 container on a Slurm cluster with Nextflow. When multiple instances are running, I randomly get this error:

Command output:
  clip-c2-70 98352
  IDENTIFIER: 98352
  DIRECTORY: .
  PRESTO VERSION: 0.5.13-2019.08.29

  START
     1: FilterSeq quality        17:49 03/31/20
  ERROR:
      Traceback (most recent call last):
        File "/usr/local/bin/FilterSeq.py", line 239, in <module>
          filterSeq(**args_dict)
        File "/usr/local/bin/FilterSeq.py", line 83, in filterSeq
          nproc, queue_size)
        File "/usr/local/lib/python3.7/site-packages/presto/Multiprocessing.py", line 197, in manageProcesses
          alive = mp.Value(ctypes.c_bool, True)
        File "/usr/lib64/python3.7/multiprocessing/context.py", line 135, in Value
          ctx=self.get_context())
        File "/usr/lib64/python3.7/multiprocessing/sharedctypes.py", line 74, in Value
          obj = RawValue(typecode_or_type, *args)
        File "/usr/lib64/python3.7/multiprocessing/sharedctypes.py", line 49, in RawValue
          obj = _new_value(type_)
        File "/usr/lib64/python3.7/multiprocessing/sharedctypes.py", line 41, in _new_value
          wrapper = heap.BufferWrapper(size)
        File "/usr/lib64/python3.7/multiprocessing/heap.py", line 263, in __init__
          block = BufferWrapper._heap.malloc(size)
        File "/usr/lib64/python3.7/multiprocessing/heap.py", line 242, in malloc
          (arena, start, stop) = self._malloc(size)
        File "/usr/lib64/python3.7/multiprocessing/heap.py", line 134, in _malloc
          arena = Arena(length)
        File "/usr/lib64/python3.7/multiprocessing/heap.py", line 74, in __init__
          dir=self._choose_dir(size))
        File "/usr/lib64/python3.7/tempfile.py", line 340, in mkstemp
          return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
        File "/usr/lib64/python3.7/tempfile.py", line 258, in _mkstemp_inner
          fd = _os.open(file, flags, 0o600)
      PermissionError: [Errno 13] Permission denied: '/dev/shm/pym-49784-3so0rtft'

This happens more or less randomly, and I suspect it occurs when several of the processes land on the same node. It did not happen when I processed only one dataset. Is this possible?

It's a bit difficult to debug. Do you know what I could do?
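
One way to narrow this down, as a minimal sketch rather than anything pRESTO itself provides: the traceback ends in `tempfile.mkstemp` creating a `pym-<pid>-` file under `/dev/shm`, so the same call can be tried directly inside the container on the affected node. The function name `probe_temp_dir` is hypothetical; the prefix and directory are taken from the traceback above.

```python
import os
import tempfile


def probe_temp_dir(directory="/dev/shm"):
    """Try to create and delete a temp file, similar to the call in the traceback.

    Returns (True, "ok") on success, or (False, "<ErrorName>: <message>") on failure.
    """
    try:
        # Same prefix style as the failing file name ('pym-<pid>-...').
        fd, path = tempfile.mkstemp(prefix="pym-%d-" % os.getpid(), dir=directory)
    except OSError as err:
        # PermissionError and FileNotFoundError are both subclasses of OSError.
        return False, "%s: %s" % (type(err).__name__, err)
    os.close(fd)
    os.unlink(path)
    return True, "ok"


if __name__ == "__main__":
    # Compare the shared-memory directory with the regular temp directory.
    for candidate in ("/dev/shm", tempfile.gettempdir()):
        writable, detail = probe_temp_dir(candidate)
        print(candidate, writable, detail)
```

Running this from several concurrent jobs on the same node would show whether the failure is tied to `/dev/shm` itself or only appears under contention.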

Thank you very much,

ido

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

Ah, yeah, I was thinking of #65.

ssnn-airr commented 4 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).

I’ve actually not encountered this particular problem, or anything to do with multiprocessing or FilterSeq really. @{557058:4550da01-a8bf-43cd-9876-9fde3f096568} I think you might be thinking of the time I got stuck with AssemblePairs, but that was because of blastn and something to do with the file system, not because of MPI.

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

This one is hard to debug. We do see this on some computing clusters, more often with AssemblePairs or AlignSets. I think it’s caused by running out of allocated memory.
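
To check the memory hypothesis, a rough sketch (an illustration, not part of pRESTO) that reports how full the shared-memory filesystem is at the time of the job. The helper name `shm_usage` is hypothetical, and `/dev/shm` is assumed from the traceback in the report.

```python
import os
import shutil


def shm_usage(directory="/dev/shm"):
    """Return (total, used, free) in bytes for a directory, or None if it is missing."""
    if not os.path.isdir(directory):
        return None
    usage = shutil.disk_usage(directory)
    return usage.total, usage.used, usage.free


if __name__ == "__main__":
    result = shm_usage()
    if result is None:
        print("/dev/shm is not available in this environment")
    else:
        total, used, free = result
        print("total=%d used=%d free=%d" % (total, used, free))
```

Logging this at the start of each job would show whether the failures coincide with a nearly full `/dev/shm`.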

@{557058:17032cad-7aa5-47e3-a2ff-083ece1e1478}, did you have any luck working around this on farnam?

ssnn-airr commented 2 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

Reopen if it reappears.

immcantation / presto

error in Multiprocessing.py with singularity #72