Segway process uses too much memory

EricR86 commented 8 years ago

Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

The segway process itself (not the jobs it spawns) takes up a significant amount of memory on longer-running jobs. Its memory usage increases the longer segway runs and submits jobs.

Here is an example of memory usage from top on an SGE Cluster:

21824 rachelc   20   0 47.3g  45g 9948 S  1.0 36.5  88:02.21 segway

Here 47.3g is the virtual memory size of the process and 45g is its resident set size.

Python memory-profiling commands were injected into the running process using guppy and pyringe, which produced the following:

>>> inject('print >>sys.stderr, guppy_hpy.heap()')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 747222 objects. Total size = 113102584 bytes.

Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)

     0 387984  52 48701888  43  48701888  43 unicode

     1   8273   1  8670104   8  57371992  51 dict of segway.cluster.RestartableJob

     2   8273   1  8670104   8  66042096  58 dict of segway.cluster.sge.JobTemplateFactory

     3  79033  11  7895136   7  73937232  65 str

     4  13812   2  5983520   5  79920752  71 list

     5  17297   2  5808536   5  85729288  76 dict (no owner)

     6  69421   9  5518848   5  91248136  81 tuple

     7  20018   3  5167344   5  96415480  85 path.path

     8  16546   2  2382624   2  98798104  87 0x312e480

     9   8273   1  2316440   2 101114544  89 dict of drmaa.session.JobTemplate

<349 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap()[0].byvia')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 387939 objects. Total size = 48696176 bytes.

Index  Count   %     Size   % Cumulative  % Referred Via:

     0   8280   2  4566200   9   4566200   9 '[6]'

     1   8280   2  2845808   6   7412008  15 '[54]'

     2   8280   2  2708536   6  10120544  21 '[50]'

     3   8280   2  2647320   5  12767864  26 '[46]'

     4   8279   2  2647256   5  15415120  32 '[42]'

     5   8280   2  2581136   5  17996256  37 '[48]'

     6   8278   2  2581040   5  20577296  42 '[18]'

     7   8280   2  2515000   5  23092296  47 '[26]'

     8   8280   2  2448824   5  25541120  52 '[52]'

     9   8289   2  1324488   3  26865608  55 '[0]'

<3663 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap()[0].byrcs')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 387622 objects. Total size = 48655880 bytes.

Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)

     0 374216  97 47173576  97  47173576  97 list

     1   8265   2   925680   2  48099256  99 dict of segway.cluster.sge.JobTemplateFactory

     2    786   0   255264   1  48354520  99 dict (no owner)

     3   2577   1   164928   0  48519448 100 list, segway.cluster.RestartableJobDict

     4   1649   0   105536   0  48624984 100 segway.cluster.RestartableJobDict

     5     21   0    21264   0  48646248 100 function, tuple

     6     37   0     2152   0  48648400 100 tuple

     7     11   0     1848   0  48650248 100 dict of type

     8     20   0     1520   0  48651768 100 drmaa.session.JobInfo

     9      7   0     1368   0  48653136 100 dict of module

<9 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap()[0].byrcs[0].byvia')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 374039 objects. Total size = 47151000 bytes.

Index  Count   %     Size   % Cumulative  % Referred Via:

     0   8268   2  4560568  10   4560568  10 '[6]'

     1   8270   2  2842368   6   7402936  16 '[54]'

     2   8270   2  2705248   6  10108184  21 '[50]'

     3   8270   2  2644120   6  12752304  27 '[46]'

     4   8269   2  2644056   6  15396360  33 '[42]'

     5   8270   2  2578016   5  17974376  38 '[48]'

     6   8268   2  2577920   5  20552296  44 '[18]'

     7   8270   2  2511960   5  23064256  49 '[26]'

     8   8270   2  2445864   5  25510120  54 '[52]'

     9   8268   2  1322208   3  26832328  57 '[0]'

<517 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap()[0].byrcs[0].byrcs')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 373686 objects. Total size = 47105912 bytes.

Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)

     0 373686 100 47105912 100  47105912 100 list

>>> inject('print >>sys.stderr, guppy_hpy.heapu()')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Data from unreachable objects.

Partition of a set of 905 objects. Total size = 317792 bytes.

Index  Count   %     Size   % Cumulative  % Class

     0    108  12   215328  68    215328  68 dict

     1     40   4    34880  11    250208  79 type

     2    104  11    14656   5    264864  83 list

     3    155  17    12920   4    277784  87 str

     4    125  14    12528   4    290312  91 tuple

     5    144  16    11520   4    301832  95 __builtin__.wrapper_descriptor

     6     53   6     3816   1    305648  96 types.MemberDescriptorType

     7     51   6     3672   1    309320  97 __builtin__.method_descriptor

     8     35   4     2520   1    311840  98 types.BuiltinFunctionType

     9     19   2     1368   0    313208  99 types.GetSetDescriptorType

<42 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap().byclodo')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 746177 objects. Total size = 112935584 bytes.

Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)

     0 387171  52 48598408  43  48598408  43 unicode

     1   8255   1  8651240   8  57249648  51 dict of segway.cluster.RestartableJob

     2   8255   1  8651240   8  65900888  58 dict of segway.cluster.sge.JobTemplateFactory

     3  79035  11  7895784   7  73796672  65 str

     4  13790   2  5972464   5  79769136  71 list

     5  17279   2  5805800   5  85574936  76 dict (no owner)

     6  69375   9  5515384   5  91090320  81 tuple

     7  19982   3  5157728   5  96248048  85 path.path

     8  16510   2  2377440   2  98625488  87 0x312e480

     9   8255   1  2311400   2 100936888  89 dict of drmaa.session.JobTemplate

<355 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap().byid')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Set of 745706 <mixed> objects. Total size = 112859584 bytes.

Index     Size   %   Cumulative  %   Brief

     0    98600   0.1     98600   0.1 segway.cluster.RestartableJobDict: 0x7ef57e1f0bc0

     1    98600   0.1    197200   0.2 segway.cluster.RestartableJobDict: 0x7efa7a9a2bc0

     2    98584   0.1    295784   0.3 dict (no owner): 0x2f3aba0*843

     3    49448   0.0    345232   0.3 segway.cluster.RestartableJobDict: 0x7ef552d98fc0

     4    49448   0.0    394680   0.3 segway.cluster.RestartableJobDict: 0x7ef55b9b13c0

     5    49448   0.0    444128   0.4 segway.cluster.RestartableJobDict: 0x7ef562907ba0

     6    49448   0.0    493576   0.4 segway.cluster.RestartableJobDict: 0x7ef571e837c0

     7    49448   0.0    543024   0.5 segway.cluster.RestartableJobDict: 0x7ef5749199b0

     8    49448   0.0    592472   0.5 segway.cluster.RestartableJobDict: 0x7ef578db32f0

     9    49448   0.0    641920   0.6 segway.cluster.RestartableJobDict: 0x7ef595e1a400

<745696 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap().byrcs')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 745653 objects. Total size = 112849664 bytes.

Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)

     0 423460  57 51153360  45  51153360  45 list

     1  57946   8 11533632  10  62686992  56 dict of segway.cluster.sge.JobTemplateFactory

     2   8247   1  8642856   8  71329848  63 segway.cluster.RestartableJob

     3   8247   1  8642856   8  79972704  71 segway.cluster.sge.JobTemplateFactory

     4  45260   6  4468840   4  84441544  75 types.CodeType

     5   8247   1  2309160   2  86750704  77 0x312e480, dict (no owner)

     6   8247   1  2309160   2  89059864  79 0x312eb10

     7   8247   1  2309160   2  91369024  81 drmaa.session.JobTemplate

     8  51241   7  1960096   2  93329120  83 tuple

     9  16544   2  1899344   2  95228464  84 dict (no owner)

<1372 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap().byrcs[0].byid')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Set of 422945 <mixed> objects. Total size = 51089632 bytes.

Index     Size   %   Cumulative  %   Brief

     0     3352   0.0      3352   0.0 dict (no owner): 0x2cee9f0*24

     1     3352   0.0      6704   0.0 dict (no owner): 0x2cef1e0*24

     2     3352   0.0     10056   0.0 dict (no owner): 0x2cef300*24

     3     3352   0.0     13408   0.0 dict (no owner): 0x2cef550*24

     4     3352   0.0     16760   0.0 dict (no owner): 0x2cf4be0*24

     5     3352   0.0     20112   0.0 dict (no owner): 0x2cf5b90*24

     6     3352   0.0     23464   0.0 dict (no owner): 0x2cf6b40*24

     7     3352   0.0     26816   0.1 dict (no owner): 0x2cf7af0*24

     8     3352   0.0     30168   0.1 dict (no owner): 0x2cf7e90*24

     9     3352   0.0     33520   0.1 dict (no owner): 0x2cf8230*24

<422935 more rows. Type e.g. '_.more' to view.>

>>> inject('print >>sys.stderr, guppy_hpy.heap().byrcs[0].theone')

==> pid:[21824] #threads:[11] current thread:[139643938924288]

Partition of a set of 747342 objects. Total size = 113121328 bytes.

Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)

     0 388073  52 48713320  43  48713320  43 unicode

     1   8275   1  8672200   8  57385520  51 dict of segway.cluster.RestartableJob

     2   8275   1  8672200   8  66057720  58 dict of segway.cluster.sge.JobTemplateFactory

     3  79031  11  7895000   7  73952720  65 str

     4  13814   2  5984752   5  79937472  71 list

     5  17295   2  5807976   5  85745448  76 dict (no owner)

     6  69428   9  5519344   5  91264792  81 tuple

     7  20022   3  5168432   5  96433224  85 path.path

     8  16550   2  2383200   2  98816424  87 0x312e480

     9   8275   1  2317000   2 101133424  89 dict of drmaa.session.JobTemplate

<349 more rows. Type e.g. '_.more' to view.>

Traceback (most recent call last):

  File "<string>", line 1, in <module>

  File "/mnt/work1/users/home2/rachelc/.local/lib/python2.7/site-packages/guppy/heapy/UniSet.py", line 602, in <lambda>

    theone = property(lambda self: self.fam.get_theone(self), doc="""\

  File "/mnt/work1/users/home2/rachelc/.local/lib/python2.7/site-packages/guppy/heapy/UniSet.py", line 1784, in get_theone

    raise ValueError, 'theone requires a singleton set'

ValueError: theone requires a singleton set

Guppy documentation is hard to come by, especially for the syntax used here. It is essentially a declarative domain-specific language for selecting information about the Python memory of a given program. Notably, .heap() returns a selection of the entire heap reachable from Python objects, and .heapu() returns the unreachable Python objects (those that should be garbage collected). The heap() output itself shows an unreasonable number of unicode objects taking up a significant amount of memory. byrcs takes the heap, or part of it, and partitions it by "referrers by kind (class)". byvia does something similar, except it shows the reference (for example, a list index) through which the objects are referred to.
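To make "referrers by kind" more concrete, here is a rough standard-library analogue of what byrcs reports. It is only a sketch and does not use guppy: `referrer_kinds`, `job_args`, and the generated strings are hypothetical names and data, and `gc.get_referrers` only sees containers tracked by the garbage collector.

```python
import gc
import types
from collections import Counter


def referrer_kinds(objs):
    """Count the type names of the containers that refer to each object in objs."""
    kinds = Counter()
    for obj in objs:
        for ref in gc.get_referrers(obj):
            # Skip the bookkeeping list itself and any interpreter frames,
            # which only refer to obj because this function is inspecting it.
            if ref is objs or isinstance(ref, types.FrameType):
                continue
            kinds[type(ref).__name__] += 1
    return kinds


# Hypothetical data: three "jobs", each holding a list of five argument strings.
job_args = [["arg-%d-%d" % (job, i) for i in range(5)] for job in range(3)]
strings = [s for args in job_args for s in args]

kinds = referrer_kinds(strings)
print(kinds.most_common(3))
```

In this toy case the strings are held almost entirely by lists, which is the same shape as the byrcs output above, where 97% of the unicode objects are referred to by a list.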

From this preliminary data, it looks like there is a massive list (possibly a list of lists) of unicode objects. Presumably it grows the longer segway runs.
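A quick back-of-the-envelope check on the first heap() listing may help here. The numbers below are copied from that listing. Reading them as "one list of strings per job" is an assumption, not something the profiler states directly.

```python
# Figures copied from the first heap() listing above.
unicode_count = 387984   # row 0: number of unicode objects
unicode_bytes = 48701888  # row 0: total size of those objects
job_count = 8273          # row 1: dicts of segway.cluster.RestartableJob

# Assumption: the unicode objects are spread evenly across the jobs.
strings_per_job = unicode_count / job_count
bytes_per_job = unicode_bytes / job_count
bytes_per_string = unicode_bytes / unicode_count

print("unicode objects per job: %.1f" % strings_per_job)
print("unicode bytes per job:   %.0f" % bytes_per_job)
print("bytes per unicode:       %.1f" % bytes_per_string)
```

That works out to roughly 47 unicode objects per job. The byvia listings also show about 8,280 objects at each list index, close to the job count, which would fit one retained list of strings per submitted job.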

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Edited issue description

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

--num-instances seems to be tied to this issue.

Very little memory, if any, is "leaked" with --num-instances 2 (still multithreaded). However, when --num-instances is set to 10, a significant amount of memory seems to "leak", on the order of megabytes within a minute.
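For comparing runs with different --num-instances values, it may help to turn resident-set-size readings into a single growth rate. The following is a minimal sketch that fits a least-squares line through (seconds, bytes) samples. The function name `leak_rate` and the sample values are hypothetical, not measurements from this issue.

```python
def leak_rate(samples):
    """Return the least-squares slope of (seconds, rss_bytes) samples, in bytes/second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one point in time")
    return cov / var


MIB = 2 ** 20
# Hypothetical readings taken one minute apart, growing by 1 MiB per minute.
samples = [(0, 100 * MIB), (60, 101 * MIB), (120, 102 * MIB)]
rate = leak_rate(samples)
print("growth: %.2f MiB per minute" % (rate * 60 / MIB))
```

A slope near zero over a long run would suggest no leak, while slopes that scale with the instance count would support the observation above.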

These results were consistent between the current Segway tip and the Segway 1.2.0 release.

I cannot run an earlier release of Segway since the cluster system is not recognized.

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Memory is still leaked regardless of whether segway is run threaded. However, when run threaded, it leaks sooner and faster, at a rate proportional to the number of threads.

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

It is worth noting that all Python string objects assigned to DRMAA job template attributes are automatically "cast" to unicode objects. The Python DRMAA API does this by defining the __set__ method on the attribute and calling the C library reference implementation. It is that C library call that returns a unicode object.
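The general mechanism can be sketched with a plain Python descriptor. This is not the drmaa package's code: `TextAttribute` and `JobTemplateSketch` are hypothetical, no C library is involved, and the sketch targets Python 3, where bytes to str stands in for the Python 2 str to unicode conversion described above.

```python
class TextAttribute:
    """Hypothetical descriptor that always stores its value as text."""

    def __set_name__(self, owner, name):
        # Keep the converted value in a private per-instance slot.
        self._slot = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self._slot)

    def __set__(self, instance, value):
        # Every assignment passes through here, so the stored object is
        # a new text object rather than the one the caller assigned.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        setattr(instance, self._slot, str(value))


class JobTemplateSketch:
    remote_command = TextAttribute()


jt = JobTemplateSketch()
jt.remote_command = b"/bin/echo"
print(type(jt.remote_command).__name__, jt.remote_command)
```

The point of the sketch is that the caller never sees the conversion: whatever is assigned, the object read back from the attribute is a separate text object owned by the template.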

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).

Fixed in pull request #54 (@ericr86, I am unable to mark this as resolved).

EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

changed state from "new" to "resolved"

Fixed in pull request #54

hoffmangroup / segway

Segway process uses too much memory #60