simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

Moves pysmurf-controller functions that load lots of data to subprocess #669

Closed. jlashner closed this 1 month ago

jlashner commented 2 months ago

This PR changes the pysmurf-controller operations that load TODs so that they run in a subprocess instead of in the main process.

Description

This introduces the smurf_subprocess_util module and a protocol for running functions inside a twisted subprocess, such that a general RunCfg can be passed in and a general RunResult can be returned. Every operation that involves TOD data loading is modified so that the loading and analysis occur in a subprocess rather than in the main process.
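For illustration, here is a minimal sketch of this kind of protocol. It is not the code in smurf_subprocess_util: it uses the standard-library `subprocess` module instead of a twisted subprocess, and the fields on `RunCfg` and `RunResult`, the `run_in_subprocess` helper, and the `test_func` stand-in are all assumptions made up for the example.

```python
import json
import subprocess
import sys
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

# Source run in the child interpreter. It reads a JSON config from stdin,
# calls the named function, and writes a JSON result to stdout.
# All names here are hypothetical stand-ins, not the real socs functions.
CHILD_SRC = r"""
import json, sys, traceback

def test_func(n):
    # Stand-in for a heavy load: build a large list, return a small summary.
    data = list(range(n))
    return {"n": len(data), "total": sum(data)}

FUNCS = {"test_func": test_func}

cfg = json.load(sys.stdin)
try:
    val = FUNCS[cfg["func_name"]](*cfg["args"], **cfg["kwargs"])
    out = {"success": True, "return_val": val, "traceback": None}
except Exception:
    out = {"success": False, "return_val": None,
           "traceback": traceback.format_exc()}
json.dump(out, sys.stdout)
"""


@dataclass
class RunCfg:
    """What to run in the child (hypothetical fields)."""
    func_name: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


@dataclass
class RunResult:
    """What the child reports back (hypothetical fields)."""
    success: bool
    return_val: Any = None
    traceback: Optional[str] = None


def run_in_subprocess(cfg: RunCfg, timeout: float = 60.0) -> RunResult:
    """Run cfg in a fresh interpreter and parse its JSON reply.

    Raises subprocess.TimeoutExpired if the child exceeds the timeout.
    """
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SRC],
        input=json.dumps(asdict(cfg)),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        # The child died before it could report a result.
        return RunResult(success=False, traceback=proc.stderr)
    return RunResult(**json.loads(proc.stdout))


if __name__ == "__main__":
    print(run_in_subprocess(RunCfg("test_func", args=[1000])))
```

Only the small summary crosses the process boundary. The large list lives and dies in the child, so the parent never holds it.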

Motivation and Context

This should hopefully solve the memory leak, which I believe comes from cases where TOD data is not released to the system after it leaves scope. Moving TOD loading and analysis into subprocesses should ensure that this memory is returned to the OS when the subprocess exits. This discussion has some links that motivate this solution.

How Has This Been Tested?

I have tested the subprocess protocol locally through the use of the test process (and the new run_test_func operation). I will need to test all other operations with hardware on SATp1 or SATp3.

jlashner commented 2 months ago

This has been tested on satp1 and it works! It seems to have fixed the large memory leak. Notes from my testing are here: https://simonsobs.atlassian.net/wiki/spaces/~55705801070cc775254537ab15f2dcf6702b78/pages/412778522/2024-05-02+pysmurf-controller+subprocessing+tests

Notably, memory now returns to its baseline level after IVs and bias steps, which it did not do previously: (image)

I want to note that there still seems to be a memory leak caused by the functions that have not been moved to subprocesses. I did not move the stream data function to a subprocess because it isn't loading TODs. However, just creating the pysmurf object seems to leak memory, at a rate of ~15 MB per call: (image)

I think we should probably move all logic involving pysmurf to a subprocess. However, I think this PR solves the most pressing issues and the largest memory leaks, so we should merge it first.