There is a good chance that things got slower after the logger was included. Multiprocessing could be recreating the logger for every call. I'm going to experiment by replacing the logger with print statements in the MCF bits of the code.
It doesn't look like changing logs to prints affects the speed, since "Multiple calls to [getLogger()](https://docs.python.org/3/library/logging.html#logging.getLogger) with the same name will return a reference to the same logger object."
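A quick standalone check of that cached-logger behavior:

```python
import logging

# getLogger() caches loggers by name; repeated calls return
# the identical object rather than constructing a new one.
a = logging.getLogger("spurt")
b = logging.getLogger("spurt")
assert a is b
```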
But after changing them to prints, testing with 2 workers, and adding a `print(f"Done in {time.time() - t0}")` between batches, I'm seeing:
```
$ python -m spurt.workflows.emcf -w 2 -o emcf7 -i interferograms/
2024-07-22 11:36:55,401 [46307] [INFO] spurt: Using Hop3 Graph in time with 15 epochs.
2024-07-22 11:36:55,401 [46307] [INFO] spurt: Using existing tiles file: emcf_tmp/tiles.json
2024-07-22 11:36:55,402 [46307] [INFO] spurt: Processing tile: 1
2024-07-22 11:37:09,715 [46307] [INFO] spurt: Time steps: 39
2024-07-22 11:37:09,715 [46307] [INFO] spurt: Number of points: 495800
Temporal: Number of interferograms: 39
Temporal: Number of links: 1485205
Temporal: Number of cycles: 25
Temporal: Preparing batch 1/30
Temporal: Unwrapping batch 1/30
Processing batch of 50000 with 2 threads
Done in 9.48 seconds
Temporal: Preparing batch 2/30
Temporal: Unwrapping batch 2/30
Processing batch of 50000 with 2 threads
Done in 7.43 seconds
```
If I change the line for temporal processing:
```diff
 Pool(
-  processes=worker_count, maxtasksperchild=1
+  processes=worker_count
 ) as p:
```
it goes down to about a quarter of a second:
```
$ python -m spurt.workflows.emcf -w 6 -o emcf7 -i interferograms/
2024-07-22 11:38:59,445 [52510] [INFO] spurt: Using Hop3 Graph in time with 15 epochs.
2024-07-22 11:38:59,446 [52510] [INFO] spurt: Using existing tiles file: emcf_tmp/tiles.json
2024-07-22 11:38:59,446 [52510] [INFO] spurt: Processing tile: 1
2024-07-22 11:39:13,931 [52510] [INFO] spurt: Time steps: 39
2024-07-22 11:39:13,931 [52510] [INFO] spurt: Number of points: 495800
Temporal: Number of interferograms: 39
Temporal: Number of links: 1485205
Temporal: Number of cycles: 25
Temporal: Preparing batch 1/30
Temporal: Unwrapping batch 1/30
Processing batch of 50000 with 6 threads
Done in 0.27 seconds
Temporal: Preparing batch 2/30
Temporal: Unwrapping batch 2/30
Processing batch of 50000 with 6 threads
Done in 0.24 seconds
```
This is still slightly slower than a single thread with `-w 1`, which runs at ~0.15 to 0.20 seconds per batch.
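To see why `maxtasksperchild=1` hurts here: each completed task forces the pool to tear down its worker and fork/spawn a replacement, so with cheap tasks the process churn dominates. A minimal standalone sketch (not spurt's code):

```python
import time
from multiprocessing import Pool

def work(x):
    # Stand-in for one cheap unwrapping problem.
    return x * x

if __name__ == "__main__":
    for kwargs in ({"maxtasksperchild": 1}, {}):
        t0 = time.time()
        with Pool(processes=2, **kwargs) as p:
            # chunksize=1 makes every item its own pool task, so
            # maxtasksperchild=1 respawns a worker per item.
            p.map(work, range(100), chunksize=1)
        print(f"{kwargs or 'default'}: {time.time() - t0:.2f} s")
```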
Totally fine with that change. Will read up more on it. I suspect that parameter has to do with resource setup, and if there is no limit, the pool may skip teardown of resources that can be reused.
And the number 50000 is really there to keep memory usage down. It can easily be bumped up to a larger number if there is enough memory available. Note that we run these on small instances.
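For context, the batching pattern is roughly this shape; `iter_batches` is an illustrative helper, not spurt's actual API:

```python
def iter_batches(n_items: int, batch_size: int = 50_000):
    """Yield index slices so at most batch_size problems are alive at once."""
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))

# The "1/30" batches in the log are consistent with chunking the
# 1485205 links above into pieces of 50000:
assert sum(1 for _ in iter_batches(1_485_205)) == 30
```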
Yeah, I'd want to check what that change means for memory usage; I wouldn't want it to blow up RAM usage and still only partially fix the speed.
After some checking with our implementation, it looks like I may have copy-pasted the `maxtasksperchild` argument. It was meant to be used with spatial unwrapping, where memory requirements are higher, and not with temporal unwrapping. Spatial unwrapping now already handles this with the `ProcessPoolExecutor`, so it is not needed there either. Most temporal problems have a zero sum of residues and never use the solver. Apologies for slowing things down. I can make a PR for that change, or we can fold it into any other changes you may have.
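For reference, the stdlib grew an equivalent knob: `concurrent.futures.ProcessPoolExecutor` accepts `max_tasks_per_child` as of Python 3.11. A sketch with hypothetical names, not spurt's actual code:

```python
from concurrent.futures import ProcessPoolExecutor

# max_tasks_per_child=1 (Python >= 3.11) recycles each worker after one
# task, capping per-worker memory growth; useful for the memory-heavy
# spatial stage, pure overhead for cheap temporal tasks.
with ProcessPoolExecutor(max_workers=4, max_tasks_per_child=1) as ex:
    results = list(ex.map(unwrap_tile, tiles))  # unwrap_tile, tiles: hypothetical
```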
Is this the part you were referring to, @scottstanie?
> Is this the part you were referring to, @scottstanie?
Yes, exactly. I missed the indentation there, but apparently it only sometimes locks things up.
I added it to the open PR.
Thanks! Closed by #32.
Single worker speed: [screenshot omitted]

8 temporal workers: [screenshot omitted]
I don't think this was just a macOS spawn-default problem; the numbers above are from the Linux server.
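The start method is easy to rule out either way, since macOS has defaulted to "spawn" since Python 3.8 while Linux defaults to "fork"; it can be checked or forced directly:

```python
import multiprocessing as mp

if __name__ == "__main__":
    # "fork" on Linux, "spawn" on macOS (Python 3.8+).
    print(mp.get_start_method())
    # To reproduce the other platform's behavior:
    # mp.set_start_method("spawn", force=True)
```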
I think there are a couple of possible remedies. Passing a larger `chunksize` to the pool's `map` could be better; that did speed it up, but not above the single-thread case.
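Something along these lines (a sketch of the `chunksize` idea; `worker_count`, `problems`, and `solve_one` are placeholder names, not spurt's actual call site):

```python
from multiprocessing import Pool

# Placeholder problem set and solver for illustration.
problems = list(range(100_000))
solve_one = abs
worker_count = 6

if __name__ == "__main__":
    with Pool(processes=worker_count) as p:
        # A larger chunksize ships work to each worker in bigger pieces,
        # amortizing the pickling/IPC overhead across many items.
        chunk = max(1, len(problems) // (4 * worker_count))
        results = p.map(solve_one, problems, chunksize=chunk)
```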