There is a good chance that things got slower after the logger was included. Multiprocessing could be recreating the logger for every call. I'm going to experiment by replacing the logger with print statements in the MCF bits of the code.
It doesn't look like changing logs to prints affects the speed, since "Multiple calls to [getLogger()](https://docs.python.org/3/library/logging.html#logging.getLogger) with the same name will return a reference to the same logger object."
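A quick standalone check of that cached-logger behavior:

```python
import logging

# getLogger() caches loggers by name; repeated calls return
# the identical object rather than constructing a new one.
a = logging.getLogger("spurt")
b = logging.getLogger("spurt")
assert a is b
```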
But after changing them to prints, testing with 2 workers, and adding a `print(f"Done in {time.time() - t0}")` between batches, I'm seeing:
```
$ python -m spurt.workflows.emcf -w 2 -o emcf7 -i interferograms/
2024-07-22 11:36:55,401 [46307] [INFO] spurt: Using Hop3 Graph in time with 15 epochs.
2024-07-22 11:36:55,401 [46307] [INFO] spurt: Using existing tiles file: emcf_tmp/tiles.json
2024-07-22 11:36:55,402 [46307] [INFO] spurt: Processing tile: 1
2024-07-22 11:37:09,715 [46307] [INFO] spurt: Time steps: 39
2024-07-22 11:37:09,715 [46307] [INFO] spurt: Number of points: 495800
Temporal: Number of interferograms: 39
Temporal: Number of links: 1485205
Temporal: Number of cycles: 25
Temporal: Preparing batch 1/30
Temporal: Unwrapping batch 1/30
Processing batch of 50000 with 2 threads
Done in 9.48 seconds
Temporal: Preparing batch 2/30
Temporal: Unwrapping batch 2/30
Processing batch of 50000 with 2 threads
Done in 7.43 seconds
```
If I change the line for temporal processing:
```diff
 Pool(
-  processes=worker_count, maxtasksperchild=1
+  processes=worker_count
 ) as p:
```
it goes down to about a quarter of a second:
```
$ python -m spurt.workflows.emcf -w 6 -o emcf7 -i interferograms/
2024-07-22 11:38:59,445 [52510] [INFO] spurt: Using Hop3 Graph in time with 15 epochs.
2024-07-22 11:38:59,446 [52510] [INFO] spurt: Using existing tiles file: emcf_tmp/tiles.json
2024-07-22 11:38:59,446 [52510] [INFO] spurt: Processing tile: 1
2024-07-22 11:39:13,931 [52510] [INFO] spurt: Time steps: 39
2024-07-22 11:39:13,931 [52510] [INFO] spurt: Number of points: 495800
Temporal: Number of interferograms: 39
Temporal: Number of links: 1485205
Temporal: Number of cycles: 25
Temporal: Preparing batch 1/30
Temporal: Unwrapping batch 1/30
Processing batch of 50000 with 6 threads
Done in 0.27 seconds
Temporal: Preparing batch 2/30
Temporal: Unwrapping batch 2/30
Processing batch of 50000 with 6 threads
Done in 0.24 seconds
```
This is still slightly slower than a single thread with `-w 1`, which runs at ~0.15 to 0.20 seconds per batch.
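To see why `maxtasksperchild=1` hurts here: each completed task forces the pool to tear down its worker and fork/spawn a replacement, so with cheap tasks the process churn dominates. A minimal standalone sketch (not spurt's code):

```python
import time
from multiprocessing import Pool

def work(x):
    # Stand-in for one cheap unwrapping problem.
    return x * x

if __name__ == "__main__":
    for kwargs in ({"maxtasksperchild": 1}, {}):
        t0 = time.time()
        with Pool(processes=2, **kwargs) as p:
            # chunksize=1 makes every item its own pool task, so
            # maxtasksperchild=1 respawns a worker per item.
            p.map(work, range(100), chunksize=1)
        print(f"{kwargs or 'default'}: {time.time() - t0:.2f} s")
```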
Totally fine with that change. Will read up more on it. I suspect that parameter has to do with resource setup, and if there is no limit, the pool may skip teardown of resources that can be reused.
And the number 50000 is really there to keep memory usage down. It can easily be bumped up to a larger number if there is enough memory available. Note that we run these on small instances.
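For context, the batching pattern is roughly this shape; `iter_batches` is an illustrative helper, not spurt's actual API:

```python
def iter_batches(n_items: int, batch_size: int = 50_000):
    """Yield index slices so at most batch_size problems are alive at once."""
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))

# The "1/30" batches in the log are consistent with chunking the
# 1485205 links above into pieces of 50000:
assert sum(1 for _ in iter_batches(1_485_205)) == 30
```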
Yeah, I'd want to check what that change means for memory usage; I wouldn't want it to blow up RAM usage and still only partially fix the speed.
After some checking with our implementation, it looks like I may have copy-pasted the `maxtasksperchild` argument. It was meant to be used with spatial unwrapping, where memory requirements are higher, and not with temporal unwrapping. Spatial unwrapping now already handles this with the `ProcessPoolExecutor`, so it is not needed there either. Most temporal problems have a zero sum of residues and never use the solver. Apologies for slowing things down. I can make a PR for that change, or we can fold it into any other changes you may have.
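For reference, the stdlib grew an equivalent knob: `concurrent.futures.ProcessPoolExecutor` accepts `max_tasks_per_child` as of Python 3.11. A sketch with hypothetical names, not spurt's actual code:

```python
from concurrent.futures import ProcessPoolExecutor

# max_tasks_per_child=1 (Python >= 3.11) recycles each worker after one
# task, capping per-worker memory growth; useful for the memory-heavy
# spatial stage, pure overhead for cheap temporal tasks.
with ProcessPoolExecutor(max_workers=4, max_tasks_per_child=1) as ex:
    results = list(ex.map(unwrap_tile, tiles))  # unwrap_tile, tiles: hypothetical
```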
Is this the part you were referring to, @scottstanie?
> Is this the part you were referring to, @scottstanie?
Yes, exactly. I missed the indentation there, but apparently it only sometimes locks things up.
I added it to the open PR.
Thanks! Closed by #32.
Single worker speed: [screenshot omitted]

8 temporal workers: [screenshot omitted]
I don't think this was just a macOS spawn-default problem; the numbers above are from the Linux server.
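The start method is easy to rule out either way, since macOS has defaulted to "spawn" since Python 3.8 while Linux defaults to "fork"; it can be checked or forced directly:

```python
import multiprocessing as mp

if __name__ == "__main__":
    # "fork" on Linux, "spawn" on macOS (Python 3.8+).
    print(mp.get_start_method())
    # To reproduce the other platform's behavior:
    # mp.set_start_method("spawn", force=True)
```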
I think there are a couple of possible remedies. Passing a larger `chunksize` to the pool's `map` could be better; that did speed it up, but not above the single-thread case.
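Something along these lines (a sketch of the `chunksize` idea; `worker_count`, `problems`, and `solve_one` are placeholder names, not spurt's actual call site):

```python
from multiprocessing import Pool

# Placeholder problem set and solver for illustration.
problems = list(range(100_000))
solve_one = abs
worker_count = 6

if __name__ == "__main__":
    with Pool(processes=worker_count) as p:
        # A larger chunksize ships work to each worker in bigger pieces,
        # amortizing the pickling/IPC overhead across many items.
        chunk = max(1, len(problems) // (4 * worker_count))
        results = p.map(solve_one, problems, chunksize=chunk)
```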