microsoft / CameraTraps

PyTorch Wildlife: a Collaborative Deep Learning Framework for Conservation.
https://cameratraps.readthedocs.io/en/latest/

Limit number of cores used #286

Closed Wytamma closed 2 years ago

Wytamma commented 2 years ago

I'm using the batch detector on an HPC and need to limit the number of cores used to the number requested. If I use the --ncores option, the batch detector creates a pool of workers, which uses a lot more memory and doesn't support logging or checkpointing. Is there a way to limit the number of cores used without creating a worker pool? I think it's related to set_inter_op_parallelism_threads, but I'm not sure how to set the TensorFlow config from run_tf_detector_batch.py.
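
(For concreteness, this is roughly what I'm imagining near the top of run_tf_detector_batch.py -- a sketch, not actual repo code; num_cores is a placeholder:)

import os

num_cores = 1  # placeholder: the number of cores requested from the scheduler

# Cap OpenMP threads; generally needs to happen before TensorFlow is imported
os.environ["OMP_NUM_THREADS"] = str(num_cores)

import tensorflow as tf

# TF2-style threading config; must run before TF initializes its thread pools
tf.config.threading.set_inter_op_parallelism_threads(num_cores)
tf.config.threading.set_intra_op_parallelism_threads(num_cores)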

agentmorris commented 2 years ago

Thanks for reaching out! And a great question: I honestly didn't know until this email thread that with --ncores set to 1 (the default), multiple CPUs get used at all, but I was able to verify that (at least on Windows) this is indeed the case. I learned something new today!

I tried every trick the Internet had to offer re: limiting TensorFlow to a finite number of cores, but was unable to get anything less than all cores used on my Windows machine. Others asking similar questions seem to have hit the same wall. I haven't tried on Linux, and I assume that if you're on an HPC system, you're probably running Linux, so you may want to check out the repo branch I just created called "cpu_limit", where I've included basically every suggestion for limiting TF to a single core at the top of this file:

https://github.com/microsoft/CameraTraps/blob/cpu_limit/detection/tf_detector.py

Specifically, with "num_cores=1" I tried all of the following (at the same time or separately):

Environment variables:

os.environ["OMP_NUM_THREADS"] = str(num_cores)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_cores)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_cores)

The TF1 way of specifying this:

# TF1-style session config: limit both thread pools and the number
# of CPU devices TensorFlow registers
config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores,
                        allow_soft_placement=True,
                        device_count={'CPU': num_cores})

...

self.tf_session = tf.Session(graph=detection_graph, config=config)

The TF2 way of specifying this:

# TF2-style threading config (no effect observed in this case)
tf.config.threading.set_inter_op_parallelism_threads(num_cores)
tf.config.threading.set_intra_op_parallelism_threads(num_cores)
tf.config.set_soft_device_placement(True)
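
(Worth noting, as an aside: the TF2 threading setters generally have to run before TensorFlow initializes its thread pools -- i.e., before any ops, sessions, or models are created -- otherwise they raise a RuntimeError or silently do nothing, so call order may matter here.)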

No luck.

This is all likely somewhat system-specific, so I think it's worth running the code on that branch exactly as it is to see whether it limits TF to a single core on your HPC system. If that fixes it, we're good. If that doesn't fix it, we can share with you a pre-release version of MDv5, which uses PyTorch; this isn't inherently better or worse, but you will have a different set of tools to experiment with for managing CPU utilization.
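
(If it helps to see what that looks like, here's a minimal sketch of the equivalent knobs on the PyTorch side -- not MDv5 code, just the standard torch APIs; num_cores is a placeholder:)

import torch

num_cores = 1  # placeholder: the core count requested from the scheduler

# Threads used *within* a single op (e.g., a convolution)
torch.set_num_threads(num_cores)

# Threads used to run independent ops in parallel; must be called
# before any inter-op parallel work starts
torch.set_num_interop_threads(num_cores)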

agentmorris commented 2 years ago

BTW adding checkpoints for parallelized jobs (ncores > 1) wouldn't be that big a deal, but I don't think that would actually help you here; as I understand it, your issue is that you want to limit the number of cores used, so I think the issue of checkpointing isn't related.

Wytamma commented 2 years ago

Hi @agentmorris! Thanks for the reply. Yep, the HPC is running Linux. Your solution kinda works for me. As far as I can tell, tf.ConfigProto(...) and tf.config.threading... don't work. However, setting the env vars has some effect: it seems the number of CPUs used is limited to around 1.5 * the value of OMP_NUM_THREADS. I've adjusted my code to set OMP_NUM_THREADS etc. to the number of CPUs requested // 2. Thanks again!
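
(Concretely, something like this -- requested_cores is whatever I asked the scheduler for:)

import os

requested_cores = 8  # placeholder: cores allocated by the scheduler

# TF seemed to use ~1.5x OMP_NUM_THREADS cores, so halve the target
os.environ["OMP_NUM_THREADS"] = str(requested_cores // 2)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(requested_cores // 2)
os.environ["TF_NUM_INTEROP_THREADS"] = str(requested_cores // 2)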

Wytamma commented 2 years ago

Actually... with this configuration, the number of cores used never goes above the requested amount. Maybe the config and env vars are controlling different settings?

os.environ["OMP_NUM_THREADS"] = str(requested_cores)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(requested_cores)
os.environ["TF_NUM_INTEROP_THREADS"] = str(requested_cores)

config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1, 
                        allow_soft_placement=True,
                        device_count = {'CPU': 1})
agentmorris commented 2 years ago

Great, glad we found a solution.

For posterity, can you confirm that you are in fact passing that config object when you create the session? I.e.:

self.tf_session = tf.Session(graph=detection_graph, config=config)

Wytamma commented 2 years ago

Yep, without passing the config object it uses ~1.5 * OMP_NUM_THREADS cores.
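
For anyone landing here later, a consolidated sketch of the combination that worked for me (requested_cores is a placeholder for the scheduler allocation, and detection_graph stands in for the loaded MegaDetector graph from tf_detector.py):

import os

requested_cores = 8  # placeholder: cores allocated by the HPC scheduler

# Env vars, set before TensorFlow is imported
os.environ["OMP_NUM_THREADS"] = str(requested_cores)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(requested_cores)
os.environ["TF_NUM_INTEROP_THREADS"] = str(requested_cores)

import tensorflow as tf  # TF1-style API, matching tf_detector.py

# Session config, pinned to 1 as above
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True,
                        device_count={'CPU': 1})

# detection_graph: the loaded detection graph (see tf_detector.py)
tf_session = tf.Session(graph=detection_graph, config=config)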

agentmorris commented 2 years ago

Great, closing this issue.