Given that the multi-threading performance of MXNet looks pretty limited, we only have two choices that can fully utilize all the processors:
1. use a local-machine distributed training method;
2. switch our religion to TensorFlow.
However, the problem with choice 2 is that TensorFlow does not seem flexible and readable enough.
So we may choose the first one.
The main work here is to modify dmlc_local.py so that it supports user-created worker processes launched as multiprocessing.Process instances. I can investigate this later, after I finish tuning a continuous-control RL algorithm.
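For reference, here is a rough sketch of what such a launcher might look like. This is only an illustration under my assumptions: the DMLC_* environment variables are the ones the dmlc_local.py / PS-lite setup already uses, but launch_local, _run_role, and worker_fn are made-up names, and I haven't verified how importing mxnet under each role behaves in child processes (PS-lite and fork don't always mix, hence the 'spawn' context).

```python
import os
import multiprocessing


def _run_role(role, env, target=None):
    """Configure the DMLC environment for this role, then start it.

    For 'scheduler' and 'server', importing mxnet is expected to start
    the PS-lite node from the environment; for 'worker', we run the
    user-supplied function instead.
    """
    os.environ.update(env)
    os.environ['DMLC_ROLE'] = role
    import mxnet  # noqa: F401 -- reads the DMLC_* variables on import
    if role == 'worker' and target is not None:
        target()


def launch_local(worker_fn, num_workers=4, num_servers=1, port=9091):
    """Launch scheduler, servers, and user-created workers as
    multiprocessing.Process instances on the local machine."""
    env = {
        'DMLC_PS_ROOT_URI': '127.0.0.1',
        'DMLC_PS_ROOT_PORT': str(port),
        'DMLC_NUM_SERVER': str(num_servers),
        'DMLC_NUM_WORKER': str(num_workers),
    }
    # 'spawn' avoids forking a process that may already hold native-library state
    ctx = multiprocessing.get_context('spawn')
    procs = [ctx.Process(target=_run_role, args=('scheduler', env))]
    procs += [ctx.Process(target=_run_role, args=('server', env))
              for _ in range(num_servers)]
    procs += [ctx.Process(target=_run_role, args=('worker', env, worker_fn))
              for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each worker_fn would then create its own distributed kvstore, e.g. `kv = mxnet.kvstore.create('dist_sync')`, and run its training loop; with the 'spawn' context, the calling script also needs the usual `if __name__ == '__main__':` guard.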