chunweiyuan opened 11 months ago
Yes, it looks like this is probably an issue with the dataloaders competing for resources. Do you know if the data loading is using multithreading under the hood? If so, there can be contention between the two dataloaders, and you will need to either manually limit the dataloaders to different cores, or share the same dataloader between Ray tasks/actors.
Hi @stephanie-wang,
Yes, I believe under the hood DataLoader uses Python's multiprocessing package. The high-level documentation for num_workers is here. The default multiprocessing context on Unix seems to be fork (link). On my end I've experimented with various permutations of num_workers=0 or 1 and multiprocessing_context="fork", "spawn", or "forkserver", and none of them resolves the bottleneck once n_tasks > 1.
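For reference, the knobs I'm varying are plain torch.utils.data.DataLoader arguments. Here is a minimal, self-contained sketch (the dummy TensorDataset is only a stand-in for the actual pytorch-forecasting loaders, not part of my script):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Dummy data standing in for the real pytorch-forecasting loaders.
    dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=1,                    # 0 loads in the main process; >0 uses worker processes
        multiprocessing_context="spawn",  # only valid with num_workers > 0; "fork"/"forkserver" also tried
    )
    for batch in loader:                  # iterate once to exercise the worker processes
        pass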
I have played with os.sched_setaffinity() like this within my Ray task:
cpu_ids = list(os.sched_getaffinity(0))   # CPUs currently available to this process
print(seed, os.sched_getaffinity(0))
print(f"setting seed {seed} to use {cpu_ids[seed]}")
os.sched_setaffinity(0, [cpu_ids[seed]])  # pin this task to a single core
print(f"now seed {seed} uses cpu_ids: ", os.sched_getaffinity(0))
and obtained the following output:
(train_task pid=1380821) 1 {96, 98, 100, 102, 40, 42, 44, 46, 48, 50, 92, 94}
(train_task pid=1380821) setting seed 1 to use 98
(train_task pid=1380821) now seed 1 uses cpu_ids: {98}
(train_task pid=1380821) [rank: 0] Seed set to 1
(train_task pid=1380820) [rank: 0] Seed set to 0
(train_task pid=1380820) 0 {96, 98, 100, 102, 40, 42, 44, 46, 48, 50, 92, 94}
(train_task pid=1380820) setting seed 0 to use 96
(train_task pid=1380820) now seed 0 uses cpu_ids: {96}
but then everything moves even slower, whether I set num_workers=0 or 1:
Epoch 0: 1%| | 1/161 [01:10<3:08:37, 0.01it/s, v_num=5.98e+7, train_loss_step=301.0]
Epoch 0: 1%| | 1/161 [01:10<3:08:48, 0.01it/s, v_num=5.98e+7, train_loss_step=331.0]
Epoch 0: 1%| | 2/161 [02:21<3:07:35, 0.01it/s, v_num=5.98e+7, train_loss_step=394.0]
Epoch 0: 1%| | 2/161 [02:21<3:08:01, 0.01it/s, v_num=5.98e+7, train_loss_step=320.0]
I'm not sure whether the root cause of the problem is exactly this, but I have tried some of the suggestions there, to no avail.
Hi @chunweiyuan, I tried out your script on my laptop and wasn't able to reproduce with num_tasks=2 and 4. The only changes I made were to the ray.init line (my machine did not have enough memory, and setting num_cpus manually should not be necessary):
ray.init(include_dashboard=False,
         # num_cpus=n_tasks + 1,          # just some number >= n_tasks
         # object_store_memory=50 * 1e9,  # large enough
         runtime_env=runtime_env)
Can you provide more details on the machine that you're running on? Perhaps there is just not enough physical compute.
Hi @stephanie-wang,
Thank you very much for your reply. I think you're onto something. Here's my long-winded follow-up; please bear with me:
When I test out the script (n_tasks >= 2) on my work laptop (MacBook Pro 2023, Apple M2 Pro, 16 GB, Sonoma 14.2.1, osx-arm64), I get some nice >12 it/s training rates for each task:
Epoch 0: 3%|▎ | 5/161 [00:00<00:12, 12.41it/s, v_num=3, train_loss_step=384.0]
Epoch 0: 27%|██▋ | 43/161 [00:03<00:09, 11.97it/s, v_num=3, train_loss_step=130.0]
But when I run it on a cluster node (Intel Xeon Gold 6230, 768 GB, x86_64, GNU/Linux, ubuntu 20.04.6, slurm 23.02.5), I get ~0.1 it/s again:
Epoch 0: 1%| | 2/161 [00:19<25:44, 0.10it/s, v_num=6.16e+7, train_loss_step=394.0]
Epoch 0: 1%| | 2/161 [00:20<27:01, 0.10it/s, v_num=6.16e+7, train_loss_step=320.0]
In both cases I used conda 23.11.0 and installed my environments with the following steps:
conda create -n tft python=3.10 numpy xarray pandas cython netcdf4 scipy
conda activate tft
pip install pytorch-forecasting
pip install -U 'ray[default]'
I looked at my two environments (MacBook vs. cluster), and I notice that on typing conda list openmp, the MacBook env has
# Name Version Build Channel
llvm-openmp 14.0.6 hc6e5704_0
whereas the cluster env has
# Name Version Build Channel
_openmp_mutex 4.5 2_kmp_llvm conda-forge
llvm-openmp 14.0.6 h9e868ea_0
I wonder if the discrepancy comes from the issue mentioned here? Following some of the suggestions within, I rebuilt my env on the cluster. Now conda list openmp shows
# Name Version Build Channel
_openmp_mutex 5.1 1_gnu
intel-openmp 2023.1.0 hdb19cb5_46306
and running the script with n_tasks=2 yields
Epoch 0: 2%|▏ | 3/161 [00:01<01:14, 2.11it/s, v_num=6.16e+7, train_loss_step=368.0]
Epoch 0: 1%| | 2/161 [00:01<01:34, 1.68it/s, v_num=6.16e+7, train_loss_step=320.0]
whereas n_tasks=4 gives
Epoch 0: 2%|▏ | 4/161 [00:04<02:48, 0.93it/s, v_num=6.16e+7, train_loss_step=298.0]
Epoch 0: 2%|▏ | 3/161 [00:04<03:49, 0.69it/s, v_num=6.16e+7, train_loss_step=369.0]
While it is an improvement, it shows that the total throughput is still ~3.7 it/s (the n_tasks=1 case), and that adding workers just linearly scales down each task's rate. This is true whether I set num_workers=0 or 1, and it still indicates competition between workers.
Long story short, these are my new findings thus far:
- The script runs fine on the MacBook, but slows down on the Intel cluster node as soon as n_tasks > 1.
- Rebuilding the cluster env changed its openmp packages, which improved things somewhat, but the impedance remains.
I wonder if you observe the same trends on your end?
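For concreteness, here is a sketch of the kind of per-task thread capping that could test the OpenMP-oversubscription theory. The env_vars runtime_env and the torch.set_num_threads call are my assumptions, not part of the original script:

import ray
import torch

# Cap the OpenMP/MKL thread pools of every Ray worker process (assumed mitigation).
ray.init(include_dashboard=False,
         runtime_env={"env_vars": {"OMP_NUM_THREADS": "1",
                                   "MKL_NUM_THREADS": "1"}})

@ray.remote
def train_task(seed: int) -> int:
    torch.set_num_threads(1)  # also cap torch's own intra-op thread pool
    # ... build dataloaders and fit the model here ...
    return seed

ray.get([train_task.remote(s) for s in range(2)])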
Not sure if I can do much else here since I cannot reproduce, but I'd try checking a few things, including starting Ray manually with ray start --head before running any Ray script (the Ray script will connect automatically to the raylet server).
Hi @stephanie-wang,
Did you try it on a Mac ARM notebook as well? Were you able to test this on Intel CPUs, perhaps on a cluster?
I've tried step (2) that you suggested, to no avail. I think (1) and (3) require a little more time, which I will investigate later. I've also tried installing/removing different openmp/llvm packages from my environment, which seems to have some effect, but the impedance issue remains.
I don't expect to have anything meaningful to add to this ticket over the next few days. How should we handle its status?
Did you try it on a Mac ARM notebook as well? Were you able to test this on Intel CPUs, perhaps on a cluster?
I tested on Linux with Intel. Most likely it has something to do with the runtime env.
We can keep the ticket open for now if you find out more information, but there is not much we can do without a repro. Feel free to remove the needs-repro label once you have something.
Hi @stephanie-wang,
Sounds good. Do you mind sharing your environment information (python, unix, etc.) with me? If possible, I would like to replicate your environment as closely as I can. Thanks.
Hi @stephanie-wang, you mentioned that you tested the code on an Intel machine. Would it be possible to share the CPU info and the versions of the openmp packages in your env? On my end I have different Xeon Golds to choose from, and I would like to replicate your run as closely as I can. Thanks.
@iycheng could you try to reproduce it on Mac as oncall?
What happened + What you expected to happen
To collect statistics I'd like to run many TemporalFusionTransformer pipelines with different random number seeds. Using Ray to parallelize these tasks leads to a situation where the remote dataloaders seem to impede each other, leading to very slow training. I only have access to cpus.
Running the code snippet below (the first 80 lines copied from here) with n_tasks = 1 shows training occurring faster than 3 it/s. But once n_tasks > 1, loading drops to <= 0.1 it/s for each task (that is what the n_tasks = 2 run looks like). Completely removing Ray from the code and running it serially returns training to > 3 it/s.
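The structure of the script boils down to the following pattern (a bare sketch only, not the actual reproduction script; the real train_task seeds everything, builds the dataloaders, and fits the TemporalFusionTransformer with pytorch-lightning):

import ray

ray.init(include_dashboard=False)

@ray.remote
def train_task(seed: int) -> int:
    # Stand-in body: the real task seeds everything, builds the dataloaders,
    # and fits a TemporalFusionTransformer.
    return seed

n_tasks = 2                               # 1 trains fast; > 1 slows to a crawl
results = ray.get([train_task.remote(s) for s in range(n_tasks)])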
Might the concurrent dataloaders all be competing with each other for resources on the driver process, instead of using their own worker processes? I've also tried using Actors instead of Tasks, but the problem persists. The closest issue I've found thus far is this one, which is still open. Adding multiprocessing_context="fork", as indicated here (with num_workers=1), shows no effect.
Versions / Dependencies
ubuntu 20.04.6 slurm 23.02.5
lightning 2.1.2 python 3.10.13 pytorch-forecasting 1.0.0 pytorch-lightning 2.1.2 pytorch-optimizer 2.12.0 ray 2.9.0
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.