ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib][gcs][placementgroups] instability issues running tune/rllib #18003

Closed AmeerHajAli closed 3 years ago

AmeerHajAli commented 3 years ago

When I run RLlib on Ray 1.5.2: 1) The resource demands persist after the application finishes. For example, I still see the following resource demands from the scheduler for a few minutes, even after the job prints (pid=191) 2021-08-22 10:45:21,492 INFO tune.py:550 -- Total run time: 1095.71 seconds (1094.69 seconds for the tuning loop). :

Demands:
 {'CPU_group_8eb7d5e8a4ed413432db93d0b79b3e67': 1.0}: 96+ pending tasks/actors
 {'GPU_group_16cd93bbf7607454e10fb4e3334f5da6': 0.001, 'GPU_group_0_16cd93bbf7607454e10fb4e3334f5da6': 0.001}: 1+ pending tasks/actors
 {'GPU_group_1431a0326b37900afe3595513b2e1818': 0.001, 'GPU_group_0_1431a0326b37900afe3595513b2e1818': 0.001}: 1+ pending tasks/actors
 {'CPU': 1.0, 'GPU': 1.0} * 1, {'CPU': 1.0} * 128 (PACK): 1+ pending placement groups

2) RLlib prints a very verbose list of resources in its status output:

(pid=191) == Status ==
(pid=191) Memory usage on this node: 6.1/31.4 GiB
(pid=191) Using FIFO scheduling algorithm.
(pid=191) Resources requested: 0/296 CPUs, 0/8 GPUs, 0.0/787.44 GiB heap, 0.0/338.81 GiB objects (0.0/1.0 CPU_group_15_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_2_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_4_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_6_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_12_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_13_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_10_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_1_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_7_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_9_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_3_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_11_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_8_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_14_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_5_8c84f56bef40324a35f6e63418c2a54d, 0.0/129.0 CPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/8.0 accelerator_type:T4, 0.0/1.0 CPU_group_116_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_127_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_117_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_119_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_121_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_113_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_124_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_123_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_118_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_115_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_126_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_120_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_114_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_125_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_122_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_112_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_128_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_83_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_85_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_94_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_87_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_90_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_84_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_88_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_82_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_89_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_91_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_92_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_86_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_80_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_93_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_81_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_95_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_100_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_97_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_103_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_108_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_98_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_104_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_111_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_102_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_96_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_99_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_110_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_101_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_106_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_109_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_105_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_107_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_43_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_33_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_36_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_32_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_34_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_35_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_37_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_40_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_39_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_42_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_45_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_44_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_41_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_46_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_47_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_38_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_21_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_18_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_28_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_16_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_19_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_25_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_20_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_27_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_17_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_24_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_22_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_26_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_23_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_30_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_31_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_29_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_71_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_72_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_76_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_68_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_79_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_78_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 
CPU_group_70_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_69_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_67_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_65_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_64_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_75_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_66_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_74_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_73_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_77_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_52_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_48_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_63_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_56_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_54_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_62_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_55_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_59_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_51_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_53_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_57_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_58_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_50_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_49_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_61_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_60_8c84f56bef40324a35f6e63418c2a54d)

3) RLlib sometimes requests a lot of resources, and if the cluster cannot scale up far enough to accommodate the request, it ends up adding nodes, removing them again for being idle, and hanging forever. (E.g., it requests resources that would need 200 nodes, but the cluster can scale only to 10 nodes, so it keeps adding 10 nodes and removing them while the trials say “pending”.)

4) I think we should have e2e tests of RLlib with GPUs. These might already exist, but for some reason I am not able to run, for example, either of the following (the cluster keeps adding and removing nodes as in issue 3): ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml or ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/impala/atari-impala-large.yaml

5) When I run ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml I get a lot of:

A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: ffffffffffffffffa70b3f9b10676c460808312e01000000 Worker ID: 1d806191d3304d0dbcc5fabedf3eefd9e6f12694227b34ae602c0203 Node ID: 3d02b42b39be8dbcd291b2611f9c36841f00f38e98c599c55ecfe827 Worker IP address: 192.168.75.4 Worker port: 10059 Worker PID: 446844
(pid=237) 2021-08-22 13:08:42,288   ERROR trial_runner.py:773 -- Trial APEX_BreakoutNoFrameskip-v4_95b82_00015: Error processing event.
(pid=237) Traceback (most recent call last):
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 739, in _process_trial
(pid=237)     results = self.trial_executor.fetch_result(trial)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 729, in fetch_result
(pid=237)     result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237)     return func(*args, **kwargs)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1564, in get
(pid=237)     raise value.as_instanceof_cause()
(pid=237) ray.exceptions.RayTaskError: ray::APEX.train_buffered() (pid=220341, ip=192.168.75.4)
(pid=237)   File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
(pid=237)   File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
(pid=237)     return method(__ray_actor, *args, **kwargs)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 178, in train_buffered
(pid=237)     result = self.train()
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 640, in train
(pid=237)     raise e
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 629, in train
(pid=237)     result = Trainable.train(self)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 237, in train
(pid=237)     result = self.step()
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 170, in step
(pid=237)     res = next(self.train_exec_impl)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237)     return next(self.built_iterator)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
(pid=237)     item = next(it)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237)     return next(self.built_iterator)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   [Previous line repeated 1 more time]
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 551, in base_iterator
(pid=237)     batch = ray.get(obj_ref)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237)     return func(*args, **kwargs)
(pid=237) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

CC @wuisawesome
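
Item 3 above describes a request that can never fit in the cluster. A minimal sketch of a pre-flight check follows, assuming identical nodes and a simple first-fit packing. This is not how Ray's scheduler or autoscaler actually decides; `nodes_needed`, `is_feasible`, and the 16-CPU/1-GPU node shape are hypothetical names and values chosen for illustration. The bundle list mirrors the pending placement group shown in the demands under item 1.

```python
from typing import Dict, List

Bundle = Dict[str, float]


def nodes_needed(bundles: List[Bundle], node_capacity: Bundle) -> int:
    """Estimate how many identical nodes the bundles need (greedy first-fit)."""
    nodes: List[Bundle] = []  # remaining free capacity of each node opened so far
    for bundle in bundles:
        for free in nodes:
            # Place the bundle on the first node that still has room for it.
            if all(free.get(k, 0.0) >= v for k, v in bundle.items()):
                for k, v in bundle.items():
                    free[k] -= v
                break
        else:
            # No open node fits: open a new one, unless even an empty node is too small.
            if any(node_capacity.get(k, 0.0) < v for k, v in bundle.items()):
                raise ValueError(f"bundle {bundle} does not fit on any node")
            free = dict(node_capacity)
            for k, v in bundle.items():
                free[k] -= v
            nodes.append(free)
    return len(nodes)


def is_feasible(bundles: List[Bundle], node_capacity: Bundle, max_nodes: int) -> bool:
    """True if the estimate stays within the cluster's maximum node count."""
    return nodes_needed(bundles, node_capacity) <= max_nodes


# Bundles from the pending placement group in the report:
# {'CPU': 1.0, 'GPU': 1.0} * 1, {'CPU': 1.0} * 128
bundles = [{"CPU": 1.0, "GPU": 1.0}] + [{"CPU": 1.0}] * 128
node = {"CPU": 16.0, "GPU": 1.0}  # hypothetical node shape

print(nodes_needed(bundles, node))
print(is_feasible(bundles, node, max_nodes=10))
```

A check like this, run before submitting trials, would fail fast instead of leaving trials pending while nodes are repeatedly added and removed.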

krfricke commented 3 years ago

Thank you so much for discovering these. I'll investigate further and might create child issues to track these individually after finding out the cause.

AmeerHajAli commented 3 years ago

@krfricke, after running compact-regression-test.yaml, I am also getting:

Traceback (most recent call last):
  File "/Users/ameerhajali/anaconda3/envs/ray/bin/rllib", line 8, in <module>
    sys.exit(cli())
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/scripts.py", line 34, in cli
    train.run(options, train_parser)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/train.py", line 255, in run
    concurrent=True)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/tune/tune.py", line 624, in run_experiments
    _remote=False))
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 81, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 225, in get
    res = self._get(obj_ref, op_timeout)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 244, in _get
    err = cloudpickle.loads(data.error)
ModuleNotFoundError: No module named 'tblib'
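
The traceback above fails on the client side, inside `cloudpickle.loads(data.error)`, because the local environment cannot import `tblib` while unpickling the remote error. A small sketch for checking which top-level modules are missing locally, using only the standard library (`missing_modules` is a hypothetical helper name, and the list of modules to check is an assumption):

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules that appear in the traceback; extend the list as needed.
print(missing_modules(["tblib", "cloudpickle"]))
```

Running the same check on the cluster image and on the local machine shows whether the two environments disagree.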

krfricke commented 3 years ago

  1. I will look into this today.
  2. Should be fixed in latest master.
  3. 18018
  4. See 3.
  5. Might be unrelated to RLlib, but I'll look into this today.
  6. This is probably not related to RLlib either, as it occurs during general error handling (pickling an exception). Still, it's unclear why there is a dependency mismatch here. If I see it, I'll try to figure out what's going on. It might be helpful to provide your environment information (pip freeze -l) here.

krfricke commented 3 years ago

I can't repro 5 and 6. Does this come up immediately? (It ran for ~1.5 hours without any problems). If it still comes up for you, can you post some local environment information (Python version and pip freeze -l)?

AmeerHajAli commented 3 years ago
(ray) ~/Desktop> pip freeze -l
aiobotocore==1.2.2
aiodataloader==0.2.0
aiofiles==0.5.0
aiohttp==3.7.4.post0
aiohttp-cors==0.7.0
aiohttp-middlewares==1.1.0
aioitertools==0.7.1
aiojobs==0.3.0
aiopg==1.2.0
aioredis==1.3.1
alabaster==0.7.12
alchemy-mock==0.4.3
alembic==1.5.2
aniso8601==7.0.0
anyio==2.2.0
anyscale==0.4.18
apipkg==1.5
appdirs==1.4.4
appnope==0.1.0
argon2==0.1.10
argon2-cffi==20.1.0
asgiref==3.3.1
astroid==2.5.6
async-exit-stack==1.0.1
async-generator==1.10
async-timeout==3.0.1
asyncache==0.1.1
asyncpg==0.21.0
asynctest==0.13.0
attrs==20.3.0
aws==0.2.5
aws-sam-translator==1.28.1
aws-xray-sdk==2.6.0
awscli==1.19.62
awspricing==2.0.3
Babel==2.9.0
backcall==0.2.0
backoff==1.10.0
bcrypt==3.1.7
beautifulsoup4==4.9.1
black==19.10b0
bleach==3.1.5
blessings==1.7
blis==0.7.4
boto==2.49.0
boto3==1.16.52
botocore==1.19.52
cachetools==4.2.0
caffeinate==0.1.0
catalogue==1.0.0
certifi==2020.12.5
cffi==1.14.4
cfgv==3.2.0
cfn-lint==0.39.0
chardet==3.0.4
click==7.1.2
cliff==3.6.0
cloudpickle==1.6.0
cmaes==0.7.1
cmd2==1.5.0
cmdstanpy==0.9.68
colorama==0.4.4
coloredlogs==15.0
colorful==0.5.4
colorlog==4.7.2
colorthief==0.2.1
commonmark==0.8.1
conda-pack==0.6.0
ConfigArgParse==1.4
convertdate==2.3.2
coverage==5.3.1
cryptography==3.3.1
cycler==0.10.0
cymem==2.0.5
Cython==0.29
dask==2021.4.0
databases==0.4.2
dataclasses==0.6
decorator==4.4.2
defusedxml==0.6.0
Deprecated==1.2.12
distlib==0.3.1
dm-tree==0.1.6
dnspython==2.1.0
docker==4.4.1
docspec==0.2.1
docspec-python==0.2.0
docutils==0.14
ecdsa==0.14.1
email-validator==1.1.2
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
ephem==3.7.7.1
execnet==1.7.1
expiringdict==1.1.4
fabric==2.5.0
fastapi==0.59.0
filelock==3.0.12
flake8==3.8.4
flake8-alfred==1.1.1
flake8-import-order==0.18.1
flake8-polyfill==1.0.2
flake8-quotes==3.2.0
Flask==1.1.4
Flask-BasicAuth==0.2.0
Flask-Cors==3.0.10
flask-pytest==0.0.5
Flask-RESTful==0.3.8
flatbuffers==1.12
freezegun==1.1.0
fsspec==2021.6.1
future==0.18.2
gensim==3.8.3
gevent==21.1.2
geventhttpclient==1.4.4
gitdb==4.0.5
GitPython==3.1.3
google==3.0.0
google-api-core==1.25.0
google-api-python-client==1.12.8
google-auth==1.24.0
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.2
google-cloud==0.34.0
google-cloud-billing==1.1.0
google-cloud-core==1.5.0
google-cloud-iam==2.0.0
google-cloud-resource-manager==0.30.3
googleapis-common-protos==1.52.0
gpustat==0.6.0
graphene==2.1.8
graphql-core==2.3.2
graphql-relay==2.0.1
greenlet==1.0.0
grimp==1.2.3
grpc-google-iam-v1==0.12.3
grpc-stubs==1.24.3
grpcio==1.35.0
grpcio-tools==1.35.0
gym==0.18.0
h11==0.9.0
hijri-converter==2.1.1
hiredis==2.0.0
holidays==0.11.1
httplib2==0.18.1
httptools==0.1.1
humanfriendly==9.1
hurry.filesize==0.9
identify==2.2.4
idna==2.10
imagesize==1.2.0
import-linter==1.2.1
importlib-metadata==4.0.1
iniconfig==1.1.1
invoke==1.4.1
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
iso8601==0.1.14
isort==5.8.0
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==1.0.0
json5==0.9.5
jsondiff==1.2.0
jsonpatch==1.28
jsonpickle==1.4.1
jsonpointer==2.0
jsonschema==3.2.0
junit-xml==1.9
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyter-packaging==0.7.12
jupyter-server==1.4.1
jupyterlab==3.0.12
jupyterlab-server==2.3.0
kiwisolver==1.3.1
kopf==1.32.1
korean-lunar-calendar==0.2.1
kubernetes==17.17.0
kubernetes-asyncio==12.0.1
launchdarkly-server-sdk==6.13.1
lazy-object-proxy==1.6.0
libcst==0.3.16
libhoney==1.9.0
locket==0.2.1
locust==1.4.3
LunarCalendar==0.0.9
lz4==3.1.3
Mako==1.1.4
MarkupSafe==1.1.1
matplotlib==3.3.4
mccabe==0.6.1
mistune==0.8.4
mock==1.0.1
modin==0.10.0
more-itertools==8.7.0
moto==1.3.16
msgpack==1.0.2
multidict==5.1.0
murmurhash==1.0.5
mypy==0.790
mypy-extensions==0.4.3
nbclassic==0.2.6
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.5.1
nltk==3.6.2
nodeenv==1.6.0
notebook==6.0.3
npm==0.1.1
nr.collections==0.0.1
nr.databind.core==0.0.22
nr.databind.json==0.0.14
nr.fs==1.6.3
nr.interface==0.0.5
nr.metaclass==0.0.6
nr.parsing.date==0.6.1
nr.pylang.utils==0.0.4
nr.stream==0.0.5
nr.utils.re==0.1.1
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauth2client==3.0.0
oauthlib==3.1.0
onelogin==2.0.2
opencensus==0.7.12
opencensus-context==0.1.2
opencv-python-headless==4.3.0.36
opentelemetry-api==1.4.1
opentelemetry-exporter-otlp==0.17b0
opentelemetry-exporter-otlp-proto-grpc==1.4.1
opentelemetry-ext-asgi==0.11b0
opentelemetry-ext-asyncpg==0.11b0
opentelemetry-ext-botocore==0.11b0
opentelemetry-ext-honeycomb==0.5b0
opentelemetry-instrumentation==0.23b2
opentelemetry-instrumentation-asgi==0.17b0
opentelemetry-instrumentation-asyncpg==0.17b0
opentelemetry-instrumentation-botocore==0.17b0
opentelemetry-instrumentation-sqlalchemy==0.17b0
opentelemetry-instrumentation-starlette==0.17b0
opentelemetry-proto==1.4.1
opentelemetry-sdk==1.4.1
opentelemetry-semantic-conventions==0.23b2
optional-django==0.1.0
optuna==2.5.0
orjson==3.4.7
packaging==20.8
pandas==1.2.4
pandoc==1.0.2
pandocfilters==1.4.2
paramiko==2.7.1
parso==0.7.1
partd==1.1.0
pathspec==0.8.1
pbr==5.5.1
pep8-naming==0.11.1
pexpect==4.8.0
pickle5==0.0.11
pickleshare==0.7.5
Pillow==7.2.0
pip-tools==5.5.0
plac==1.1.3
plotly==4.14.3
pluggy==0.13.1
ply==3.11
postgres==3.0.0
pre-commit==2.12.1
preshed==3.0.5
prettytable==0.7.2
prometheus-client==0.10.1
promise==2.3
prompt-toolkit==3.0.6
prophet==1.0.1
proto-plus==1.13.0
protobuf==3.15.3
psutil==5.8.0
psycopg2-binary==2.8.6
psycopg2-pool==1.1
ptyprocess==0.6.0
py==1.10.0
py-spy==0.3.5
pyaml==20.4.0
pyarrow==3.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybase62==0.4.3
pycodestyle==2.6.0
pycparser==2.20
pydantic==1.8.1
pydata-sphinx-theme==0.4.3
pydoc-markdown==3.13.0
pydocstyle==5.0.2
pyflakes==2.2.0
PyGithub==1.55
pyglet==1.5.0
Pygments==2.3.1
PyJWT==2.1.0
pylama==7.7.1
pylint==2.8.2
PyMeeus==0.5.11
PyNaCl==1.4.0
pynput==1.7.3
pyobjc-core==7.3
pyobjc-framework-Cocoa==7.3
pyobjc-framework-Quartz==7.3
pyparsing==2.4.7
pyperclip==1.8.1
pyRFC3339==1.1
pyrsistent==0.17.3
pystan==2.19.1.1
pytest==6.2.1
pytest-aiohttp==0.3.0
pytest-asyncio==0.14.0
pytest-azurepipelines==0.8.0
pytest-cov==2.11.1
pytest-flask==1.0.0
pytest-forked==1.3.0
pytest-timeout==1.4.2
pytest-tornado==0.8.1
pytest-xdist==2.2.0
python-dateutil==2.8.1
python-editor==1.0.4
python-engineio==3.14.2
python-jose==3.2.0
python-json-logger==2.0.1
python-multipart==0.0.5
python-socketio==4.6.0
python3-wget==0.0.2b1
pytz==2020.5
PyYAML==5.4.1
pyzmq==19.0.2
ray==1.5.2
readthedocs-sphinx-ext==1.0.4
recommonmark==0.5.0
redis==3.5.0
regex==2021.4.4
requests==2.25.1
requests-oauthlib==1.3.0
responses==0.12.0
retrying==1.3.3
rsa==4.7
Rx==1.6.1
s3fs==2021.6.1
s3transfer==0.3.7
sacremoses==0.0.43
scalesec-gcp-workload-identity==1.0.7
scikit-learn==0.23.2
scikit-optimize==0.8.1
scipy==1.5.4
semver==2.13.0
Send2Trash==1.5.0
sentencepiece==0.1.95
sentry-sdk==1.1.0
setuptools-git==1.2
six==1.15.0
sklearn==0.0
smart-open==5.1.0
smmap==3.0.4
sniffio==1.2.0
snowballstemmer==2.0.0
soupsieve==2.0.1
spacy==2.3.5
Sphinx==3.0.4
sphinx-book-theme==0.0.39
sphinx-click==2.5.0
sphinx-copybutton==0.3.1
sphinx-gallery==0.8.2
sphinx-jsonschema==1.16.7
sphinx-tabs==2.0.1
sphinx-version-warning==1.1.2
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
sphinxcontrib-websupport==1.2.4
sphinxcontrib.yt==0.2.2
sphinxemoji==0.1.8
SQLAlchemy==1.4.0b1
sqlalchemy-stubs==0.4
srsly==1.0.5
sshpubkeys==3.1.0
starlette==0.13.4
statsd==3.3.0
stevedore==3.3.0
svgwrite==1.4.1
tabulate==0.8.7
tensorboardX==2.1
terminado==0.8.3
testfixtures==6.15.0
testpath==0.4.4
texthero==1.0.9
thinc==7.4.5
threadpoolctl==2.1.0
tokenizers==0.8.1rc2
toml==0.10.2
toolz==0.11.1
torch==1.7.1
torchvision==0.8.2
tornado==6.1
tqdm==4.56.0
traitlets==4.3.3
transformers==3.1.0
tune-sklearn==0.2.1
typed-ast==1.4.2
typer==0.3.2
typing-extensions==3.10.0.0
typing-inspect==0.6.0
ujson==3.2.0
Unidecode==1.2.0
uritemplate==3.0.1
urllib3==1.26.2
uvicorn==0.11.8
uvloop==0.14.0
virtualenv==20.4.4
vulture==2.3
wasabi==0.8.2
watchdog==1.0.2
wcwidth==0.1.9
webencodings==0.5.1
websocket-client==0.57.0
websockets==8.1
Werkzeug==1.0.1
wordcloud==1.8.1
wrapt==1.12.1
xgboost==1.4.2
xgboost-ray==0.1.1
xmltodict==0.12.0
yapf==0.23.0
yarl==1.6.3
yaspin==1.0.0
zipp==3.4.1
zope.event==4.5.0
zope.interface==5.3.0

Python 3.7. I think it is straightforward to repro if you run against a session in the product with the default cluster compute.

AmeerHajAli commented 3 years ago

CC @wuisawesome, I think the placement groups are potentially leaking or not being cleaned up appropriately.
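
One way to check for leftover placement groups after a run is to look at their reported states. The sketch below is a pure-Python filter over a table of placement groups. It assumes the table is a dict keyed by placement group ID whose values carry a "state" field, which is the shape `ray.util.placement_group_table()` is expected to return; verify that against your Ray version. The states in `example_table` are made up for illustration and only the IDs come from the demands above.

```python
from typing import Dict, List


def live_placement_groups(table: Dict[str, dict]) -> List[str]:
    """Return the IDs of placement groups whose state is anything but REMOVED."""
    return sorted(
        pg_id for pg_id, info in table.items() if info.get("state") != "REMOVED"
    )


# Hypothetical table; on a live cluster it would come from
# ray.util.placement_group_table() (assumed shape, see the note above).
example_table = {
    "8eb7d5e8a4ed413432db93d0b79b3e67": {"state": "CREATED"},
    "16cd93bbf7607454e10fb4e3334f5da6": {"state": "PENDING"},
    "1431a0326b37900afe3595513b2e1818": {"state": "REMOVED"},
}

print(live_placement_groups(example_table))
```

If this list is non-empty after the Tune run has printed its total run time, the remaining groups are candidates for the leak described here.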

rkooo567 commented 3 years ago

I believe this should be fixed in master. Please reopen if you see the issue again.