Closed: MrRace closed this issue 3 years ago
Are you seeing this getting created consistently? I don't see it on my end when I run the script.
Yes, it consistently happens when I use `--smoke-test true`.
Hmm, this is strange. Did you make any changes to the pbt_transformers script?
Also, if you try another, simpler Tune example, like the Quick Start on our docs (https://docs.ray.io/en/master/tune/index.html#quick-start) or the PBT ConvNet example (`python ray/python/ray/tune/examples/pbt_convnet_function_example.py --smoke-test`), do you still see the file being created?
Using the example from the Quick Start on the docs (https://docs.ray.io/en/master/tune/index.html#quick-start) also creates the core file.
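For reference, this is (approximately) the snippet I ran from that page; the exact `get_best_config` arguments may differ slightly between Ray versions:

```python
from ray import tune


def objective(step, alpha, beta):
    # Toy objective from the Quick Start: decays with `step`, offset by `beta`.
    return (0.1 + alpha * step / 100) ** (-1) + beta * 0.1


def training_function(config):
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Any iterative training procedure could go here.
        intermediate_score = objective(step, alpha, beta)
        # Report the score back to Tune.
        tune.report(mean_loss=intermediate_score)


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3]),
    },
)

print("Best config:", analysis.get_best_config(metric="mean_loss", mode="min"))
```

The terminal output from this run is below: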
2020-09-24 15:32:02,669 INFO services.py:1166 -- View the Ray dashboard at http://127.0.0.1:8265
2020-09-24 15:32:02,672 WARNING services.py:1625 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-09-24 15:32:04,109 WARNING function_runner.py:486 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2020-09-24 15:32:04,184 WARNING tune.py:396 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2020-09-24 15:32:04,260 ERROR syncer.py:63 -- Log sync requires rsync to be installed.
== Status ==
Memory usage on this node: 42.6/376.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/32 CPUs, 0/3 GPUs, 0.0/235.45 GiB heap, 0.0/72.36 GiB objects
Result logdir: /root/ray_results/training_function
Number of trials: 3 (2 PENDING, 1 RUNNING)
+-------------------------------+----------+-------+---------+--------+
| Trial name | status | loc | alpha | beta |
|-------------------------------+----------+-------+---------+--------|
| training_function_18f3e_00000 | RUNNING | | 0.001 | 3 |
| training_function_18f3e_00001 | PENDING | | 0.01 | 1 |
| training_function_18f3e_00002 | PENDING | | 0.1 | 2 |
+-------------------------------+----------+-------+---------+--------+
Result for training_function_18f3e_00000:
date: 2020-09-24_15-32-05
done: false
experiment_id: e8cfd1e018eb473abf9a68a83dc3e502
experiment_tag: 0_alpha=0.001,beta=3
hostname: dfc587d32eef
iterations_since_restore: 1
mean_loss: 10.3
neg_mean_loss: -10.3
node_ip: 172.17.0.3
pid: 20058
time_since_restore: 0.00024628639221191406
time_this_iter_s: 0.00024628639221191406
time_total_s: 0.00024628639221191406
timestamp: 1600961525
timesteps_since_restore: 0
training_iteration: 1
trial_id: 18f3e_00000
Result for training_function_18f3e_00002:
date: 2020-09-24_15-32-05
done: false
experiment_id: c1d18a3bfac24653bdcc759d3302f7b9
experiment_tag: 2_alpha=0.1,beta=2
hostname: dfc587d32eef
iterations_since_restore: 1
mean_loss: 10.2
neg_mean_loss: -10.2
node_ip: 172.17.0.3
pid: 20067
time_since_restore: 0.00028967857360839844
time_this_iter_s: 0.00028967857360839844
time_total_s: 0.00028967857360839844
timestamp: 1600961525
timesteps_since_restore: 0
training_iteration: 1
trial_id: 18f3e_00002
Result for training_function_18f3e_00001:
date: 2020-09-24_15-32-05
done: false
experiment_id: b3912ebc25914f15a6655d9dfc84b93f
experiment_tag: 1_alpha=0.01,beta=1
hostname: dfc587d32eef
iterations_since_restore: 1
mean_loss: 10.1
neg_mean_loss: -10.1
node_ip: 172.17.0.3
pid: 20055
time_since_restore: 0.00029921531677246094
time_this_iter_s: 0.00029921531677246094
time_total_s: 0.00029921531677246094
timestamp: 1600961525
timesteps_since_restore: 0
training_iteration: 1
trial_id: 18f3e_00001
== Status ==
Memory usage on this node: 42.7/376.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0/3 GPUs, 0.0/235.45 GiB heap, 0.0/72.36 GiB objects
Result logdir: /root/ray_results/training_function
Number of trials: 3 (3 TERMINATED)
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+
| Trial name | status | loc | alpha | beta | loss | iter | total time (s) | neg_mean_loss |
|-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------|
| training_function_18f3e_00000 | TERMINATED | | 0.001 | 3 | 10.291 | 10 | 0.0950978 | -10.291 |
| training_function_18f3e_00001 | TERMINATED | | 0.01 | 1 | 10.0108 | 10 | 0.150772 | -10.0108 |
| training_function_18f3e_00002 | TERMINATED | | 0.1 | 2 | 9.37431 | 10 | 0.138224 | -9.37431 |
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+
Can you post the core file? I've never seen it and would like to know what's inside.
It is a binary file, do you still want it?
@richardliaw The core file can be downloaded from https://drive.google.com/file/d/13EoT0RMlJbLEZAfIFpUg32QCyOKzRc-b/view?usp=sharing
@MrRace you said you were running this on Docker right? Would you be able to try running the Quick Start example without using Docker to see if the core file still gets created? I've never seen this before so just trying to isolate the problem.
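If it helps with isolating this, one generic thing you could compare inside and outside the container is the core-dump size limit of the Python process (just a hypothetical diagnostic, nothing Ray-specific); it is common for that limit to differ between a container and the host, which could explain why the `core` file only shows up in one place:

```python
import resource

# Soft/hard limits for core dump size, in bytes.
# 0 means no core file is ever written; resource.RLIM_INFINITY means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("RLIMIT_CORE soft:", soft, "hard:", hard)
```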
@amogkam I tried running the Quick Start example outside of Docker and it does not create the core file. The Docker image I used is pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime, so maybe you can reproduce it as well. Looking forward to getting this unexpected problem solved. Best regards!
Great catch! Can you try this: https://stackoverflow.com/questions/58704192/how-to-disable-core-file-dumps-in-docker-container
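The linked answer disables core dumps for the whole container (e.g. via `ulimit -c 0` or `docker run --ulimit core=0`), which is probably what you want here since the crash may come from a worker process. For completeness, the per-process equivalent from Python looks roughly like this (a sketch; it only affects the process that runs it):

```python
import resource

# Set the maximum core dump size for this process to 0 bytes,
# so no `core` file is written if it crashes.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```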
So the fix is just to disable the core file? Is there any other underlying risk?
No security risk afaik. Did it work?
Yes, it works. Thanks a lot!
Awesome!
I ran the script from https://github.com/ray-project/ray/tree/master/python/ray/tune/examples/pbt_transformers
and found a core file in the script directory.
What is it? It confuses me. The logs printed in the terminal are below:
Is there something I did wrong that causes the core file to appear? Thanks a lot!