ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] What is the core.4424 file? #10990

Closed MrRace closed 3 years ago

MrRace commented 3 years ago

I ran the script from https://github.com/ray-project/ray/tree/master/python/ray/tune/examples/pbt_transformers and found a core file in the script directory (see the attached screenshot).

What is this file? It confuses me. The terminal output is below:

2020-09-24 09:53:26.958514: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
args.smoke_test= True
2020-09-24 09:53:28,778 INFO services.py:1166 -- View the Ray dashboard at http://127.0.0.1:8265
2020-09-24 09:53:28,780 WARNING services.py:1625 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Downloading and caching Tokenizer
Downloading and caching pre-trained model
Some weights of the model checkpoint at /home/data/pretrain_models/chinese-bert_chinese_wwm_pytorch were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /home/data/pretrain_models/chinese-bert_chinese_wwm_pytorch and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
task_data_dir=/home/vod_tag/hypersearch/test_data/RTE,task_name=rte
2020-09-24 09:53:33,136 ERROR syncer.py:63 -- Log sync requires rsync to be installed.
== Status ==
Memory usage on this node: 43.2/376.2 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/32 CPUs, 0/3 GPUs, 0.0/249.71 GiB heap, 0.0/76.56 GiB objects
Result logdir: /root/ray_results/tune_transformer_pbt
Number of trials: 1 (1 RUNNING)
+-------------------------------+----------+-------+-----------+-------------+----------------+--------------+
| Trial name                    | status   | loc   |   w_decay |          lr |   train_bs/gpu |   num_epochs |
|-------------------------------+----------+-------+-----------+-------------+----------------+--------------|
| train_transformer_cea40_00000 | RUNNING  |       |  0.108942 | 4.16792e-05 |             32 |            4 |
+-------------------------------+----------+-------+-----------+-------------+----------------+--------------+

== Status ==
Memory usage on this node: 46.4/376.2 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/32 CPUs, 0/3 GPUs, 0.0/249.71 GiB heap, 0.0/76.56 GiB objects
Result logdir: /root/ray_results/tune_transformer_pbt
Number of trials: 1 (1 TERMINATED)
+-------------------------------+------------+-------+-----------+-------------+----------------+--------------+
| Trial name                    | status     | loc   |   w_decay |          lr |   train_bs/gpu |   num_epochs |
|-------------------------------+------------+-------+-----------+-------------+----------------+--------------|
| train_transformer_cea40_00000 | TERMINATED |       |  0.108942 | 4.16792e-05 |             32 |            4 |
+-------------------------------+------------+-------+-----------+-------------+----------------+--------------+

== Status ==
Memory usage on this node: 46.4/376.2 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/32 CPUs, 0/3 GPUs, 0.0/249.71 GiB heap, 0.0/76.56 GiB objects
Result logdir: /root/ray_results/tune_transformer_pbt
Number of trials: 1 (1 TERMINATED)
+-------------------------------+------------+-------+-----------+-------------+----------------+--------------+
| Trial name                    | status     | loc   |   w_decay |          lr |   train_bs/gpu |   num_epochs |
|-------------------------------+------------+-------+-----------+-------------+----------------+--------------|
| train_transformer_cea40_00000 | TERMINATED |       |  0.108942 | 4.16792e-05 |             32 |            4 |
+-------------------------------+------------+-------+-----------+-------------+----------------+--------------+

Did I do something wrong that caused the core file to be created? Thanks a lot!

amogkam commented 3 years ago

Are you seeing this getting created consistently? I don't see it on my end when I run the script.

MrRace commented 3 years ago

Are you seeing this getting created consistently? I don't see it on my end when I run the script.

Yes, it consistently happens when I use --smoke-test true.

amogkam commented 3 years ago

Hmm this is strange. Did you make any changes to the pbt_transformers script?

Also, if you try another, simpler Tune example, like the Quick Start in our docs (https://docs.ray.io/en/master/tune/index.html#quick-start) or the PBT ConvNet example (python ray/python/ray/tune/examples/pbt_convnet_function_example.py --smoke-test), do you still see the file being created?

MrRace commented 3 years ago

Hmm this is strange. Did you make any changes to the pbt_transformers script?

Also, if you try another, simpler Tune example, like the Quick Start in our docs (https://docs.ray.io/en/master/tune/index.html#quick-start) or the PBT ConvNet example (python ray/python/ray/tune/examples/pbt_convnet_function_example.py --smoke-test), do you still see the file being created?

Using the example from the Quick Start in the docs (https://docs.ray.io/en/master/tune/index.html#quick-start) also creates the core file.
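
For reference, here is a sketch of the kind of Quick Start script I ran, reconstructed from the linked docs page; the exact objective and parameter values are assumptions, but the config keys (alpha, beta) and the reported mean_loss metric match the trial output below.

```python
# Hedged reconstruction of the Tune Quick Start script (from the linked docs);
# exact values may differ, but the config keys and metric match the logs below.
from ray import tune


def objective(step, alpha, beta):
    # Toy objective that improves slightly with each step.
    return (0.1 + alpha * step / 100) ** (-1) + beta * 0.1


def training_function(config):
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        score = objective(step, alpha, beta)
        tune.report(mean_loss=score)  # reported once per training iteration


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3]),
    },
)

print("Best config:", analysis.get_best_config(metric="mean_loss", mode="min"))
```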

2020-09-24 15:32:02,669 INFO services.py:1166 -- View the Ray dashboard at http://127.0.0.1:8265
2020-09-24 15:32:02,672 WARNING services.py:1625 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-09-24 15:32:04,109 WARNING function_runner.py:486 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2020-09-24 15:32:04,184 WARNING tune.py:396 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2020-09-24 15:32:04,260 ERROR syncer.py:63 -- Log sync requires rsync to be installed.
== Status ==
Memory usage on this node: 42.6/376.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/32 CPUs, 0/3 GPUs, 0.0/235.45 GiB heap, 0.0/72.36 GiB objects
Result logdir: /root/ray_results/training_function
Number of trials: 3 (2 PENDING, 1 RUNNING)
+-------------------------------+----------+-------+---------+--------+
| Trial name                    | status   | loc   |   alpha |   beta |
|-------------------------------+----------+-------+---------+--------|
| training_function_18f3e_00000 | RUNNING  |       |   0.001 |      3 |
| training_function_18f3e_00001 | PENDING  |       |   0.01  |      1 |
| training_function_18f3e_00002 | PENDING  |       |   0.1   |      2 |
+-------------------------------+----------+-------+---------+--------+

Result for training_function_18f3e_00000:
  date: 2020-09-24_15-32-05
  done: false
  experiment_id: e8cfd1e018eb473abf9a68a83dc3e502
  experiment_tag: 0_alpha=0.001,beta=3
  hostname: dfc587d32eef
  iterations_since_restore: 1
  mean_loss: 10.3
  neg_mean_loss: -10.3
  node_ip: 172.17.0.3
  pid: 20058
  time_since_restore: 0.00024628639221191406
  time_this_iter_s: 0.00024628639221191406
  time_total_s: 0.00024628639221191406
  timestamp: 1600961525
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 18f3e_00000

Result for training_function_18f3e_00002:
  date: 2020-09-24_15-32-05
  done: false
  experiment_id: c1d18a3bfac24653bdcc759d3302f7b9
  experiment_tag: 2_alpha=0.1,beta=2
  hostname: dfc587d32eef
  iterations_since_restore: 1
  mean_loss: 10.2
  neg_mean_loss: -10.2
  node_ip: 172.17.0.3
  pid: 20067
  time_since_restore: 0.00028967857360839844
  time_this_iter_s: 0.00028967857360839844
  time_total_s: 0.00028967857360839844
  timestamp: 1600961525
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 18f3e_00002

Result for training_function_18f3e_00001:
  date: 2020-09-24_15-32-05
  done: false
  experiment_id: b3912ebc25914f15a6655d9dfc84b93f
  experiment_tag: 1_alpha=0.01,beta=1
  hostname: dfc587d32eef
  iterations_since_restore: 1
  mean_loss: 10.1
  neg_mean_loss: -10.1
  node_ip: 172.17.0.3
  pid: 20055
  time_since_restore: 0.00029921531677246094
  time_this_iter_s: 0.00029921531677246094
  time_total_s: 0.00029921531677246094
  timestamp: 1600961525
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 18f3e_00001

== Status ==
Memory usage on this node: 42.7/376.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0/3 GPUs, 0.0/235.45 GiB heap, 0.0/72.36 GiB objects
Result logdir: /root/ray_results/training_function
Number of trials: 3 (3 TERMINATED)
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+
| Trial name                    | status     | loc   |   alpha |   beta |     loss |   iter |   total time (s) |   neg_mean_loss |
|-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------|
| training_function_18f3e_00000 | TERMINATED |       |   0.001 |      3 | 10.291   |     10 |        0.0950978 |       -10.291   |
| training_function_18f3e_00001 | TERMINATED |       |   0.01  |      1 | 10.0108  |     10 |        0.150772  |       -10.0108  |
| training_function_18f3e_00002 | TERMINATED |       |   0.1   |      2 |  9.37431 |     10 |        0.138224  |        -9.37431 |
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+
richardliaw commented 3 years ago

Can you post the core file? I've never seen it and would like to know what's inside.

MrRace commented 3 years ago

Can you post the core file? I've never seen it and would like to know what's inside.

It is a binary file. Do you still need it?

MrRace commented 3 years ago

@richardliaw The core file can be downloaded from https://drive.google.com/file/d/13EoT0RMlJbLEZAfIFpUg32QCyOKzRc-b/view?usp=sharing
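
For reference, a core.<pid> file (here core.4424, where 4424 is the PID) is a Linux core dump, written when a process crashes or receives a signal that dumps core. It can be inspected with gdb against the binary that produced it; a minimal sketch, assuming the dump came from a Python worker process (the actual binary may differ):

```bash
# Hedged sketch: inspect the dump with gdb. This assumes the dump was written
# by the Python interpreter running the trial; substitute the actual binary
# (e.g. the raylet) if gdb reports a mismatch.
gdb "$(which python)" core.4424

# At the (gdb) prompt:
#   bt            # backtrace of the faulting thread
#   info threads  # list all threads captured in the dump
```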

amogkam commented 3 years ago

@MrRace you said you were running this in Docker, right? Would you be able to try running the Quick Start example without Docker to see if the core file still gets created? I've never seen this before, so I'm just trying to isolate the problem.

MrRace commented 3 years ago

@amogkam I tried running the Quick Start example outside of Docker and it does not create the core file. The Docker image I used is pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime, so maybe you can reproduce it with that. Looking forward to getting this unexpected problem solved. Best regards!
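
As a side note, the logs above warn that the object store falls back to /tmp because /dev/shm is only 64 MiB; a hedged sketch of starting the same image with a larger shared-memory segment (the size here is an arbitrary example):

```bash
# Assumption: the container is started directly with `docker run`.
# A larger /dev/shm lets the Ray object store avoid the /tmp fallback.
docker run --shm-size=8g -it pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime bash
```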

richardliaw commented 3 years ago

Great catch! Can you try this: https://stackoverflow.com/questions/58704192/how-to-disable-core-file-dumps-in-docker-container
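
For reference, a hedged sketch of the kind of fix that answer describes (the exact flags depend on how the container is started):

```bash
# Option 1 (assumes the container is started with `docker run`): cap the core
# dump size at 0 so no core.<pid> files are written inside the container.
docker run --ulimit core=0 -it pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime bash

# Option 2: inside the container, disable core dumps for the current shell
# before launching the Tune script (the script name below is hypothetical).
ulimit -c 0
python your_tune_script.py
```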

MrRace commented 3 years ago

Great catch! Can you try this: https://stackoverflow.com/questions/58704192/how-to-disable-core-file-dumps-in-docker-container

So should I just disable the core file? Is there some other underlying risk?

richardliaw commented 3 years ago

No security risk afaik. Did it work?

MrRace commented 3 years ago

No security risk afaik. Did it work?

Yes, it works. Thanks a lot!

richardliaw commented 3 years ago

Awesome!