tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

Benchmark hangs for non-synthetic data #17

Closed ghost closed 7 years ago

ghost commented 7 years ago

I tried to run

# VGG16 training ImageNet with 8 GPUs using arguments that optimize for
# Google Compute Engine.
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 \
--batch_size=32 --model=vgg16 --data_dir=/home/ubuntu/flowers \
--variable_update=parameter_server --nodistortions

The data dir contains the TFRecords, generated with Bazel as described in the models/inception/data tutorial:

-rw-rwx--- 1  40 May 11 11:43 labels.txt
drwxrwx--- 7 4096 May 12 11:45 train
-rw-rwx--- 1  102419300 May 11 11:43 train-00000-of-00002
-rw-rwx--- 1   99116804 May 11 11:43 train-00001-of-00002
drwxrwx--- 7  4096 May 12 11:45 validation
-rw-rwx--- 1  16058779 May 11 11:43 validation-00000-of-00002
-rw-rwx--- 1  15919237 May 11 11:43 validation-00001-of-00002

And it hangs like this:

TensorFlow:  1.1
Model:       vgg16
Mode:        training
Batch size:  32 global
             32.0 per device
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
2017-05-12 11:57:30.357629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:900] Found device 0 with properties:
....
pciBusID 0002:01:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
2017-05-12 11:57:30.357680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:921] DMA: 0
2017-05-12 11:57:30.357690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:931] 0:   Y
2017-05-12 11:57:30.357707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0002:01:00.0)

But with synthetic data it works. Any idea how to fix this?

tfboyd commented 7 years ago

I believe the script expects a certain number of files. We only test with ImageNet and I think someone reported to me recently that it hangs if it runs out of records. I did not test with Flowers so it is very possible something does not match up.

tfboyd commented 7 years ago

Flowers is most likely not working and we are removing it. Sorry for including it as an option. It is not something we used for testing.

ghost commented 7 years ago

So what small dataset can I use for testing?


cwhipkey commented 7 years ago

You can run against fewer shards of ImageNet.

tfboyd commented 7 years ago

If ImageNet did not require agreeing to non-commercial use (or something like that), I would send you a link to some already-processed files. For example, you could just take:

train-00119-of-01024 through train-00300-of-01024 rather than the entire set. Unless you already have them processed and stored somewhere, you still need to run through the process to create them. It takes about half a day on a fast connection and ~500GB or less if you delete as you go.
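Picking out a contiguous range of shards like this can be scripted. The following is a minimal sketch, assuming the shard files follow the `<subset>-NNNNN-of-NNNNN` naming seen in this thread; `select_shards` is a hypothetical helper, not part of tf_cnn_benchmarks.

```python
import re

# Matches names such as "train-00119-of-01024" (naming assumed from the thread).
SHARD_RE = re.compile(r"^(?P<subset>train|validation)-(?P<index>\d{5})-of-(?P<total>\d{5})$")


def select_shards(filenames, subset, first, last):
    """Return the sorted shard names of `subset` whose index is in [first, last]."""
    selected = []
    for name in filenames:
        match = SHARD_RE.match(name)
        if not match or match.group("subset") != subset:
            continue  # skip labels.txt, directories, and other subsets
        if first <= int(match.group("index")) <= last:
            selected.append(name)
    return sorted(selected)


# Simulated directory listing; in practice this would come from os.listdir(data_dir).
names = ["train-%05d-of-01024" % i for i in range(1024)] + ["labels.txt"]
picked = select_shards(names, "train", 119, 300)
print(len(picked), picked[0], picked[-1])
```

The returned names can then be copied into a scratch directory and passed as `--data_dir`.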


ghost commented 7 years ago

I don't need the link; I already have the data downloaded and preprocessed, many thanks. But my question now is: since @Toby said "I believe the script expects a certain number of files", will it work if I take only a subset of the records?


tfboyd commented 7 years ago

I did not know anything about Flowers and was making a guess. I am fairly sure I have tested with a few files in the past. I will give it a try today, doing what you would do: copy a few files to my /tmp folder, set that as my data_dir, and run a few hundred iterations.


ghost commented 7 years ago

Many thanks, please keep me updated.


tfboyd commented 7 years ago

/imagenet$ mkdir /tmp/image_net
/imagenet$ cp train-001* /tmp/image_n
python tf_cnn_benchmarks.py --data_dir=/tmp/image_net/ --model=inception3 --batch_size=32 --data_name=imagenet --num_batches=1000

I actually stopped the copy at file 148, so I have ~48 "chunks". I ran 1000 batches.
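As a rough sanity check on how much of those ~48 shards a 1000-batch run touches, the arithmetic can be sketched as below. The figures are assumptions not stated in the thread: 1,281,167 training images in ILSVRC-2012, split roughly evenly across the 1024 shards, and a single device so that 32 is also the global batch size.

```python
# Assumed dataset figures (not taken from the thread).
TOTAL_TRAIN_IMAGES = 1281167
TOTAL_SHARDS = 1024


def passes_over_subset(num_shards, batch_size, num_batches):
    """Estimate how many times a run reads through `num_shards` shards."""
    images_available = num_shards * TOTAL_TRAIN_IMAGES / TOTAL_SHARDS
    images_consumed = batch_size * num_batches
    return images_consumed / images_available


fraction = passes_over_subset(num_shards=48, batch_size=32, num_batches=1000)
print(round(fraction, 2))
```

Under these assumptions the run consumes 32,000 images out of roughly 60,000 available, so it covers about half of one pass over the copied files.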


ghost commented 7 years ago

I will try exactly what you did and check whether it works, though it should, since the code has not been modified. Can you tell me the hash of your current commit, so I can make sure I run the same version? Also, any idea why it wouldn't work for Flowers? It should be data independent, no?


tfboyd commented 7 years ago

I am running my local copy, which I know is not exact, but I am really confident that in this case it does not matter. I pushed a fix for the VGG issue last week, so I would just pull master. We never tested with Flowers and it was a mistake to leave it there. I have never used Flowers, so I just don't know. If using ImageNet, I suspect you could run with one "chunk" and be fine. I am virtually certain there are "lazy" team members who test with just a few files, because they are working on performance and not specifically testing accuracy or convergence.


digshock commented 7 years ago

It's a RecordInput bug.

https://github.com/tensorflow/tensorflow/issues/11396

Reduce buffer_size in the call below, e.g. buffer_size=1000, and it works.

      record_input = data_flow_ops.RecordInput(
          file_pattern=dataset.tf_record_pattern(subset),
          seed=301,
          parallelism=64,
          buffer_size=10000,
          batch_size=self.batch_size,
          name='record_input')
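One way to pick a smaller buffer_size is to count the records that are actually on disk. The sketch below is not from tf_cnn_benchmarks; it assumes (the linked issue is not quoted here) that the hang appears when buffer_size exceeds the number of available records. It relies only on the TFRecord framing: an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC. `count_tfrecords` and `capped_buffer_size` are hypothetical helper names.

```python
import os
import struct
import tempfile

DEFAULT_BUFFER_SIZE = 10000  # the value in the RecordInput call shown in this thread


def count_tfrecords(path):
    """Count records by walking the length-prefixed framing (CRCs are not verified)."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
            count += 1
    return count


def capped_buffer_size(paths, default=DEFAULT_BUFFER_SIZE):
    """Return `default`, capped at the total number of records in `paths`."""
    total = sum(count_tfrecords(p) for p in paths)
    return max(1, min(default, total))


# Demo on a tiny synthetic file with dummy CRC bytes (this helper ignores CRCs).
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "train-00000-of-00001")
    with open(shard, "wb") as f:
        for payload in (b"a", b"bb", b"ccc"):
            f.write(struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4)
    demo_count = count_tfrecords(shard)
    demo_buffer = capped_buffer_size([shard])
print(demo_count, demo_buffer)
```

The resulting value could be passed as buffer_size in place of the hard-coded 10000. A truncated final record is not detected by this counter.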

tfboyd commented 7 years ago

Thank you @digshock, and thanks for linking the related issue in core TF.