tensorflow / models

Models and examples built with TensorFlow

Running Resnet50 model in a secured cluster #9094

Closed: ashiqimranintel closed this issue 4 years ago

ashiqimranintel commented 4 years ago

I am trying to run the ResNet50 model using local TFRecords from a directory. The model runs on a secured cluster with no internet connection, and I couldn't figure out how to avoid the following error message:

ConnectionError: Failed to construct dataset imagenet2012HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /tfds-data/?prefix=dataset_info/imagenet2012/5.0.0/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b4997e41310>: Failed to establish a new connection: [Errno -2] Name or service not known')

I'm looking for help running the ResNet50 model in an environment with no internet connection.
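One way to confirm that the failure comes from the missing network rather than from the TFRecords themselves is to check whether the host named in the error, storage.googleapis.com, can be reached at all; the `[Errno -2] Name or service not known` in the traceback is a DNS lookup failure. Below is a minimal sketch using only the Python standard library. `gcs_reachable` is a hypothetical helper, not part of TensorFlow Models or TFDS.

```python
import socket


def gcs_reachable(host: str = "storage.googleapis.com", port: int = 443,
                  timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: on a cluster without internet access, DNS resolution
    fails (e.g. "[Errno -2] Name or service not known") or the connection
    times out, and this returns False instead of raising.
    """
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror (DNS failure) and socket.timeout are OSError subclasses.
        return False
```

Calling `gcs_reachable()` on the machine that runs training returns `False` when that machine has no route to GCS, which is consistent with the error above.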

saikumarchalla commented 4 years ago

@ashiqimranintel Could you please fill out the issue template? Also, please provide the top-level directory of the model you are using. Thanks!

ashiqimranintel commented 4 years ago

I am using models/official/vision/image_classification -> classification_trainer.py, and I am trying to use TFRecords, not TFDS. Here are the logs I see when I run classification_trainer.py:

I0812 15:47:06.900887 139912474634048 dataset_factory.py:358] Using TFRecords to load data.
I0812 15:47:07.527647 139912474634048 dataset_factory.py:358] Using TFRecords to load data.
I0812 15:47:08.045007 139912474634048 dataset_info.py:430] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
I0812 15:47:09.308397 139912474634048 dataset_info.py:361] Load dataset info from /tmp/tmpa2btiyy0tfds

My question is: how can I avoid the last two lines when I am using TFRecords? dataset_info.py is trying to fetch DatasetInfo over the internet.

allenwang28 commented 4 years ago

Please try updating to this commit: https://github.com/tensorflow/models/commit/8a10870632f6d4f008329fb15cd6af9a15a4a7ff

This should bypass any calls to TFDS if using the TFRecord pipeline with ImageNet.
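The idea of bypassing TFDS can be sketched as follows: consult TFDS metadata only when the input pipeline actually reads through TFDS, and skip it for the TFRecord path. This is a hypothetical illustration, not the code from the linked commit. `DatasetConfig`, its `builder` field, and `maybe_load_info` are assumed names, and the TFDS lookup is passed in as a callable so the sketch runs without TFDS installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DatasetConfig:
    # Hypothetical config: "tfds" reads through TFDS, "records" reads local TFRecords.
    name: str = "imagenet2012"
    builder: str = "records"
    data_dir: Optional[str] = None


def maybe_load_info(config: DatasetConfig,
                    load_tfds_info: Callable[[str], Any]) -> Optional[Any]:
    """Return TFDS dataset info only when the TFDS builder is in use."""
    if config.builder != "tfds":
        # TFRecord (or other non-TFDS) input: never touch TFDS, so no network call.
        return None
    return load_tfds_info(config.name)
```

With `builder="records"`, the lookup callable is never invoked, so an offline cluster never tries to reach GCS.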

ashiqimranintel commented 4 years ago

Still getting the same issue:
I0812 15:47:08.045007 139912474634048 dataset_info.py:430] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
I0812 15:47:09.308397 139912474634048 dataset_info.py:361] Load dataset info from /tmp/tmpa2btiyy0tfds

Any more suggestions?

allenwang28 commented 4 years ago

Hmm, I guess the change I submitted still calls TFDS; the call eventually fails with a ConnectionError, and the code then continues.

Does deleting this entire portion help? https://github.com/tensorflow/models/blob/master/official/vision/image_classification/dataset_factory.py#L273-L280
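The behavior described above, where the TFDS lookup is still attempted but a connection failure is tolerated, can be sketched as a try/except around the metadata call. `load_info_or_none` is a hypothetical helper, not code from `dataset_factory.py`. It catches `OSError`, which covers Python's built-in `ConnectionError`; whether the exception TFDS raises in this situation is such a subclass is an assumption of the sketch.

```python
import logging
from typing import Any, Callable, Optional


def load_info_or_none(name: str,
                      load_tfds_info: Callable[[str], Any]) -> Optional[Any]:
    """Try to fetch dataset metadata; return None if the network is unavailable.

    Hypothetical helper. Assumes the loader signals a network failure with an
    OSError subclass such as the built-in ConnectionError.
    """
    try:
        return load_tfds_info(name)
    except OSError as err:
        logging.warning("Could not load dataset info for %s: %s", name, err)
        # Callers must not dereference the result (e.g. info.features) when None.
        return None
```

Returning `None` lets the pipeline continue, but any later code that reads `info.features` then fails with an `AttributeError` like the one reported in this thread, unless every value that would come from the metadata is set explicitly.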

ashiqimranintel commented 4 years ago

If I delete the entire portion, it breaks:

AttributeError: 'NoneType' object has no attribute 'features'

allenwang28 commented 4 years ago

Could you provide the command you're using to run? This should only happen if any of num_channels, num_examples, batch_size or image_size are set to 'infer': https://github.com/tensorflow/models/blob/master/official/vision/image_classification/configs/examples/resnet/imagenet/gpu.yaml#L8-L20
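The link between 'infer' and the dataset metadata can be sketched as follows: a field set to 'infer' is filled in from the metadata, so when the metadata is `None` (as when running offline) the lookup fails, whereas explicit values need no lookup at all. `resolve_infer_fields` and the metadata dict layout are hypothetical. The values below (3 channels, 1,281,167 training examples for ImageNet-2012, batch size 256, 224-pixel images) are illustrative only and should be checked against your own config and data.

```python
from typing import Any, Dict, Optional

# Field names from the thread; any of them set to "infer" needs dataset metadata.
INFERABLE_FIELDS = ("num_channels", "num_examples", "batch_size", "image_size")


def resolve_infer_fields(config: Dict[str, Any],
                         info: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Replace 'infer' values with entries from dataset metadata (hypothetical)."""
    resolved = dict(config)
    for field in INFERABLE_FIELDS:
        if resolved.get(field) != "infer":
            continue  # explicit (or absent) value: no metadata needed
        if info is None:
            raise ValueError(
                f"{field}='infer' requires dataset metadata, which is unavailable "
                "offline; set it explicitly in the config")
        resolved[field] = info[field]
    return resolved


# Illustrative explicit values: these resolve without any metadata lookup.
offline_config = {
    "num_channels": 3,
    "num_examples": 1281167,  # ImageNet-2012 training split
    "batch_size": 256,
    "image_size": 224,
}
```

Raising a `ValueError` that names the offending field is a design choice of this sketch; it makes the offline failure easier to diagnose than an `AttributeError` on `None`.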

ashiqimranintel commented 4 years ago

Bingo, it worked. Thank you so much!!!