Running test_imagenet_model.py very slowly

momo1986 commented 4 years ago

Hello, dear thu-ml team.

Thanks for your program.

I am trying to run test_imagenet_model.py on the ImageNet dataset with an L∞ attack.

The program runs very slowly.

I noticed that requirement.txt specifies "tensorflow=1.15.4", not "tensorflow-gpu=1.15.4".

I installed tensorflow-gpu 1.15.4 and confirmed that tensorflow.test.is_gpu_available() returns True.

I also noticed:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

My questions are:

1) What else should I do when running a GPU program on ImageNet?

2) I added adversarial attack code in: https://github.com/thu-ml/realsafe/blob/master/realsafe/dataset/imagenet.py

def _load_image(filename, to_height, to_width, clip):
    ''' Load image into uint8 tensor from file. '''
    img = Image.open(os.path.join(PATH_IMGS, filename))
    if img.mode != 'RGB':
        img = img.convert(mode='RGB')

    if clip:
        img = np.array(img)
        height, width = img.shape[0], img.shape[1]  # pylint: disable=E1136  # pylint/issues/3139
        center = int(0.875 * min(height, width))
        offset_height, offset_width = (height - center + 1) // 2, (width - center + 1) // 2
        img = img[offset_height:offset_height+center, offset_width:offset_width+center, :]
        img = Image.fromarray(img)

    return tf.convert_to_tensor(np.array(img.resize((to_height, to_width))))

My modification sits after "img = img.convert(mode='RGB')" and before "if clip". Would this modification, which runs before the tensor conversion and placeholder filling, slow the program down?
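To make the crop arithmetic in _load_image concrete, here is a small sketch of the same central-crop computation in plain Python. The function name central_crop_box is hypothetical and not part of realsafe; the 0.875 fraction and the rounding follow the code quoted above.

```python
def central_crop_box(height, width, fraction=0.875):
    """Return (offset_height, offset_width, size) of the square central crop."""
    # The crop is a square whose side is a fraction of the shorter image side.
    size = int(fraction * min(height, width))
    # Center the square; the +1 matches the rounding used in _load_image.
    offset_height = (height - size + 1) // 2
    offset_width = (width - size + 1) // 2
    return offset_height, offset_width, size


# A 375x500 (height x width) image keeps a 328x328 square.
print(central_crop_box(375, 500))  # (24, 86, 328)
```

This arithmetic is a handful of integer operations per image, so the crop step itself should be cheap compared with decoding and resizing.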

Thanks & Regards! Momo

Fugoes commented 4 years ago

Thank you for reporting the problem!

> The program runs very slowly.

This is because of some behavior changes between tensorflow-1.15.2 and tensorflow-1.15.4. The current code uses PIL to load images, which does not play well with the newer version of TensorFlow. The current code is tested under tensorflow-1.15.2. I bumped the version recently because of a security problem in the older version of TensorFlow. Sorry for the inconvenience.

You could downgrade TensorFlow to 1.15.2, or wait for a simple fix from me soon.

> What else should I do when running a GPU program on ImageNet?

Some ImageNet models are quite large, so you might need a GPU with enough memory (~10G).

momo1986 commented 4 years ago

Hello, @Fugoes Fu.

Since "Loading tf Imagenet-pretrained model" needs a lot of time and memory, is there a way to see the loading progress, and to tell whether the loading is blocked or has failed?

I work on a 2080 Ti with 10.8G of memory, so running the program efficiently matters when using realsafe.

Could you give some tips?

Thanks & Regards! Momo

momo1986 commented 4 years ago

I ask because the program sometimes stops at:

Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-10-13 03:34:25.210964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-10-13 03:34:25.211126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-13 03:34:25.211159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-10-13 03:34:25.211179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-10-13 03:34:25.983325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9929 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:08:00.0, compute capability: 7.5)

Without further log output, this is inconvenient to debug.
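One generic way to get more information out of a step that appears to hang is Python's standard faulthandler module, which can dump the stack of every thread after a timeout. The sketch below is not part of realsafe: the helper name watch_step and its parameters are hypothetical, and it only assumes that the slow step is called from Python code you can wrap.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def watch_step(name, hang_after_s=60.0, trace_file=None):
    """Log the start and end of a step; dump thread stacks if it runs too long."""
    trace_file = sys.stderr if trace_file is None else trace_file
    print(f"[start] {name}", flush=True)
    started = time.monotonic()
    # Every hang_after_s seconds, write the traceback of all threads to trace_file.
    faulthandler.dump_traceback_later(hang_after_s, repeat=True, file=trace_file)
    status = "failed"
    try:
        yield
        status = "done"
    finally:
        # Stop the watchdog and report how the step ended.
        faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - started
        print(f"[{status}] {name} after {elapsed:.1f}s", flush=True)
```

Wrapping the model-loading call in `with watch_step("load model"):` would then print a traceback roughly once a minute while the call is stuck, which shows where the program is waiting.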

Thanks & Regards!

Fugoes commented 4 years ago

> Without further log output, this is inconvenient to debug.

This behavior is quite strange. Does the NVIDIA driver work fine? Does sudo dmesg print any errors about NVIDIA?

momo1986 commented 4 years ago

totalMemory: 10.76GiB freeMemory: 10.21GiB

This is the computing power I have available. Is it too demanding to load TensorFlow's InceptionV3 and Ensemble InceptionV3 on ImageNet?

momo1986 commented 4 years ago

Here is the NVIDIA log:

2020-10-13 04:02:41.126940: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-10-13 04:02:41.142369: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2099940000 Hz
2020-10-13 04:02:41.146588: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x555a0826bdc0 executing computations on platform Host. Devices:
2020-10-13 04:02:41.146661: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2020-10-13 04:02:42.281625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:08:00.0
totalMemory: 10.76GiB freeMemory: 10.21GiB
2020-10-13 04:02:42.281688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-10-13 04:02:42.282785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-13 04:02:42.282812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-10-13 04:02:42.282823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-10-13 04:02:42.282930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9929 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:08:00.0, compute capability: 7.5)
2020-10-13 04:02:42.286556: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x555a08acead0 executing computations on platform CUDA. Devices:
2020-10-13 04:02:42.286614: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
WARNING:tensorflow:From /usr/local/miniconda3/envs/dl10/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/miniconda3/envs/dl10/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-10-13 04:02:55.418225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-10-13 04:02:55.418367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-13 04:02:55.418398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-10-13 04:02:55.418418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-10-13 04:02:56.162556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9929 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:08:00.0, compute capability: 7.5)

Fugoes commented 4 years ago

> totalMemory: 10.76GiB freeMemory: 10.21GiB

> This is the computing power I have available. Is it too demanding to load TensorFlow's InceptionV3 and Ensemble InceptionV3 on ImageNet?

Should work fine with a reasonable batch size (~50 or 100).

momo1986 commented 4 years ago

The current batch size is 10. Is that not suitable? @Fugoes

Fugoes commented 4 years ago

> The current batch size is 10. Is that not suitable? @Fugoes

Should work fine for all ImageNet pre-trained models in RealSafe.

Fugoes commented 4 years ago

I fixed this performance degradation when loading the dataset in https://github.com/thu-ml/realsafe/commit/39f632e950562fa00ac26d34d13b2691c9c5f013. Check the commit message for why it happens :).

momo1986 commented 4 years ago

Thanks. I also found a clue: it works fine when I run "python test_imagenet_models.py" alone, but it blocks when I run "python test_imagenet_models.py | tee debug.log". It looks like the pipe operation is restricted. Is there any workaround for saving the run log when running big models in realsafe?

Thanks & Regards!

Fugoes commented 4 years ago

> It works fine when I run "python test_imagenet_models.py" alone, but it blocks when I run "python test_imagenet_models.py | tee debug.log". It looks like the pipe operation is restricted. Is there any workaround for saving the run log when running big models in realsafe?

When Python's output goes to a pipe, Python enables buffered output for both stdout and stderr. To disable this behaviour, use python -u instead.
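If changing the command line is not convenient, a similar effect can be had from inside the script by flushing after each message. The helper below is only an illustration; the name log is hypothetical and is not a realsafe function.

```python
import sys


def log(message, stream=None):
    """Write one line and flush at once so it appears promptly through a pipe."""
    stream = sys.stdout if stream is None else stream
    print(message, file=stream, flush=True)


log("loading model...")
```

With `python -u test_imagenet_models.py | tee debug.log`, or with the environment variable PYTHONUNBUFFERED=1 set, ordinary print calls reach tee promptly without such a helper.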

momo1986 commented 4 years ago

Hello, @Fugoes Fu.

Thanks for your tips.

Also, which TensorFlow version is recommended for realsafe? It might be better to give an official version range.

Thanks & Regards!

Fugoes commented 4 years ago

> Thanks for your tips.

You are welcome.

> Also, which TensorFlow version is recommended for realsafe? It might be better to give an official version range.

Thanks for your advice. We suggest tensorflow>=1.13 (see README.md).

I will close the issue.

thu-ml / ares

Running test_imagenet_model.py very slowly #6