microsoft / CameraTraps

PyTorch Wildlife: a Collaborative Deep Learning Framework for Conservation.
https://cameratraps.readthedocs.io/en/latest/
MIT License

MegaDetector: broken data stream when reading image file / cannot join current thread #104

Closed bencevans closed 5 years ago

bencevans commented 5 years ago

Before I start digging into this further, has anyone come across the following problem? I've run the detector twice and got the same result both times, so I'm thinking along the lines of a corrupt file or a faulty disk.

Potentially a duplicate of #94, but that issue doesn't contain any logs, so I'm unsure.

$ PYTHONPATH=$PYTHONPATH:$(pwd) python3 detection/run_tf_detector_batch.py --recursive --forceCpu --checkpointFrequency 1000 --outputRelativeFilenames ./detection/megadetector_v3.pb ../borneo-dataset/release/0.5/SAFE/SAFE_2/ ../Borneo-0.5-SAFE2.txt

tensorflow tf version: 1.14.0                                     
2019-10-03 12:28:29.084524: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2019-10-03 12:28:29.381379: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400065000 Hz
2019-10-03 12:28:29.386639: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d43f60 executing computations on platform Host. Devices:
2019-10-03 12:28:29.386727: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
tf_detector.py, tf.test.is_gpu_available: False                                               
WARNING: Logging before flag parsing goes to stderr.                            
W1003 12:28:29.404660 140666144171840 deprecation_wrapper.py:119] From detection/run_tf_detector_batch.py:51: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W1003 12:28:29.405052 140666144171840 deprecation_wrapper.py:119] From detection/run_tf_detector_batch.py:51: The name tf.logging.ERROR is deprecated. Please use tf.compat.v1.logging.ERROR instead.

Running detector on 57170 images                       
Loading model...                                                                                             
tf_detector.py: Loading graph...                                                
tf_detector.py: Detection graph loaded.                     
Loaded model in 15.1 seconds                                                                                  
Running detector...                     
0it [00:00, ?it/s]2019-10-03 12:28:55.971887: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-10-03 12:29:01.395446: W tensorflow/core/framework/allocator.cc:107] Allocation of 377318400 exceeds 10% of system memory.
2019-10-03 12:29:01.487368: W tensorflow/core/framework/allocator.cc:107] Allocation of 377318400 exceeds 10% of system memory.
2019-10-03 12:29:01.982219: W tensorflow/core/framework/allocator.cc:107] Allocation of 99878400 exceeds 10% of system memory.
2019-10-03 12:29:02.555869: W tensorflow/core/framework/allocator.cc:107] Allocation of 159744000 exceeds 10% of system memory.
2019-10-03 12:29:02.715443: W tensorflow/core/framework/allocator.cc:107] Allocation of 159744000 exceeds 10% of system memory.
Checkpointing 1 images to /tmp/detector_batch/tmpud3njeui......done
1000it [2:11:24,  7.87s/it]Checkpointing 1001 images to /tmp/detector_batch/tmp6wgeh416......done                                                                       
2000it [4:23:28,  7.90s/it]Checkpointing 2001 images to /tmp/detector_batch/tmp_6tnl91e......done                                                                       
3000it [6:36:44,  8.07s/it]Checkpointing 3001 images to /tmp/detector_batch/tmpnwm51sek......done                                                       
3344it [7:23:24,  8.08s/it]Traceback (most recent call last):                                                                  
  File "detection/run_tf_detector_batch.py", line 559, in <module>                                                             
    main()                                                                                                                    
  File "detection/run_tf_detector_batch.py", line 554, in main                                                                 
    load_and_run_detector(options)                                                                                             
  File "detection/run_tf_detector_batch.py", line 437, in load_and_run_detector
    boxes,scores,classes,imageFileNames = generate_detections(detector,imageFileNames,options)   
  File "detection/run_tf_detector_batch.py", line 167, in generate_detections                    
    imageNP = PIL.Image.open(image).convert("RGB"); imageNP = np.array(imageNP)                  
  File "/home/bencevans/.local/lib/python3.6/site-packages/PIL/Image.py", line 912, in convert
    self.load()                                                   
  File "/home/bencevans/.local/lib/python3.6/site-packages/PIL/ImageFile.py", line 261, in load
    raise_ioerror(err_code)                                   
  File "/home/bencevans/.local/lib/python3.6/site-packages/PIL/ImageFile.py", line 58, in raise_ioerror
    raise IOError(message + " when reading image file")                        
OSError: broken data stream when reading image file                                           
Exception ignored in: <bound method tqdm.__del__ of 3344it [7:23:25,  8.08s/it]>
Traceback (most recent call last):                                             
  File "/home/bencevans/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __del__
    self.close()
  File "/home/bencevans/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1133, in close 
    self._decr_instances(self)
  File "/home/bencevans/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
    cls.monitor.exit()                                 
  File "/home/bencevans/.local/lib/python3.6/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()                                                                 
  File "/usr/lib/python3.6/threading.py", line 1053, in join
    raise RuntimeError("cannot join current thread")                                                          
RuntimeError: cannot join current thread
bencevans commented 5 years ago

Running on master at 2f4f5a42807a71abecafa08bf8b49b052efd0e16

amritagupta commented 5 years ago

Yes, I've faced this before due to a corrupt file. I usually wrap the PIL.Image.open line in a try/except and resume the detector from a saved checkpoint to pick up where it left off.
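The workaround described above can be sketched as follows; `load_image_safe` is a hypothetical helper name for illustration, not code from this repo:

```python
from PIL import Image


def load_image_safe(path):
    """Open an image and convert it to RGB, returning None if the file
    is corrupt (e.g. a truncated JPEG raising 'broken data stream')."""
    try:
        return Image.open(path).convert("RGB")
    except OSError as e:
        print(f"Warning: could not read {path}: {e}")
        return None


# In the batch loop, skip images that failed to load:
# for path in image_paths:
#     img = load_image_safe(path)
#     if img is None:
#         continue
#     ...run the detector on np.array(img)...
```

Catching `OSError` covers both the `IOError` raised by older PIL versions and Pillow's `UnidentifiedImageError`, which subclasses it.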


agentmorris commented 5 years ago

Can you pull the latest master and try again? run_tf_detector_batch was updated last week to wrap both image loading and inference in a try/except, with reasonable behavior for failed images (a warning is printed and no output is generated for that image).
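A minimal sketch of that behavior, assuming a per-image loop; the function and parameter names here are illustrative, not the repo's actual code:

```python
from PIL import Image


def run_detector_on_images(image_paths, detect_fn):
    """Run detect_fn on each image. On any failure (unreadable file,
    inference error), print a warning and produce no output entry for
    that image, so one bad file cannot abort a long batch run."""
    results = {}
    for path in image_paths:
        try:
            img = Image.open(path).convert("RGB")
            results[path] = detect_fn(img)
        except Exception as e:
            print(f"Warning: skipping {path}: {e}")
    return results
```

Wrapping inference as well as loading means a decode error surfacing inside the model call (rather than in `Image.open`) is also handled.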

bencevans commented 5 years ago

Thanks @amritagupta & @agentmorris. The try/except worked, and I've now checked the current master, which works as expected bar a slight formatting issue that is addressed in #111.