golemfactory / clay

Golem is creating a global market for computing power.
https://golem.network
GNU General Public License v3.0
2.91k stars 286 forks source link

cannot identify image file - current_task_preview #3245

Closed ederenn closed 5 years ago

ederenn commented 6 years ago

Encountered on:

Mac OS GOLEM Version: 0.17.0+dev99.g6d235f1

during computations of a task: helicopter.blend 400x400 10 frames 30 subtasks, 4 subtask had been left uncomputed

in golem.log

2018-08-28 18:53:36 ERROR    golem.client                        task_server.sync_network failed
Traceback (most recent call last):
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/client.py", line 1485, in _run
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 140, in sync_network
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 881, in __remove_old_tasks
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskmanager.py", line 787, in check_timeouts
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/core/common.py", line 147, in func_wrapper
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 131, in computation_failed
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 452, in __mark_sub_frame
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 354, in _open_frame_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/PIL/Image.py", line 2519, in open
OSError: cannot identify image file '/Users/a/Library/Application Support/golem/default/mainnet/res/64301ada-aae1-11e8-837b-7c3ec369ac11/tmp/current_task_preview1.PNG'
018-08-28 18:53:38 ERROR    golem.client                        task_server.sync_network failed
Traceback (most recent call last):
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/client.py", line 1485, in _run
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 140, in sync_network
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 881, in __remove_old_tasks
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskmanager.py", line 787, in check_timeouts
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/core/common.py", line 147, in func_wrapper
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 131, in computation_failed
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 452, in __mark_sub_frame
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 354, in _open_frame_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/PIL/Image.py", line 2519, in open
OSError: cannot identify image file '/Users/a/Library/Application Support/golem/default/mainnet/res/64301ada-aae1-11e8-837b-7c3ec369ac11/tmp/current_task_preview8.PNG'
2018-08-28 18:54:00 ERROR    golem.client                        task_server.sync_network failed
Traceback (most recent call last):
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/client.py", line 1485, in _run
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 140, in sync_network
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskserver.py", line 881, in __remove_old_tasks
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskmanager.py", line 787, in check_timeouts
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/core/common.py", line 147, in func_wrapper
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 131, in computation_failed
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 452, in __mark_sub_frame
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 354, in _open_frame_preview
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/PIL/Image.py", line 2519, in open
OSError: cannot identify image file '/Users/a/Library/Application Support/golem/default/mainnet/res/64301ada-aae1-11e8-837b-7c3ec369ac11/tmp/current_task_preview4.PNG'
018-08-28 18:54:03 CRITICAL twisted                             An exception was raised from application code while processing a reactor selectable
Traceback (most recent call last):
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/python/log.py", line 103, in callWithLogger

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/python/log.py", line 86, in callWithContext

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/python/context.py", line 122, in callWithContext

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/python/context.py", line 85, in callWithContext

--- <exception caught here> ---
  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/internet/kqreactor.py", line 293, in _doWriteOrRead

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/internet/tcp.py", line 205, in doRead

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/twisted/internet/tcp.py", line 211, in _dataReceived

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/network/transport/tcpnetwork.py", line 340, in dataReceived

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/network/transport/tcpnetwork.py", line 366, in _interpret

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/tasksession.py", line 123, in interpret

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/network/transport/session.py", line 76, in interpret

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/tasksession.py", line 485, in _react_to_want_to_compute_task

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/golem/task/taskmanager.py", line 396, in get_next_subtask

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/blender/task/blenderrendertask.py", line 484, in query_extra_data

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 452, in __mark_sub_frame

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/apps/rendering/task/framerenderingtask.py", line 354, in _open_frame_preview

  File "/Users/buildbot-worker/worker/buildpackage_macOS/build/.venv/lib/python3.6/site-packages/PIL/Image.py", line 2519, in open

builtins.OSError: cannot identify image file '/Users/a/Library/Application Support/golem/default/mainnet/res/64301ada-aae1-11e8-837b-7c3ec369ac11/tmp/current_task_preview3.PNG'

from res folder:

current_task_preview1

current_task_preview3

current_task_preview4

current_task_preview8

current_task_preview9

Elfoniok commented 6 years ago

This bug is caused because preview file is handled in two separate threads.

/golem/task/taskserver.py", line 140, in sync_network /golem/task/taskserver.py", line 881, in __remove_old_tasks /golem/task/taskmanager.py", line 787, in check_timeouts /golem/core/common.py", line 147, in func_wrapper /apps/rendering/task/framerenderingtask.py", line 131, in computation_failed /apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview

and

/golem/network/transport/tcpnetwork.py", line 366, in _interpret
/golem/task/tasksession.py", line 123, in interpret /golem/network/transport/session.py", line 76, in interpret /golem/task/tasksession.py", line 485, in _react_to_want_to_compute_task /golem/task/taskmanager.py", line 396, in get_next_subtask /apps/blender/task/blenderrendertask.py", line 484, in query_extra_data /apps/rendering/task/framerenderingtask.py", line 338, in _update_frame_task_preview

Preview file can be corupted this way, and PIL yields that it can not identify image file.

In general removing tasks in Golem is very poorl handled because of python easier to ask for forgiveness than permission guidline, instead of ensuring that it is safe to delete task/subtask we just delete it and then handle lot of errors that it causes everywhere in datastructures of Golem.

We can go that way further and just provide proper locking of this file, or make it right this time.

shadeofblue commented 5 years ago

hmm... maybe the more interesting question is: why does __remove_old_tasks touch the same task that another node just issued a WantToComputeTask for (which suggests the task is still being computed)...

etam commented 5 years ago

hmm... maybe the more interesting question is: why does __remove_old_tasks touch the same task that another node just issued a WantToComputeTask for (which suggests the task is still being computed)...

Because it's marking a subtask as timed out. Yeah, it's not something you wold expect in __remove_old_tasks.

etam commented 5 years ago

So, every LoopingCallService runs in a separate thread. Can we move to Rust quicker? There a data race is a compilation error.

etam commented 5 years ago

We're fixing a bigger issue here: #3660