smartin015 / continuousprint

OctoPrint plugin to create a print queue that prints, clears the bed, and then prints again

Editing LAN job on machine other than the one it was created on causes 500 error #151

simoncrabbuk commented 1 year ago

If one tries to edit a LAN job on a machine that didn't create it, the API throws a 500 error.

To reproduce:

1. Create a LAN job on machine A
2. Edit the LAN job on machine B
3. Error 500 is thrown
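
For reference, the failing call can also be triggered directly against the REST endpoint. A minimal sketch in Python (the host, API key, and every payload field except id and queue are placeholders or guesses; the endpoint path is confirmed by the traceback later in this thread):

import requests

# Hypothetical reproduction from machine B; replace the URL and key with real
# values. Whether the endpoint expects JSON or form data isn't confirmed here.
resp = requests.post(
    "http://machine-b.local/plugin/continuousprint/job/edit",
    headers={"X-Api-Key": "YOUR_API_KEY"},  # machine B's OctoPrint API key
    json={"id": 1, "queue": "LAN", "count": 2},  # a job created on machine A
)
print(resp.status_code)  # 500 instead of the edited job's JSON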

Expected behaviour

Either jobs should be editable on all machines, or they shouldn't be; either way, the behaviour should be consistent!

Consider that the current behaviour creates a Master/Slave type arrangement: if the Master goes down, nobody can manage the LAN jobs. I can also appreciate that files would need to be copied everywhere if editing on all machines is required.

Perhaps Master, Backup Master, and Slaves is a reasonable solution: if the Master goes down, the Backup Master can still edit LAN jobs, while Slaves are never able to edit LAN jobs, only service them.

smartin015 commented 1 year ago

Hey Simon, thanks for reporting - this is definitely a bug, as LAN jobs should be editable on any instance connected to the queue (definitely trying for masterless here, although sometimes there can be weird sticking points).

Can you upload a sysinfo bundle so I can see your logs?

simoncrabbuk commented 1 year ago

Here you go.

octoprint-systeminfo-20221117084028.zip

I've seen a similar error on a different 500 bug too, just when trying to load the main screen - state/get & history/get fail and the main screen shows no queues. May be the same bug or a different one.

2022-11-17 08:39:12,678 - octoprint - ERROR - Exception on /plugin/continuousprint/job/edit [POST]
Traceback (most recent call last):
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/server/util/flask.py", line 1575, in decorated_view
    return no_firstrun_access(flask_login.login_required(func))(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/server/util/flask.py", line 1598, in decorated_view
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/api.py", line 67, in cpq_permission_wrapper
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/api.py", line 211, in edit_job
    q = self._get_queue(data["queue"])
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/plugin.py", line 682, in _get_queue
    return self.q.get(name)
AttributeError: 'CPQPlugin' object has no attribute 'q'

smartin015 commented 1 year ago

Thanks for the bundle - the errors you see there are actually follow-ups from an earlier error in initialization:

2022-11-15 11:15:49,811 - octoprint.plugins.continuousprint - INFO - Starting fileshare with address 192.168.1.102:0
2022-11-15 11:15:49,814 - octoprint.plugin - ERROR - Error while calling plugin continuousprint
Traceback (most recent call last):
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/plugin/__init__.py", line 273, in call_plugin
    result = getattr(plugin, method)(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/util/__init__.py", line 1688, in wrapper
    return f(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/__init__.py", line 66, in on_after_startup
    self._plugin.start()
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/plugin.py", line 84, in start
    self._init_fileshare()
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/plugin.py", line 252, in _init_fileshare
    self._fileshare.connect()
  File "/home/pi/oprint/lib/python3.7/site-packages/peerprint/filesharing.py", line 111, in connect
    self.httpd = FileshareServer((self.host, self.port), FileshareRequestHandler)
  File "/usr/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/usr/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 99] Cannot assign requested address

This is a failure to bind the LAN fileshare to its address (Errno 99 means the configured IP isn't assigned to any local interface) - a different error from #154, but related in that it's also a networking error that causes ambiguous failures on startup. I'll look into hardening the init code so we can at least display the error properly.
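
Roughly the shape of the hardening I have in mind (a sketch only, not current plugin code; _set_startup_error and _init_queues are hypothetical names):

# Sketch of guarding _init_fileshare() so a bind failure is surfaced instead
# of leaving the plugin half-initialized (and self.q unset, which is what
# produced the AttributeError in your first traceback).
def start(self):
    try:
        self._init_fileshare()
    except OSError as e:
        # Errno 99 means the configured IP isn't assigned to any local
        # interface, so the fileshare socket can't bind to it.
        self._logger.error(f"Fileshare init failed: {e}")
        self._set_startup_error(str(e))  # hypothetical UI-facing error hook
        return  # skip the rest of startup rather than limp along
    self._init_queues()  # hypothetical; must run so self.q exists for the API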

The address used by the fileshare seems to float around - do you have multiple network interfaces set up on your Pi 4?

simoncrabbuk commented 1 year ago

Rapid investigation, I'm impressed!

I don't have multiple network interfaces, no; it should just be one IP per Pi... I did switch from auto to specifying the IP to see if that would help.

And I've found that I have to recreate queues sometimes to get it to wake back up!

Happy to test, should I move to an rc release channel?

smartin015 commented 1 year ago

Absolutely - please switch over to RC. Much appreciated!

I've got a few improvements in place with #155 that may fix the startup issues for you. Aiming to push that to RC in a few days after I do some UX testing of other changes going out in 2.3.0. Once that's resolved, I can get back to the original issue you reported :)

simoncrabbuk commented 1 year ago

I've been testing on 2.3.0-rc1 for a while with 4 machines on the same LAN queue.

It appears you can now change job settings on any machine. However, if the "master" disappears I see some weirdness: it appears the other machines don't continue printing the next task in the job. I'm not 100% sure of the exact details.

I've not had any 500 errors since rc1 though :-D

smartin015 commented 1 year ago

Thanks for the testing. I'm glad the changes sorted out the 500s and the job editing issues... the master weirdness is a bit concerning; definitely something I'll look at after I've landed auto-slicing. As in the other bug, I'm hoping my refactors of PeerPrint will give us a more stable LAN (and also WAN!) experience.

Closing this out as the original behavior is now resolved; please open up a new issue if you find anything tangible with master connection issues :)

simoncrabbuk commented 1 year ago

Hello! I've just tested this again and it threw a 500 again. I'm running just two machines in the LAN queue today: one master that I created the LAN job on, and another which is running the job. I went to edit the job on the slave machine, just to change the quantity, and it threw a 500 :-( Editing it on the master was no problem.

octoprint-systeminfo-20221212101537.zip

smartin015 commented 1 year ago

Guess there's more work to do then. Thanks for the bundle, will take a look when holiday craziness dies down a bit :)

smartin015 commented 1 year ago

Found the culprit exception:

2022-12-12 10:14:42,053 - octoprint - ERROR - Exception on /plugin/continuousprint/job/edit [POST]
Traceback (most recent call last):
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/pi/oprint/lib/python3.7/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/server/util/flask.py", line 1575, in decorated_view
    return no_firstrun_access(flask_login.login_required(func))(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/octoprint/server/util/flask.py", line 1598, in decorated_view
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/api.py", line 86, in cpq_permission_wrapper
    return func(*args, **kwargs)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/api.py", line 236, in edit_job
    return json.dumps(q.edit_job(data["id"], data))
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/queues/lan.py", line 324, in edit_job
    jid = self.import_job_from_view(j, j.id)
  File "/home/pi/oprint/lib/python3.7/site-packages/continuousprint/queues/lan.py", line 259, in import_job_from_view
    raise ValidationError(err)
continuousprint.queues.lan.ValidationError: validation for job fingerboard failed - file not found at fboard cut 1h30m 19g.gcode (is it stored on disk and not SD?)

Looks like for reasons currently unknown, the remote .gjob file isn't getting pulled to the client machine on which you're making the edit. There's probably a missing step somewhere in the flow there.
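
The fix will probably need to pull the job's files down before validating them. A rough sketch of that shape (every name here is illustrative, not the actual plugin code):

import os

# Hypothetical pre-validation step for the LAN queue's import path;
# fetch_from_peer() and job.files stand in for whatever the real fix uses.
def ensure_files_local(job, fileshare):
    for path in job.files:
        if not os.path.exists(path):
            # The editing peer never downloaded the .gjob contents - pull
            # them from the originating host before path validation runs.
            fileshare.fetch_from_peer(job.hash, path)

Running something like this before import_job_from_view() validates paths on disk would avoid the "file not found" ValidationError above.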

Things to verify on a fix:

smartin015 commented 1 year ago

Okay, 2.3.0rc2 is ready with an attempted fix. @simoncrabbuk, mind testing this candidate on your setup when you have a moment?

simoncrabbuk commented 1 year ago

Not 100% sure, but I think I saw the 500 error when trying to edit jobs on a slave, maybe 16 hours before this system dump was taken.

octoprint-systeminfo-20221222094805.zip

Also later in the dump you'll see some behaviour around the job not continuing on the slave if the master disappears. Perhaps the same bug, the .gjob not getting to where it needs to be?

smartin015 commented 1 year ago

Looks like that dump has ~18h of logs, but the only error I see is related to the one you mentioned of clients not seeing files when the host goes offline (i.e. continuousprint.queues.lan.ValidationError: Cannot resolve set {path} within job hash {hash_}; peer state is None)

Currently, there is no data denormalization across peers in the queue - a peer only fetches a job's files when it's about to need them. If the host of a file happens to be down when that fetch is attempted, the client errors and aborts. For better or worse, the current queue implementation is intended for farms where it's abnormal for OctoPrint hosts to be offline.
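
In other words, the failing flow is roughly this (an illustrative sketch; none of these names come from the real codebase):

class ValidationError(Exception):
    pass  # stands in for continuousprint.queues.lan.ValidationError

def resolve_set(registry, job_hash, set_path):
    peer = registry.lookup(job_hash)  # only the original poster has the file
    if peer is None:
        # Nothing was replicated earlier, so an offline host is fatal here:
        raise ValidationError(
            f"Cannot resolve set {set_path} within job hash {job_hash}; "
            "peer state is None"
        )
    return peer.fetch(set_path)  # the file reaches this client only now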

I plan to improve reliability in the refactor of PeerPrint by using IPFS rather than basic file hosting. With IPFS, files can be distributed quickly and easily, and I don't have to relearn the lessons the IPFS devs have already learned :)
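
For a flavor of what IPFS buys us: once any peer has fetched a file, it can serve that file to everyone else, so a single offline host is no longer fatal. A quick illustration using the third-party ipfshttpclient package against a local IPFS daemon (not PeerPrint code):

import ipfshttpclient

# Requires a running IPFS daemon exposing the default API port (5001).
with ipfshttpclient.connect() as client:
    res = client.add("fboard cut 1h30m 19g.gcode")  # publish the file
    cid = res["Hash"]  # content ID; any peer can fetch by this hash

    # On another machine, the same CID resolves as long as at least one
    # peer (not necessarily the original host) still has the data:
    data = client.cat(cid)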