meerk40t / meerk40t

Hackable Laser software for the K40 Stock-LIHUIYU laser boards.
MIT License

Writer Tweaker Dialog. #110

Closed tatarize closed 3 years ago

tatarize commented 4 years ago

The writer dialog could let you tweak various properties on the fly and view the writer's current state, so a person can adjust processes already in action. It could also permit easy things like popping out of compact mode and clearing the spooler, in case you wanted to abort a job in progress in a more extreme fashion than merely clearing the additional things in the spooler that would go after it.

joerlane commented 4 years ago

I think this might also allow a cranky board reset? I've had trouble with that, even after flipping the power switch. Not sure how to send you an offending job. I believe you might have described seeing the same issue recently when using hardware?

tatarize commented 4 years ago

Yeah, basically the board has compact mode, which runs only at that one speed. You need to exit that mode by sending an "FNSE" and waiting for the board to send a special finished code. Then you are out and can perform the normal operations, including returning to a speed-based mode. If you aborted in a speed mode there's currently no way to correct this. And because of the nature of the mode you can only send complete packets, so if the laser is turned off while in that mode, you've desynced and no longer have a solid guess at the board state. So you can't really control it. As things currently stand, it takes turning off the laser and killing MeerK40t to rectify it.

The boards themselves don't lose their state with the USB connection either. You can set it in compact mode, wait an hour, then send more data and the board is still in compact mode. So you absolutely need to track the state. And when the board is power cycled you have a wrong track of the current state. You'd need to force the writer to have the right state. This is compounded by the controller's queue being able to store fragments of packets. That data might have made sense with the board's previous state. But after a power cycle, the data it's forced to send next is basically senseless and can't possibly run. You'd again need to be able to externally set the state to be correct, or invoke a command that could force the device state into something knowable.
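
A minimal sketch of the state tracking described above, with a manual override for the power-cycle case. `WriterState`, `BoardMode`, and `force_mode` are hypothetical names for illustration and are not MeerK40t's actual classes; the only detail taken from this thread is that leaving compact mode means sending "FNSE" and waiting for the finished code.

```python
from enum import Enum


class BoardMode(Enum):
    UNKNOWN = "unknown"   # e.g. after a power cycle we did not observe
    DEFAULT = "default"   # normal command mode
    COMPACT = "compact"   # the single-speed mode described above


class WriterState:
    """Hypothetical tracker for the mode we *believe* the board is in."""

    def __init__(self):
        self.mode = BoardMode.DEFAULT
        self.sent = []  # stand-in for the real send queue

    def enter_compact(self):
        self.mode = BoardMode.COMPACT

    def exit_compact(self, wait_for_finished):
        """Queue FNSE, then block until the caller's wait says the board finished."""
        if self.mode is not BoardMode.COMPACT:
            return
        self.sent.append("FNSE")
        wait_for_finished()  # the real wait is supplied by the caller
        self.mode = BoardMode.DEFAULT

    def force_mode(self, mode):
        """Manual override for when the board was power cycled behind our back."""
        self.sent.clear()  # queued packet fragments no longer make sense
        self.mode = mode
```

The point of `force_mode` is that the tracked state and the queued fragments are reset together, since either one alone can leave the writer out of sync with the board.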

joerlane commented 4 years ago

I'm not sure if this is immediately helpful or not. It appears I can reproduce the problem by hiding the app in the background during a job. At this point I can't really get it to do anything. Pausing and resuming has no effect in software.

If I were able to manually send commands through MeerK40t at this point; I would be able to recover and continue? Or did I read that wrong?

Screen Shot 2020-01-16 at 6 30 55 PM

tatarize commented 4 years ago

Evidently your last packet set told it to perform a wait command. The FNSEFFFF... command means finish out the queue. The program itself is supposed to wait until it sees a FINISHED status from the controller. If that somehow happened and the program never saw it, it would absolutely get stuck there. Since the program otherwise has no idea whether it's safe to move along, manually telling it that it's safe to move along would recover there.

That would be easily recovered, since it's actually MeerK40t stuck in that case and not the controller. That could just be a bug there, and telling the controller to move along is entirely recoverable. Not sure why it would get stuck there, but it's certainly plausible, if the finish command never showed up. Or if I goofed somewhere.

The data there should be in the queue and not the buffer, since it's intending to bottom out the controller so that it can switch modes to something else.


Also, and it's a bit out of left field: I added the controller return bytes to the view, but I never actually knew what they did, and strangely they are different for you. My numbers are almost always: "255, x, 111, 12, 18, 255". I did have them send: "255, x, 8, 18, 0, 255" for a little bit. But really, seeing your numbers is weird. They are somehow different between machines, and sometimes different when running here or there. What the hell could they mean?

tatarize commented 4 years ago

Added in the hook to force it to abort waiting, assuming it's what it kinda looks like it might be. The easiest method to prove this would be to run it in debug in PyCharm; then you could pause the operation while it's running when it does that, and see if the pause spot in the thread running the controller is stuck in that code for wait().
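
As a rough illustration of what such a hook might look like, here is a sketch of a wait loop that can be released from outside. The names `wait_finished`, `is_finished`, and `abort_event` are hypothetical and are not the actual MeerK40t functions; the real wait() reads status from the USB device.

```python
import threading


def wait_finished(is_finished, abort_event, poll_interval=0.05):
    """Poll until the controller reports finished, or until someone aborts.

    Returns True if the finished status was seen, False if the wait was aborted.
    """
    while not abort_event.is_set():
        if is_finished():
            return True
        # Sleep, but wake up immediately if the abort is triggered.
        abort_event.wait(poll_interval)
    return False


# A menu item or other forced event would only need to do this:
abort = threading.Event()
abort.set()
```

Using a `threading.Event` rather than a plain sleep means the forced abort takes effect right away instead of after the next poll interval.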

The hooks to just force events in the code are great, and easy to register with the Kernel, but I might need to figure out some way to force them as an event. I guess maybe just add a menu somewhere. Lemme see if my current version is working enough to even release to the working branch, and add that menu somewhere.

joerlane commented 4 years ago

I did it again and left it open on my computer. I think you're right that it is a MeerK40t bug. I've seen a similar freakout, but where the spooler continues to drain until empty; and this is not at all the same.

255, 206, 111, 48, 19, 255: I'm at least half as confused about this difference as you are. I know there are differences between mac, unix, and dos text file formats; but I don't see how either 48, or 19 would be involved in that. Macs have for about 20 years now been using unix style line endings, so there should really only be those two differences. Again; not sure that matters here, but it was the only thing I could think of, and 10 is indeed Line Feed.

This one doesn't have a FNSEFFFF?

Just downloaded PyCharm but haven't used it before. This is something I am able to reproduce now by loading a few raster jobs, firing it up, and then I minimize the application. Seems to hang up pretty quickly after that point; yet I've run this or similar jobs with application in focus and it works fine. Sometimes if I have a web browser in front of MeerK40t it will also hang; but not every time.

Screen Shot 2020-01-17 at 1 36 19 AM

tatarize commented 4 years ago

Well, I pushed a new version on the branch, and it might be a bit weirder since I'm still moving stuff around. In the Controller it will have the ability to right click, which will let you do a forced operation, and K40-Wait Abort should be in there. That would be a method of testing the stuff.

Couple new and fun bugs: since it's ticking along, it ignores extra ticks that occur before the messaging thread cares about them. So the drawing-line thing in the scene will have little fragments of blue lines, because it will have lost a few ticks' worth. It's just there to notify that they existed rather than try to force the OS to refresh for basically most of them.

tatarize commented 4 years ago

That packet is even stranger. What the hell would possess it to send a packet consisting of only 'F' commands? They are generally either Finish or what LaserDrw used to pad their packets.

tatarize commented 4 years ago

Those numbers are unrelated to your OS. They actually come from the M2 Nano. They are the read data from the controller. So they mean something about what that board is doing. Byte 1 there is translated for you, it's like OK, BUSY, POWER, etc.
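
A small sketch of splitting that six-byte reply into a status name and the unexplained remainder. The numeric status values below are an assumption: they are the values commonly reported for the M2 Nano, and none of them are stated in this thread. `describe_reply` is a hypothetical helper.

```python
# Assumed status values for byte 1; not confirmed anywhere in this thread.
STATUS_NAMES = {
    206: "OK",
    207: "ERROR",
    236: "FINISHED",
    238: "BUSY",
    239: "POWER",
}


def describe_reply(reply):
    """Split a 6-byte reply into a status name and the still-unexplained bytes."""
    if len(reply) != 6:
        raise ValueError("expected 6 bytes, got %d" % len(reply))
    status = reply[1]
    name = STATUS_NAMES.get(status, "UNKNOWN(%d)" % status)
    # Bytes 2..4 are the ones nobody here knows the meaning of yet.
    return name, list(reply[2:5])
```

Under that assumed table, the packet `255, 206, 111, 48, 19, 255` posted earlier would decode as `("OK", [111, 48, 19])`.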

tatarize commented 4 years ago

This one says USB Not Found. Is it maybe doing something weird with the USB? You could check the log on that.

joerlane commented 4 years ago

I got PyCharm up and running; but it's not looking at the correct python install on my computer. As soon as I figure that out, I can give you a better idea from the debugger. It hung up on missing all of the requirements, pyusb, PIL, and wx.

That's even more strange; the K40 sends different replies? Does it change if I use a different computer, or is it really OS specific?

tatarize commented 4 years ago

PyCharm sets up a venv (virtual environment) for each program you're coding up. The point is you might have different things installed on one, or run a different version, or need to switch what version you're using really quickly, and it's much better to duplicate that and run it fairly sandboxed.

> That's even more strange; the K40 sends different replies? Does it change if I use a different computer, or is it really OS specific?

I honestly have no idea. Byte 1 is status information and is wildly needed. The rest were there too so I added them to the dialog. Since maybe they did something cool that Scorch just hadn't figured out. But, they all basically sat there doing nothing, so I didn't think too much of it. I had noticed they changed for me briefly when I was rastering something yesterday. Which is why I noticed your Byte 3 and Byte 4 were pretty dramatically different. I still don't know what they do. But, it's kinda interesting. And if they do something useful that would be all the better. I did check and they don't seem to change based on end-switch hitting.

joerlane commented 4 years ago

I rebuilt a similar job and it didn't hang up yet; never mind. It's starting its antics now. and..... Wait for it.... Boom: Took a lot longer this time, though I don't know why.

I wasn't running a real burn, so I'm not sure about the output; but K40-Wait Abort resumed the job successfully.

Screen Shot 2020-01-17 at 2 34 19 AM

joerlane commented 4 years ago

The machine just Homed itself after finishing the job. When I tried to quit MeerK40t it started spinning at me (crashed). It probably still thinks the job is running? It didn't so much as complain this time; but did hang up around the time I would expect it to report as much.

tatarize commented 4 years ago

Hm. It doesn't seem like it should lose that control code even if it did manage to hang. It seems like there would need to be a USB problem that results from lower resources such that it somehow can't successfully read the usb data. -- That doesn't really make much sense.

I did tweak a lot of stuff with the GUI as such to try to make it not hang the OS so maybe that's part of the time to hang thing. It's why I went from this is just about ready to publish, to let me tear this thing apart something fierce. I'll look at the code some more. Maybe the slowdown is causing a race condition or something, but there really is only the one thread there so it shouldn't do anything weird. And even if OSX went about reducing resources for background processes it shouldn't really do that.

Yeah, the latest version is screwy with threads. So it basically will keep trying to run after you close it, until I find all the threads and get them all to close down correctly. Though, in theory, the Emergency Stop button should work fairly well now. Though I have a crash from that which I haven't committed a fix for yet.

tatarize commented 4 years ago

Nah, it was committed. In theory E-Stop should purge the buffers, the spooler, and hopefully tell the machine itself to abort. Like the I commands should actually make it disregard the buffer.
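
The order of operations described above could be sketched like this. `emergency_stop` and its parameters are hypothetical; the assumption that the "I" command is sent directly rather than through the normal queue is mine, and real packets would need whatever framing the board expects.

```python
def emergency_stop(spooler, buffer, send_now):
    """Drop all pending work, then ask the board to abandon its current task."""
    spooler.clear()   # jobs not yet turned into controller data
    buffer.clear()    # data already generated but not yet sent
    send_now("I")     # assumed: sent outside the normal queue so it is not delayed
```

Clearing both containers before sending means nothing stale can follow the "I" command out to the board.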

joerlane commented 4 years ago

I can try turning off the refresh of the GUI? I also think leaving the controller window open is enough to keep it from hanging up. I'll try that too. It will usually delay repeatedly, with increased frequency right before it finally full stops like that.

I'm running another job (since it had crashed I couldn't use the "same" one) just behind my web browser. I'll continue to bury it deeper, one test at a time, until I can re-create it. It's not hard to do... Just takes a few minutes.

tatarize commented 4 years ago

It sounds like the OS is suspending the background process. Like giving the job less and less resources, until it can't really refresh anymore. Then it might bog down some or crash. I would assume if the newer version I did made it work a bit better that it might be able to fix it completely. In theory the message queue thread is calling to the UI thread to process all the signals because they might need to update the OS. If I wrap it back around so it will never queue up another request to process the signals until after it finished the first one then it shouldn't be able to spawn more threads to delegate to the GUI.

joerlane commented 4 years ago

So it completed behind my browser. I started it again, hid the app, and it began to hiccup. I brought it back to front and set the Do Not Refresh from the menu. After about the same amount of time it begins to hiccup again. If I wait long enough it will always hang up once that begins. Same though; K40-Wait Abort continues the job. I see an FNSEFFFF in this one again.

Screen Shot 2020-01-17 at 2 52 57 AM

So I left the controller open and hid the app; it's again dying. One long pause after another... This time spooler is empty, but it didn't home. It's "hung" again.

Screen Shot 2020-01-17 at 2 56 18 AM

After K40-Wait Abort it homes, and all is well.

Screen Shot 2020-01-17 at 2 57 10 AM

Basically; I don't see the GUI being the issue either way. But I'll have to get the debug setup.

joerlane commented 4 years ago

I think you are correct that it appears to be choked out by the OS. I can hear it pause repeatedly, and bringing it back to focus fixes it.

The new pull is a lot better already; and I can simply kick it back off anytime it has an issue.

joerlane commented 4 years ago

Confirmed; that bringing the application back to focus will clear up any delays, and keep the job from halting.

I'd say this is fixed enough for now. I know I can resume a job if one fails. If this can be rolled into a check MeerK40t performs on itself, that wouldn't hurt, but I can use this without worrying about a job failure.

tatarize commented 4 years ago

Okay, try the latest commit. It should prevent the slow downs from forcing a thread pileup.

https://github.com/meerk40t/meerk40t/commit/2574cdf5f9719e3c0de94256f1a81d5eb2d2731d

tatarize commented 4 years ago

Basically it's delegating a thread to process the batch of signals, but it could be dealing with the signals slow enough that if things slow down, it'll start queuing up a larger and larger number of them that will bottle up and start resource hogging. With this change, here, it won't be allowed to do that. If the one thing dealing with the signals isn't finished, asking for a second is strictly disallowed. In theory, even with a trickle of resources it shouldn't cause a runaway.
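
A minimal sketch of that "only one batch in flight" rule, not the code from the linked commit. `SignalPump` is a hypothetical name; `dispatch` stands in for whatever hands work to the GUI thread (in a wx app that could be something like `wx.CallAfter`), and `handle` is whatever consumes one signal.

```python
import threading


class SignalPump:
    """Hypothetical handoff from the message-queue thread to the GUI thread."""

    def __init__(self, dispatch, handle):
        self._dispatch = dispatch    # schedules a callable on the GUI thread
        self._handle = handle        # processes a single signal
        self._pending = []
        self._lock = threading.Lock()
        self._in_flight = False

    def signal(self, message):
        with self._lock:
            self._pending.append(message)
            if self._in_flight:
                return               # a batch is already scheduled; do not pile up
            self._in_flight = True
        self._dispatch(self._process)

    def _process(self):
        with self._lock:
            batch, self._pending = self._pending, []
        for message in batch:
            self._handle(message)
        with self._lock:
            # Reschedule only if signals arrived while this batch was running.
            self._in_flight = bool(self._pending)
            again = self._in_flight
        if again:
            self._dispatch(self._process)
```

However slowly the GUI thread drains a batch, at most one `_process` call is ever queued, so a starved process falls behind on updates instead of accumulating requests.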

joerlane commented 4 years ago

I scripted the whole pull/make process real quick; so it now rebuilds a Mac app every time for me.

Just started another job and sent it to the background.

tatarize commented 4 years ago

Okay, good because "I'd say this is fixed enough for now. I know I can resume a job if one fails. " doesn't completely apply given I don't think I could reproduce the error myself. Unless Linux does it too. And jobs should be allowed in the background. You need to watch the laser not necessarily the screen's GUI.

tatarize commented 4 years ago

And reproduction is the first thing to fixing stuff. Also, I'm sort of jazzed about that code. It seems like it'll scale nicely with the amount of resources and just start skipping kind of pointless updates when they aren't extremely needed. And some of my testing found a similar runaway and this just isn't entirely fixed yet.

joerlane commented 4 years ago

New pull and it still happens if I hide the app. Same situation if I bring it back to front every time it stutters; it will "refresh" itself, and continue a little longer.

Same deal though; don't waste too much time on something I can avoid entirely. But at some level; shouldn't the app know it's been sitting here for 10 minutes now and the spooler isn't flowing? Is it dangerous to kickstart it automatically?

tatarize commented 4 years ago

You actually could make an operation that could take a really long time to complete. I was running tests and made some that took ten minutes, easily. You just need to draw a very large line very slowly. I'm guessing the resources aren't running away; the code change is pretty good still. I might have roughly duplicated it by setting the priority to low and running some really processor-intensive operations. It started flickering and going black. Maybe it's reproducible.

joerlane commented 4 years ago

Yes; that's how it starts. Just keep going like that and it will "amplify". It gets worse and worse until it finally just halts.

I found instructions to setup PyCharm so we might meet in the middle here.

tatarize commented 4 years ago

Gave it some resources again and it started right up when it stopped being livelocked. It should be fine to lose resources, but it should not halt at the wait(finish) bit of code. It should start stuttering and delaying but it shouldn't break at that point. That's what's really strange there. Like maybe the USB is timing out on the read it needs, so the controller lost the state somewhere. It's really supposed to send that flag.

joerlane commented 4 years ago

Screen Shot 2020-01-17 at 3 36 29 AM

tatarize commented 4 years ago

The correction I made should have prevented the amplification, though not the resource loss. So it might be corrected and just not fix the resource loss; the amplify part was what should have been fixed.

tatarize commented 4 years ago

Yeah, if you run from that with the debug button you can do some cool things like see where the code stops when you hit the pause button. So if the thread ever gets lost somewhere strange you can at least identify the code segment. And the debug functions for stopping code at various places is quite useful. Even without using it for coding purposes it's great for debugging.

joerlane commented 4 years ago

Screen Shot 2020-01-17 at 3 41 35 AM

tatarize commented 4 years ago

That's the holding pattern for when the backend is full and has stopped sending data, enough that the buffer max limit was hit. In theory you could hit the check box for getting rid of the buffer, it would finish making the entire dataset then only have the other end busy trying to send stuff. Which might fall into a different holding pattern.

You can hit resume in the debug tab at the bottom and pause it some more. There's other threads in other places.

tatarize commented 4 years ago

Or not, tried the checkbox there, gave me a crash error.

joerlane commented 4 years ago

This is the other place I saw it stop. There are times I can't seem to pause it though. I can click to my heart's content; but it's "frozen".

Screen Shot 2020-01-17 at 3 48 50 AM

Then I realized I was running a copy of the code in PyCharm; so I fixed that. Created a new project and cloned the repo inside the folder, added libs to venv; and started again. Hopefully something more useful comes up this time. It's already starting to stutter... Helps to hide PyCharm too.

Looks about the same so far. There are times I simply can't pause PyCharm's debugger. These are the four places I can get it to stop: Screen Shot 2020-01-17 at 4 03 02 AM Screen Shot 2020-01-17 at 4 03 17 AM Screen Shot 2020-01-17 at 4 03 37 AM Screen Shot 2020-01-17 at 4 03 53 AM

joerlane commented 4 years ago

If I turn off the buffer limit I get a different stop point in the mix. And the buffer clearly isn't empty yet.

Screen Shot 2020-01-17 at 4 08 14 AM

Screen Shot 2020-01-17 at 4 07 15 AM

tatarize commented 4 years ago

Yeah, the beep command sends a command to wait until the buffer is empty. Since it's supposed to beep when the job is finished not when the writer is done making code. It really does look like it's just missing that status update that says "finished" and is thus forced into waiting for a device that isn't actually doing anything to finish doing what it's doing.

I don't however see how that relates to the GUI getting bogged down, unless somehow that's going to make the USB miss some signal read packet. Just kinda headscratching: the controller missing that thing it shouldn't miss. I guess I could make some code to automatically assume it's finished after a timeout, or have the wait_finished code time out itself. Though if you were cutting a 5 inch by 5 inch square out, at like 5mm/s, you'd be looking at 101 seconds. And the command for go right 5 inches is literally like Bzzzzzzzzzzzzzz155 or something (each z = 255 1/1000th of an inch, other letters equal their given value a=1, b=2, etc., numbers are less than 25 and taken literally). That entire square could easily fit in your buffer, so how long is long enough for a timeout?
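
The arithmetic behind that example can be checked with a short sketch. Both helpers are hypothetical, and `encode_distance` follows only the simplified scheme given in the comment (z = 255 mils, a = 1, b = 2, ...); writing the remainder as a zero-padded three-digit number is my assumption, and the real encoding may have more cases.

```python
MM_PER_INCH = 25.4


def cut_time_seconds(path_length_inches, speed_mm_per_s):
    """Time to traverse a path at constant speed, ignoring acceleration."""
    return path_length_inches * MM_PER_INCH / speed_mm_per_s


def encode_distance(mils):
    """Encode a distance in 1/1000 inch: a run of 'z' (255 each) plus a remainder."""
    if mils < 0:
        raise ValueError("distance must be non-negative")
    full, rest = divmod(mils, 255)
    out = "z" * full
    if rest == 0:
        return out
    if rest <= 25:
        return out + chr(ord("a") + rest - 1)   # a=1, b=2, ... y=25
    return out + "%03d" % rest                  # assumed: literal 3-digit number


# Perimeter of a 5 x 5 inch square is 20 inches.
square_time = cut_time_seconds(20, 5)
right_five_inches = "B" + encode_distance(5000)
```

The square takes 101.6 seconds, and 5000 mils works out to 19 z's plus 155, so a move that lasts over 25 seconds per side fits in a couple dozen bytes. That is why packet size says nothing about how long the board will stay busy.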

joerlane commented 4 years ago

This is one I'm not sure about. Now that I can debug; there are other things you can try I think...

From what I gather, you are saying that if I were to send a command to drive the laser slowly enough, I could be waiting 10 minutes between a single transaction with the laser cutter, in this same holding pattern in code? How COULD software know that it's really hung up like this, and not just doing its job?

How would one implement a keep alive? Can you ask the controller where it is right now; while it's still busy? I need to go back and read the specs again on this. Just throwing out some ideas off the cuff. A microphone could pretty easily indicate if the laser cutter decided to stop... I'm not sure how else to approach that problem.

I failed to mention; booted up my Pi 3, pulled the latest and it said wx wasn't installed. I thought that was weird, and asked the setup.py utility to install again... It decided to compile the source (again). At least it's a Pi 3 and that only takes 24 hours as opposed to 96 hours. I should have no trouble stressing a Pi to the same point of conflict once I get up to speed again.

joerlane commented 4 years ago

Tried to quit from the program that I was running in PyCharm debugger after clearing a job with K40-Wait Abort twice. The only thing I can see is this... It's hung in a crashed state just like when I quit my compiled OS X app. I can't seem to get it to pause anywhere else in the code.

Screen Shot 2020-01-17 at 4 49 42 AM

tatarize commented 4 years ago

No. You can't. The only way to know it's done is by catching that one finished flag. If it's still running and you send an I command it should kill what it's currently doing and stop right away. You might be able to take an educated guess if the controller is flashing BUSY/OK while you're querying it, but I'm not at all sure that'd work either.

In fact, Whisperer can't even queue jobs like my code generally does, since it can't queue that F (wait) command procedure. There is a different method where you reset the mode, though you'd be unable to pad the packets in that mode.

Also, there might be some magical other way to query such a thing that is simply unknown, since LaserDrw never performed it. It's entirely possible some other command on the board does something wildly useful that fixes the problem, but the command is just unseen and thus unknown.

Ah, that last fail there isn't too much of a concern. I've not yet finished coding it. It's the Scheduler for the kernel. The idea is other things should be able to set jobs to be done. Like: check my temperature gauge every 5 seconds, and if above this value, perform this action. With that code there you could just make a job for that without writing your own thread. It's just that currently it doesn't shut down correctly. Less a bug and more incomplete there.
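
To make the idea concrete, here is a small sketch of a scheduler that runs periodic jobs from one loop and shuts down cleanly. `Scheduler`, `PeriodicJob`, `tick`, and `shutdown` are hypothetical names and are not the kernel's actual API.

```python
import time


class PeriodicJob:
    def __init__(self, interval, action, next_run):
        self.interval = interval
        self.action = action
        self.next_run = next_run
        self.cancelled = False


class Scheduler:
    """Runs registered jobs when they are due; one thread can call tick() in a loop."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.jobs = []

    def add(self, interval, action):
        job = PeriodicJob(interval, action, self.clock() + interval)
        self.jobs.append(job)
        return job

    def tick(self):
        now = self.clock()
        for job in list(self.jobs):
            if job.cancelled:
                self.jobs.remove(job)
            elif now >= job.next_run:
                job.next_run = now + job.interval
                job.action()

    def shutdown(self):
        # Mark everything cancelled so no job fires after shutdown.
        for job in self.jobs:
            job.cancelled = True
        self.jobs.clear()
```

The temperature example would then be a single `add(5.0, check_temperature)` call, where `check_temperature` is whatever function reads the gauge and reacts.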

joerlane commented 4 years ago

Cool. So you're saying that I should build a sniffer; and brute force Probe the K40 board? That sounds like a reasonable idea... Probably cheaper, easier, and faster than de-capping the chip for inspection. Unless you know somebody; or know somebody that knows someone...

joerlane commented 4 years ago

Tried to set Zoom To Size from menu and got an error:

    Traceback (most recent call last):
      File "meerk40t/MeerK40t.py", line 1217, in on_click_zoom_size
      File "meerk40t/MeerK40t.py", line 959, in focus_on_project
      File "meerk40t/LaserRender.py", line 256, in bbox
    AttributeError: 'LaserNode' object has no attribute 'box'

Since the recent updates; I believe all of my halt points have included an FNSEF in the packet text when frozen solid.

Otherwise; I can always NOT hide the app when it's running a job, and things work just fine.

Screen Shot 2020-01-17 at 5 38 22 AM Screen Shot 2020-01-17 at 5 40 32 AM

tatarize commented 4 years ago

https://github.com/meerk40t/meerk40t/commit/c98d5f101000097db61073ca643cd795c2957e1c Zoom To Size Fix.

I'd kinda love to see the code for that chip. Might tell me what those other numbers mean.