luizribeiro / mariner

Web interface for controlling MSLA 3D Printers based on Chitu controllers, such as the ones by Elegoo and Phrozen.
MIT License
246 stars 65 forks source link

Mariner stopping print progress #311

Closed cbanderson22 closed 3 years ago

cbanderson22 commented 3 years ago

Hello there, I seem to be having an issue with Mariner stopping progress during a print. The interface resets to "printer status: ready", while the lcd shows the print in progress but with no way to resume printing where it left off. Log shows nothing except "started mariner3d printer controller".

Is this the Pi Zero restarting during progress? log

luizribeiro commented 3 years ago

@cbanderson22, I think we'll need more information here so we can debug this.

What is the printer you are using? What is the setup like? Are there any other symptoms of the issue?

cbanderson22 commented 3 years ago

Hey there, the printer I'm using is an Elegoo Saturn. I've tried it with 2 setups: both the main guide word for word and the following guide: https://www.reddit.com/r/ElegooSaturn/comments/lvne9u/saturn_rpi_mariner_samba/

I used external 5v power with the main guide and a usb cable with snipped 5v line, and usb 5v with the other guide. Pi Zero WH in both.

As for other symptoms, a couple of hiccups. The "unexpected response" messages that don't do anything.

Mariner constantly saying "invalid file: not found or empty" (or something along those lines) when I first upload a file. It would recognize the file eventually after a minute or so.

Once in a while I'd get a 400 error, but refreshing the page would get rid of that.

I was hoping the log would have more info, but that's all it had inside. I can attempt to run it again tomorrow after work once my current print is complete (using a thumb drive). Let me know if you need more information. Thanks!

cbanderson22 commented 3 years ago

Oh, then there's also the log itself showing completely wrong times on entries. Don't know if that would affect anything.

color65 commented 3 years ago

I have a similar problem with my Sonic Mini 4K. I can load files and start printing, but at the middle of the job, the printer freeze and the LED array is off. I can jog manually, when hit "Pause printing", the printer response, the Z axis move up, when hit "Resume printing", the platform go down to the layer position and nothing happened. The web interface show "Printer ready" This happened random, twice from around 20 -30 successful prints

Zenroth commented 3 years ago

Just had the exact same issue as color65 also with a sonic mini 4k. Came back from an overnight print and it was sitting at 71% complete with nothing happening. Pause would cause the z axis to move up, resume would move back down but nothing else would happen.

Any log entries/etc that would help? I assume Mariner only does read operations from the printer while printing? Though these do sometimes seem to trigger exceptions on the web interface. Completely off the cuff, as I haven't examined any of the code, but my initial thought would be something with the PI mounted volume hiccupping, and causing the printer to internally fail somehow.

hansale commented 3 years ago

Also have a Sonic Mini 4K, I’ve printed 8 or so files so far and had this happen last night. Got to 67% then the exact same issue as described by the other Sonic Mini 4K user.

luizribeiro commented 3 years ago

Completely off the cuff, as I haven't examined any of the code, but my initial thought would be something with the PI mounted volume hiccupping, and causing the printer to internally fail somehow.

@Zenroth, your guess seems very reasonable to me. You're right that mariner just reads the state. I would look at dmesg for anything fishy at around the time where the print stopped and other logs that would help debugging any Linux USB Gadget issues.

I would also try making the mount-point read-only to see if the problem goes away (though this would be just temporarily for debugging of course, otherwise you wouldn't be able to upload :)).

Though these do sometimes seem to trigger exceptions on the web interface.

You're probably referring to #180. If so, those are harmless. It's just mariner getting confused when there's a race between two printer responses.

luizribeiro commented 3 years ago

I'm also curious how you guys are powering the Pi with the Sonic Mini 4K. Are you using a buck converter? Powering from a pin on the printer's motherboard? Something else?

Zenroth commented 3 years ago

I'm also curious how you guys are powering the Pi with the Sonic Mini 4K. Are you using a buck converter? Powering from a pin on the printer's motherboard? Something else?

All of mine are powered with external power supplies so I doubt it would be a low voltage dip or something of that sort causing a blip/reset.

I’ve not ran into the issue again so far, but it’s been low amounts of printing for me of late. Will try to dig into the logs if/when it comes up again.

color65 commented 3 years ago

My Pi is powered from 24V line of the Sonic by buck converter

hansale commented 3 years ago

I'm also curious how you guys are powering the Pi with the Sonic Mini 4K. Are you using a buck converter? Powering from a pin on the printer's motherboard? Something else?

Currently I'm powering it with a USB power code plugged into the wall. I haven't switched over to the Buck converter, as I had to buy a 24v since the rail is stronger in the SM4K. I think that the failure to upload is important, as I had the same problem, where I got an upload error and file not found error, then just kept hitting upload and print until it printed. My spitball is that the file is corrupted in some way, as the printer is showing a blank screen, maybe it's indicating that that layer is empty? When I get a chance I could pull back down the corrupted file over SCP (I can't share it since it's copyrighted material), and open it up with Chitubox to check.

luizribeiro commented 3 years ago

@hansale, can't you simply compare the md5 hash of the two files to see if they are identical?

hansale commented 3 years ago

@hansale, can't you simply compare the md5 hash of the two files to see if they are identical?

Yes, that works too, except I deleted the file the other night 🤦🏻‍♂️

hansale commented 3 years ago

@luizribeiro the files are identical. Did an md5sum on the original and the copied file in the usb_share drive.

I had the same issue occur again (after a successful print). I tried a few things.

I tried copying the .ctb file to a usb drive, then tried pausing and resuming. This did not change the behavior (still paused, raised the build plate, came back down, then did nothing to continue printing).

I then did an scp of the file from mnt_share to my slicing computer, and the md5 results were identical to the generated file.

By the way thanks for your work on this, I really love this tool and want to help however I can.

hansale commented 3 years ago

I was thinking about what could be causing this, I run a space heater in my office that sometimes makes the lights flicker when it cycles on. It's possible that's causing an issue, maybe the pi is sensitive to the power draw? Curious if other people have the issue are using a space heater.

Another idea is possibly that I'm using a cheap microsd card from Microcenter, a 32GB. Maybe some sort of missed read or something strange is causing an issue that's getting the printer into a weird state?

Final idea is that I formatted 16GB of space on the microsd card for the usb_mount so I could keep more files on there if I wanted. Maybe this is causing issues?

Maybe we could make a ticket for adding a debug mode that logs out all the responses we're getting from the printer instead of discarding them, maybe that would help in troubleshooting so people could send us something from /var/log/mariner3d.log

cbanderson22 commented 3 years ago

I was thinking about what could be causing this, I run a space heater in my office that sometimes makes the lights flicker when it cycles on. It's possible that's causing an issue, maybe the pi is sensitive to the power draw? Curious if other people have the issue are using a space heater.

Another idea is possibly that I'm using a cheap microsd card from Microcenter, a 32GB. Maybe some sort of missed read or something strange is causing an issue that's getting the printer into a weird state?

Final idea is that I formatted 16GB of space on the microsd card for the usb_mount so I could keep more files on there if I wanted. Maybe this is causing issues?

Maybe we could make a ticket for adding a debug mode that logs out all the responses we're getting from the printer instead of discarding them, maybe that would help in troubleshooting so people could send us something from /var/log/mariner3d.log

Well, if it helps I can say I'm using very stable power, a branded SanDisk drive, and only using a 4gb partition.

Zenroth commented 3 years ago

I was thinking about what could be causing this, I run a space heater in my office that sometimes makes the lights flicker when it cycles on. It's possible that's causing an issue, maybe the pi is sensitive to the power draw? Curious if other people have the issue are using a space heater. Another idea is possibly that I'm using a cheap microsd card from Microcenter, a 32GB. Maybe some sort of missed read or something strange is causing an issue that's getting the printer into a weird state? Final idea is that I formatted 16GB of space on the microsd card for the usb_mount so I could keep more files on there if I wanted. Maybe this is causing issues? Maybe we could make a ticket for adding a debug mode that logs out all the responses we're getting from the printer instead of discarding them, maybe that would help in troubleshooting so people could send us something from /var/log/mariner3d.log

Well, if it helps I can say I'm using very stable power, a branded SanDisk drive, and only using a 4gb partition.

Using SanDisk 8gb sd cards here..

Power is an interesting one, should be fairly stable, has a regular 110v dc supply, but I do have heaters and air compressors that can kick on and cause a flicker. (Mostly the air compressor) Haven't seen it cause any issues with anything else though.

Zenroth commented 3 years ago

Just happened again on a print at 96%, nothing in dmsg for the last 9 days. (Last time I booted the raspberry pi), nothing in syslog or daemon, etc.

MD5 hashes of the particular file match between desktop and pi as well. The slices of it also seem fine and complete when examined.

Wondering if its related to polling get_print_status somehow? I feel like I've intentionally been avoiding leaving the status open or refreshing it on previous prints that haven't failed. This particular print I was checking it a few times, and here we are..

Zenroth commented 3 years ago

Looking at the API for print_status it seems like this causes a CTB file read each time to compute the current and remaining layers?

My reading up on g_mass_storage type devices, seems like there is problems with multiple devices accessing the mount at the same time potentially? (Seems odd if its only reads, but not sure if there is caching/etc going on here with the low level apis that get used by the printer/etc which could get janky perhaps.) This anyways strikes me as a potential timing issue, where the printer could be reading the data for the next layer, and the Mariner status update hits and attempts to read the same file. Or vice versa, Mariner is reading the file currently, and the printer tries to access it and it fails for some reason/etc.

If this is the case, it would also explain why I feel like I generally have success if I never have the print status window open during a print.

I'm going to try updating my local copy to only show current bytes and total bytes instead of layers. This byte data seems to be returned by m4000 over the serial connection, so it would mean only the printer should ever be touching the CTB file during a print.

Incidentally, I also realized that the more browsers open of the status, the more potential issues there could be at the moment, since there is no single thread/queue for print commands and responses. Each browser session seems like it would be trying to issue its own commands to the serial interface, read, flush, etc. (Unless there is some magic in the front end that I've missed.)

hansale commented 3 years ago

interesting, if this is the case we might be able to make it easily reproducible with some aggressive tests that blast the poor little pi with a million requests

Zenroth commented 3 years ago

I'm not a 100% sure (python isn't my strong area, web dev even less), but it seems like Flask is setup to cache the file read initially. However, I've caught an exception or two now on an unmodified version of Mariner (other than adding syslog logging) where the CTB read seems to fail, with a file not found, during a print_status update. Which seems like the cache failed or timed out? It didn't cause the printer to freeze this time.. but it is interesting.

Apr  2 20:56:29 raspberrypi mariner[2027]:   File "/opt/venvs/mariner3d/lib/python3.7/site-packages/mariner/server/utils.py", line 18, in read_                                                                   cached_ctb_file
Apr  2 20:56:29 raspberrypi mariner[2027]:     return CTBFile.read(FILES_DIRECTORY / filename)
Apr  2 20:56:29 raspberrypi mariner[2027]:   File "/opt/venvs/mariner3d/lib/python3.7/site-packages/mariner/file_formats/ctb.py", line 140, in                                                                    read
Apr  2 20:56:29 raspberrypi mariner[2027]:     with open(str(path), "rb") as file:
Apr  2 20:56:29 raspberrypi mariner[2027]: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/usb_share/house3_siriya.ctb'
Zenroth commented 3 years ago

So I caught the serial output on this current failure, very odd. The bytes seem to complete, and it switches to a pause state.. Then the total bytes as well as completed bytes change, and the pause state.. Yet the printer shows it at 93% and has our familiar pattern of behavior.

Print status continues to run forever, and remains reporting: Apr 3 05:27:52 raspberrypi /mariner: ok B:162/0 E1:159/0 E2:158/0 X:0.000 Y:0.000 Z:55.770 F:0/0 D:167678/0/0 T:0#015

Apr  3 05:24:52 raspberrypi /mariner: ok B:163/0 E1:159/0 E2:158/0 X:0.000 Y:0.000 Z:57.964 F:256/256 D:11161353/11165696/0 T:43289#015
Apr  3 05:25:51 raspberrypi /mariner: ok '/house3_siraya.ctb'#015
Apr  3 05:25:52 raspberrypi /mariner: ok B:164/0 E1:160/0 E2:159/0 X:0.000 Y:0.000 Z:89.835 F:0/0 D:11165696/0/1 T:0#015
Apr  3 05:26:51 raspberrypi /mariner: ok '/house3_siraya.ctb'#015
Apr  3 05:26:52 raspberrypi /mariner: ok B:163/0 E1:159/0 E2:159/0 X:0.000 Y:0.000 Z:114.834 F:0/0 D:40815/11378946/1 T:43358#015
Apr  3 05:27:51 raspberrypi /mariner: ok '/house3_siraya.ctb'#015
Apr  3 05:27:52 raspberrypi /mariner: ok B:162/0 E1:159/0 E2:158/0 X:0.000 Y:0.000 Z:55.770 F:0/0 D:167678/0/0 T:0#015
Zenroth commented 3 years ago

Just started the same print over, and this time total bytes is 11378946 (Which matches the correct size of bytes on the desktop file) instead of the previous 11165696?!?!

Also notice the output on my log posted, adjusts to show correct total bytes after the print breaks... Hmm

Zenroth commented 3 years ago

Certainly seems like some sort of sync issue with storage.

Upload a new file, check its byte size on the share via winSCP the size matches desktop file size. Go to print it, and its read short. Stop the print, restart it, size is correct..

Zenroth commented 3 years ago

So I've been running 3 printers mostly full time for the last 3 days. All are running my modified Mariner install to not attempt to read the CTB file each print status, and report bytes vs layers. (Though I'm less sure this is actually a problem after the previous findings)

I've noticed a handful of things:

  1. Uploading via the web interface has been very problematic for me in general. (lots of errors, or being very slow)

  2. About 50-60% of the time when I have uploaded via the web interface, and go to start a new print, the printer reported total byte size of the CTB file is short, and would result in our frozen print bug here. Hitting stop print, and then selecting the file to print again has so far 100% of the time worked in the correct total bytes being detected, and the print working fine.

  3. Because of the issues surrounding upload via the web interface for me, I've started just uploading files via WinSCP. This seems to be completely stable and fast for some reason. Print wise, it seems the printer generally doesn't see the file uploaded for 5-10 seconds (attempting to print immediately after an upload, will result in an error), but when the print is actually allowed to start the total bytes has been 100% correct.

I unfortunately have no idea why the python based upload and print start process seems to be problematic, or often the total bytes is reported wrong, causing this issue... I've dug through code, and was wondering if it might have been something as simple as the file write needing to be flushed, but most of all the file handling for uploads seems to be part of Flask and its dependencies, so I didn't get far.

I can say with my change in work flow and modification to Mariner I've successfully 100% eliminated this bug though in my testing.

I'll attempt reverting Mariner back to default on one of the machines, while using WinSCP uploads, and see if I come across the behavior again or not.

luizribeiro commented 3 years ago

I hit this for the first time today on my Elegoo Mars Pro (I haven't been printing much lately tbh, so not sure how frequent this is).

I don't think this is an issue with upload. The files on the Pi and on my computer were identical (like others reported). Printing the exact same file again resulted in a successful print too, which rules out upload issues.

I didn't have the mariner UI open on my browser, so mariner was definitely not sending any commands to the printer, as it only sends commands when the /print_status or the /printer/command/<command> endpoints are hit. So I don't see how this would be an issue within the Python code either.

I had this on my dmesg, so I might try disabling this service (but I don't have high hopes about it):

[4666268.662333] systemd-fstab-generator[16801]: Checking was requested for "/piusb.bin", but it is not a device.

I think it's more likely that there was a failure while reading from the USB mass storage gadget.

The printer probably fails to read a block from the device through USB, the printer's software doesn't handle that error well and just leaves the printer firmware in a weird buggy state. Maybe we need to play with some of the parameters of the g_mass_storage kernel module. It might be worth enabling debugging on the Linux kernel and bumping the log level to 7 to see if anything comes up.

Zenroth commented 3 years ago

Well another week of printing every day across multiple machines.

I'm not sure what causes it, but it does seem at least in my case if I upload a file via the Mariner interface and then go to print that file, often the total bytes read by the printer is less than it should be. (At which point the print will fail once it gets to that spot) Hitting stop, and re-selecting the file to print will usually resolve the issue.

If I upload via SCP which tends to work better all around for speed and reliability as well, it can take 15-30 seconds for the printer to "see" the file, but total bytes are always correct, and this bug doesn't happen.

All of this does indeed seem g_mass_storage related, and possibly related to file flushing/etc in the different upload methods at work.

luizribeiro commented 3 years ago

I'm not sure what causes it, but it does seem at least in my case if I upload a file via the Mariner interface and then go to print that file, often the total bytes read by the printer is less than it should be.

If you print a second time without re-uploading the file, do you consistently reproduce the issue?

Zenroth commented 3 years ago

I'm not sure what causes it, but it does seem at least in my case if I upload a file via the Mariner interface and then go to print that file, often the total bytes read by the printer is less than it should be.

If you print a second time without re-uploading the file, do you consistently reproduce the issue?

No, it’s a single fail case and next print is fine. Usually what I’ll do is see the bytes are wrong, hit stop, then hit print again and the issue is avoided.

luizribeiro commented 3 years ago

Yeah... I have a hard time believing it's related to upload then, since the file itself is fine. Even if the write buffer was taking a while to get flushed, when I encountered this problem it was well over an hour into the print. That feels like a really long time for a write buffer to not be flushed to disk... But it's possible I guess? Maybe we could write to /proc/sys/vm/drop_caches to forcefully flush them or something. I need to do a bit more of research to understand how g_mass_storage does its writes to disk tbh. Maybe next weekend :)

Edit: to be clear, I don't think it's related to upload, but it could be related to some write cache not being flushed. Alternatively we could also attempt to make the HTTP upload happen to a temporary directory (probably faster file transfer, mainly when people are running their pis from SD cards) and move the resulting file to /mnt/usb_share in some other way that flushes the cache (if we can indeed confirm that lack of flushing is the issue)

Zenroth commented 3 years ago

Edit: to be clear, I don't think it's related to upload, but it could be related to some write cache not being flushed. Alternatively we could also attempt to make the HTTP upload happen to a temporary directory (probably faster file transfer, mainly when people are running their pis from SD cards) and move the resulting file to /mnt/usb_share in some other way that flushes the cache (if we can indeed confirm that lack of flushing is the issue)

Agreed, the uploads in the end seem to be fine, other than maybe they aren't flushing immediately. An hour does seem a very long time though.

I feel like we do need to figure out a way to force a flush/sync update in general. While I never seem to have the bytes being read wrong when uploading directly via WinSCP, I do have to usually wait a while, or attempt to start a print multiple times usually getting 400 errors or file not found errors initially.

I was half wondering about adding some sort of system call that causes the volume to unmount and then remount upon a print request.

luizribeiro commented 3 years ago

From the mass storage gadget documentation:

AN IMPORTANT WARNING! While MSG is running and the gadget is connected to a USB host, that USB host will use the backing storage as a private disk drive. It will not expect to see any changes in the backing storage other than the ones it makes. Extraneous changes are liable to corrupt the filesystem and may even crash the host. Only one system (normally, the USB host) may write to the backing storage, and if one system is writing that data, no other should be reading it. The only safe way to share the backing storage between the host and the gadget's operating system at the same time is to make it read-only on both sides.

Well, that's likely it?

Maybe we can try passing removable=y to g_mass_storage and "insert"/"remove" the USB gadget through sysfs? Or even just unload/reload the g_mass_storage module (though I'd like to avoid that so we can still run mariner without root privileges).

I think ideally we wouldn't even keep the image mounted locally while the image is being used by g_mass_storage... But that would add a lot of limitations (e.g. not being able to upload files while printing), unless we keep the upload directory in a separate place and only sync the files to /mnt/usb_share occasionally.

If you're doing lots of prints, it might be worth just unloading/reloading the g_mass_storage module after uploading files to see if you can no longer reproduce the problem when doing that before we consider making changes?

Zenroth commented 3 years ago

So I can confirm that simply unloading and reloading g_mass_storage doesn't fix this it seems.

However, unmounting and remounting seems to be very solid, and also generally forces the printer to see a uploaded file pretty much immediately as well. (Seems within 5-10 seconds of mounting, no longer get file does not exist/etc errors)

luizribeiro commented 3 years ago

What if you run sync $PATH_TO_THE_FILE just after uploading?

If that works, then we could simply call os.sync() just after uploaded file is saved here: https://github.com/luizribeiro/mariner/blob/master/mariner/server/api.py#L157-L166

Zenroth commented 3 years ago

What if you run sync $PATH_TO_THE_FILE just after uploading?

If that works, then we could simply call os.sync() just after uploaded file is saved here: https://github.com/luizribeiro/mariner/blob/master/mariner/server/api.py#L157-L166

I'll play with this some more, but I went ahead and just added the os.sync() to the upload function. Then proceeded to upload 3mb-100mb files 1 by 1 for 15-20 minutes and attempt to execute a print upon immediate upload. (Usually this would routinely fail with total bytes being wrong, or the printer not seeing the file)

100% success rate, no wrong total bytes, no waiting on printer to see the files, etc.

I'll add the os.sync() to my other 2 printers I'm running as well, and use the web upload and see how that goes. Looks very promising though.

Zenroth commented 3 years ago

So another days worth of prints across multiple printers.

While the upload button is seemingly slow and often also gives me an error 400.. when it does work, I've had 100% perfect byte totals and no issues with the printer "seeing" the file. os.sync() seems to be a winner.

luizribeiro commented 3 years ago

While the upload button is seemingly slow

This is expected. We may be able to sync just the file that was just uploaded which may be faster, but I don't think os.sync() allows for that. We should be able to use os.fsync() instead for that, if we get the file descriptor of the uploaded file. If I have time on the weekend I'll take a look at that if no one else does before me.

We could also make the synchronization of the file happen after the response is served to the client, but it would come at a cost of potential flakiness. So I'm not sure if it's worth it.

often also gives me an error 400..

Do you have more details on this? 400 is bad request, so it would be a problem on the client and not on the server. Are you sure it wasn't a 500?

Zenroth commented 3 years ago

Certainly is 400, I mean upload in general just seems slow. (not necessarily the os.sync) Usually the error 400 seems like a timeout failure, and the wifi link to the shop isn’t the best. (Extended network, different building)

I’m not sure if WinSCP has error recovery and retry logic in it, but uploads never fail via it, and seem to generally be a lot faster.

Anyways the major take away here is that the sync does indeed seem to fix the bug and another issue as referenced.

image

luizribeiro commented 3 years ago

Just FYI that the fix for this is out on v0.2.0. There's also some new instructions on how to upgrade using the newly setup PPA repo here: https://github.com/luizribeiro/mariner/discussions/377