[BUG] Prusa Connect / Link Gcode Corruption on OTA gcode upload

jrgiacone commented 11 months ago

Printer type - [MK4]

Printer firmware version - [5.0 Alpha 4]

Original or Custom firmware - [Original]

USB drive or USB/Octoprint - USB drive formatted as FAT32

Describe the bug - When uploading G-code over the air via PrusaLink or Prusa Connect, the G-code gets corrupted, causing the toolhead to move outside of its path. Sometimes the print continues and fails; other times the printer gets stuck and is unable to move.

For example, the print head will move to the right or north of the object and pause or get stuck, and the print cannot be paused or stopped.

How to reproduce - With alpha 4, upload a G-code file to the printer, then open the file that was written to the printer's USB drive in the G-code viewer and compare it to an export made directly from PrusaSlicer.

Note that if you download the file directly from the Connect cloud it is correct; the corruption occurs when the file is written to the USB drive over Wi-Fi.

Expected behavior - G-code uploaded via Wi-Fi should be identical to G-code copied directly to the USB drive via a USB port.

G-code - Working and Problomatic Gcode.zip

Note that both files are the same G-code: one went directly from PrusaSlicer to USB, the other via PrusaLink. Please compare the tool paths in the G-code viewer to see the issue.

jvasileff commented 11 months ago

The differences in your files Print Directly from Prusa slicer.gcode and Print Uploaded via prusa link.gcode appear to be in 64-byte chunks. It's as if the transfer corrupted the file by mixing in chunks of some other, unrelated file.

This looks like a pretty nasty bug as mixing in random gcodes may harm your printer.

I noticed that the changes occur in 64-byte chunks by analyzing a few of the difference blocks found by running:

diff -u Print\ Directly\ from\ Prusa\ slicer.gcode Print\ Uploaded\ via\ prusa\ link.gcode

The blocks I looked at all had sizes that were multiples of 64 bytes.
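
A byte-level comparison can make this kind of check repeatable. The following is a minimal Python sketch, not something that was used in this thread. It assumes both files have the same length and that damaged regions start on block boundaries; the name differing_regions and its block parameter are made up for illustration. Picking a block size smaller than the suspected corruption size (16 here) lets the reported region lengths show whether they really are multiples of 64.

```python
def differing_regions(a: bytes, b: bytes, block: int = 16) -> list[tuple[int, int]]:
    """Return (byte_offset, byte_length) for each run of consecutive aligned blocks that differ."""
    regions = []
    run_start = None  # index of the first block in the current differing run
    n_blocks = (max(len(a), len(b)) + block - 1) // block
    for i in range(n_blocks):
        lo, hi = i * block, (i + 1) * block
        if a[lo:hi] != b[lo:hi]:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            regions.append((run_start * block, (i - run_start) * block))
            run_start = None
    if run_start is not None:
        regions.append((run_start * block, (n_blocks - run_start) * block))
    return regions


# Synthetic demo data (not the attached files): overwrite 64 bytes with a copy from elsewhere.
good = bytes(i % 251 for i in range(1024))
bad = bytearray(good)
bad[128:192] = good[512:576]
for offset, length in differing_regions(good, bytes(bad)):
    print(offset, length, "multiple of 64" if length % 64 == 0 else "not a multiple of 64")
```

For the real files, read each one with open(path, "rb").read() and pass the two byte strings in.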

vorner commented 11 months ago

I'd like some confirmation/clarification. By „Upload by wifi“ you mean the PrusaLink interface on the IP address of the printer / through IP address from slicer?

Does it happen every time, or just sometimes? How often?

Can you try with a different USB stick?

jrgiacone commented 11 months ago

It has occurred when uploading through the printer's IP from PrusaSlicer, when uploading G-code directly in PrusaLink, and when sending a file from Prusa Connect to the printer.

Last night I did another upload via Prusa Connect and the file was identical, so I don't think it happens on every upload. I have tried two different USB sticks: the stock one and a new one.

JohnnyDeer commented 11 months ago

@jvasileff we are trying to reproduce this problem and have a question to improve the reproduction process. Do you use a Mac or Windows?

jvasileff commented 11 months ago

@JohnnyDeer I was only analyzing the files posted by jrgiacone. I have not reproduced this myself. This issue caught my eye because I have seen Connect become out of sync with the files on the printer's USB drive, and Connect seems to trust filenames too much when a new version is uploaded without changing the filename, but so far that seems unrelated.

jrgiacone commented 11 months ago

I have reproduced it on Windows and Linux, primarily using Linux. I don't have access to a Mac. It does not appear to happen on every upload, and a power reset seems to fix it sometimes?

jrgiacone commented 11 months ago

gcode 2.zip

It happened again. I labeled the file that came from the PrusaSlicer OTA upload.

Prusa-Support commented 11 months ago

Thank you for bringing this to our attention.

We acknowledge the presence of the bug and our developers are attentively investigating this issue. Unfortunately, the bug is hard to reproduce.

We would appreciate gathering further details for the reproducibility of the problem and possibly feedback from more users.

Here are examples of possibly useful pieces of information that we may want to collect.

Exact steps and patterns followed by the users before the occurrence of the issue.
Computer hardware and operating system information.
Network and connectivity information like router model, distance of the printer from the router, internet speed tests made with other devices in the printer's position...
Programs and operations running on the computer that may take up substantial computer and internet resources.
Devices connected to the computer or to the same network that may take up substantial computer and internet resources.
Number of devices connected to the same network.
Status of the printer.
Last operations performed on the printer.
Third-party devices connected to the printer.
Reproducibility of the problem via Ethernet or via a separate Wi-Fi network (e.g. mobile hotspot).
Examples of corrupted files (before and after the data transfer).

Michele Moramarco Prusa Research

jrgiacone commented 11 months ago

Hi Michele,

Exact steps and patterns followed by the users before the occurrence of the issue.

This has occurred both after the printer has been sitting idle for a while following a completed print, and right after a restart. I am unsure whether board temperature has anything to do with it; the buddy board is typically at 55-60C when printing.

Computer hardware and operating system information.

Primary computer: Arch Linux, kernel 6.4.7, Intel i7 6700HQ
PrusaSlicer: https://archlinux.org/packages/extra/x86_64/prusa-slicer/

Windows computer: Windows 10, AMD 5800x, Nvidia 3080, running PrusaSlicer release 6.0

Network and connectivity information like router model, distance of the printer from the router, internet speed tests made with other devices in the printer's position...

Modem: Spectrum Model EN2251
Router: Spectrum Model SAX1V1R
Distance: about 15 feet away
Speed from the router, measured from the same position: 450 Mbps download and 20 upload

Programs and operations running on the computer that may take up substantial computer and internet resources.

The only programs running would be Firefox and prusa slicer as the system was tested at pretty much idle.

Devices connected to the computer or to the same network that may take up substantial computer and internet resources.

No Devices are connected to the laptop. The desktop has 2 monitors, a mouse, and a keyboard.

Number of devices connected to the same network.

Typically 6-10 Devices are connected.

Status of the printer.

Unsure on this one. I think it is typically idle or ready to print?

Last operations performed on the printer.

This has occurred both when transferring after a completed print, and after a restart.

Third-party devices connected to the printer.

Nothing is connected to the printer.

Reproducibility of the problem via Ethernet or via a separate Wi-Fi network (e.g. mobile hotspot).

Have not tried via ethernet, this has only been tested via wifi

Examples of corrupted files (before and after the data transfer).

gcode.2.zip
Working.and.Problomatic.Gcode.zip

Note that the files in each folder are labeled to show which one was transferred OTA (over the air) from PrusaSlicer to the printer.

The Gcode 2 folder is from printing this STL: https://www.printables.com/model/506041-prusa-mk4-filament-guide-for-ptfe-fitting

Comparing the files in gcode 2, the corrupted one appears to have a randomly added filament swap that just moves the print head away and then back without any user interaction (note: this was not intentional and was not present in the G-code copied directly to the USB drive from the computer's USB port).

I am happy to help test in any other way or provide any other information needed. @Prusa-Support

jrgiacone commented 11 months ago

Gcode Example 3.zip

It seems to have done it again when uploading from PrusaSlicer OTA. Note that when I re-uploaded the file, the second copy had no issues. Attached are screenshots of the differences in the G-code: Diff 1, Diff 2, Diff 3

mix579 commented 11 months ago

If OP hadn't said s/he tried two USB sticks, this has all the hallmarks of a corrupted USB stick?!

jvasileff commented 11 months ago

I took a closer look at gcode 2.zip and Gcode Example 3.zip, and what I found about the corruption is pretty interesting. Below, I'll refer to the "good" gcode file as v1, and the corrupted file as v2.

The corruption occurs in exactly 32-byte chunks. To analyze the corruption, we can first split each file into a bunch of 32-byte files with a command like split -b 32 -a 4 v1.gcode. Fingerprints for each chunk can then be created with find v1 -type f -exec md5 {} + > v1-md5s. We are then left with two files - v1-md5s and v2-md5s, which fully describe the original and corrupted gcode files in 32 byte chunks.

Analysis can be done with commands like diff -u v1-md5s v2-md5s, yielding:

--- v1-md5s     2023-07-29 12:09:55.952211454 -0400
+++ v2-md5s     2023-07-29 12:10:55.186357521 -0400
@@ -469,7 +469,7 @@
 MD5 (./xckhs) = 84e3ffd2863ba304f27d146a2dc59996
 MD5 (./xavya) = f3f08a555cef524001fab082ee6a6a50
 MD5 (./xakwl) = 285d1a7aa761de0db83219e19e28e388
-MD5 (./xakqu) = bed64c8ba35a13cbdbffcde3c9b39e27
+MD5 (./xakqu) = 862a0484fb481016590124d798f229f3
 MD5 (./xbwme) = 6bea35919a15b764ff5ed25351f4e07d
 MD5 (./xcand) = 75d0ae1e30c6b006b7ed840374c12b97
 MD5 (./xbhfs) = 48460aac9f132df8f4565c027a2cfad2
@@ -4472,7 +4472,7 @@
 MD5 (./xcikp) = 93bb8e33b593b70d49ab8b9c53c265c2
 MD5 (./xatzb) = 79eeb96b402d344ac274854716a61614
 MD5 (./xakwm) = dfd8a12e34b0080d2993659e5bf3e560
-MD5 (./xakqt) = 18c413f2d7af5294875aa3807c5e9cb7
+MD5 (./xakqt) = 3ef43351da57b392f9359baf6cc88d0e
 MD5 (./xcane) = 401827f26696585ff8b4e7199af159dd
 MD5 (./xbwmd) = 612723674e9ef948cb19986835f97226
 MD5 (./xbzqk) = 6804cafcfeab5fed4b980235d729a610
@@ -4887,7 +4887,7 @@
 MD5 (./xatze) = 715cfef9350c0c72eac8ffeb9e893f74
 MD5 (./xcikw) = 251d94b211b9fc8348990da3825b19f6
 MD5 (./xcimn) = 44c8ec41cdcf40294eda5bd97df537b7
-MD5 (./xakqs) = 081676dadbd1c9f40529aad093c99ae4
+MD5 (./xakqs) = d9bb7e5b7dc91f8b5c21112c0d128a41
 MD5 (./xakwj) = 5e9d00603a07adeb49d337beea986c44
 MD5 (./xcvfx) = ce05871851ab48e69b54fa18aaa6bdbb
 MD5 (./xbwkz) = 902a400864f977a95ee63a49475ba269
...

In red (the lines starting with -), we can see the hash that we should have for each chunk, and in green (the lines starting with +) we see the hash that we actually have in the corrupted file. By looking at this diff, the following observations can be made:

All corruption is indeed in 32-byte chunks
The vast majority of "bad" chunks in the corrupted files exactly match content in the original file, it's just that the 32-bytes are repeated from some other location in the source file.
The relatively few "bad" chunks in the corrupted file that cannot be found in the original file look like garbage or possibly FAT32 filesystem info
Interestingly, "bad" chunks in the corrupted file often contain data "from the future" - that is, data that only occurs later in the original file. For example, in the first diff above, 862a0484fb481016590124d798f229f3 occurs in the corrupted file as the 472nd chunk. It also occurs in both the original and corrupted files as the 20553rd chunk. This indicates that the corruption occurs sometime after the bytes have been sent over the wire, or that new bytes are overwriting previously written bytes.
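
The same split-and-fingerprint approach can be sketched in Python, which keeps the chunks in file order without relying on how find lists them. This is only an illustration under assumptions: the function names chunk_digests and classify_bad_chunks are invented here, the demo data is synthetic rather than taken from the attached files, and, like the shell version, it only recognises chunks that sit on 32-byte boundaries.

```python
import hashlib


def chunk_digests(data: bytes, size: int = 32) -> list[str]:
    """MD5 of each fixed-size chunk, in file order (the last chunk may be shorter)."""
    return [hashlib.md5(data[i:i + size]).hexdigest() for i in range(0, len(data), size)]


def classify_bad_chunks(good: bytes, bad: bytes, size: int = 32) -> tuple[list[int], list[int]]:
    """Split the indices of changed chunks into (found_in_original, not_found_in_original)."""
    good_digests = chunk_digests(good, size)
    bad_digests = chunk_digests(bad, size)
    known = set(good_digests)  # every chunk fingerprint that exists in the original
    found, missing = [], []
    for i, (g, b) in enumerate(zip(good_digests, bad_digests)):
        if g != b:
            (found if b in known else missing).append(i)
    return found, missing


# Synthetic demo data: 100 distinct 32-byte chunks.
good = b"".join(b"%031d\n" % i for i in range(100))
bad = bytearray(good)
bad[40 * 32:41 * 32] = good[8 * 32:9 * 32]  # chunk 40 replaced by a copy of chunk 8
bad[50 * 32:51 * 32] = b"?!" * 16           # chunk 50 replaced by bytes not in the original
print(classify_bad_chunks(good, bytes(bad)))  # prints ([40], [50])
```

The first list corresponds to "bad" chunks that repeat content from elsewhere in the original; the second to chunks that look like garbage.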

For Gcode Example 3.zip, there are 78 corrupted chunks. For all but 12 of those chunks, the data also occurs in the original file. The content of the other 12 chunks is:

% for i in $(cat chunks-not-in-v1); do echo "\n$i" ; cat $i ; done

v2/xbuzu
@!A!B!C!D!E!F!G!
v2/xbvac
?!?!?!?!?!?!?!?!
v2/xbvab
x!y!z!{!|!}!~!!
v2/xbvaa
p!q!r!s!t!u!v!w!
v2/xbuzw
P!Q!R!S!T!U!V!W!
v2/xbuzx
X!Y!Z![!\!]!^!_!
v2/xbuzy
`!a!b!c!d!e!f!g!
v2/xbuzt
8!9!:!;!<!=!>!?!
v2/xbuzz
h!i!j!k!l!m!n!o!
v2/xbuzs
0!1!2!3!4!5!6!7!
v2/xbvad
?!?!?!?!?!?!?!?!
v2/xbuzv
H!I!J!K!L!M!N!O!%

I have no idea what's going on here. But perhaps this will help someone with more knowledge of the PrusaLink firmware code or the FAT32 filesystem code. Regarding @mix579's comment, it does seem like a problem with the filesystem, but it's weird that the corruption is only happening when the files are being written by the MK4. Maybe using the USB stick on the Windows or Linux machine created some metadata that the MK4 can't handle?

rchiechi commented 10 months ago

I am experiencing the same issue copying over wired LAN. But I found that, with the example gcode attached (made from the example STL), I can trigger the bug by changing the filament profile to 3D fuel Pro PLA.

example_gcode.zip

Original_breaks_over_LAN.gcode is the file written locally to disk that will break when exported over LAN.
Broken_breaks_over_LAN.gcode is the file that was exported over LAN and that differs from the original that was exported to disk.
Original_does_NOT_break_over_LAN.gcode is the file that is identical whether written to disk or exported over LAN, i.e., exporting it over LAN does not result in a changed file.

example_STL.zip

I tested it back-and-forth twice each with the same positive/null result each time. I do not know if it is triggered by switching filaments, generally, and I have had other prints with the 3D Fuel Pro PLA filament setting transfer correctly as well as prusament prints that broke. I only know that the steps outlined below were reproducible at least twice.

OS/Software: Ubuntu 23.04 6.2.0-26-generic PrusaSlicer-2.6.0+snap4

Printer: MK4 kit with 5.0.0-alpha4 firmware. No third-party anything plugged in.

Steps to reproduce:

Open PrusaSlicer with default settings and load the STL file.
Change the filament profile to 3D Fuel Pro PLA.
Export G-code to disk from PrusaSlicer.
Export G-code to the MK4 over LAN from PrusaSlicer.
Unplug the USB stick from the MK4 and plug it into the computer.
Compare the md5 sums of the two files; they will be different.

Steps that do NOT trigger bug:

With PrusaSlicer still open to the same file, change the filament profile to Prusament PLA.
Export G-code to disk from PrusaSlicer.
Export G-code to the MK4 over LAN from PrusaSlicer.
Unplug the USB stick from the MK4 and plug it into the computer.
Compare the md5 sums of the two files; they will be identical.
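
For anyone who wants to script the final comparison step, here is a small Python sketch. It is an assumption about how one might do it, not the method used above: file_md5 and same_content are illustrative names, and the demo writes two throwaway files that stand in for the disk export and the copy read back from the USB stick.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, bufsize: int = 1 << 20) -> str:
    """MD5 of a file, read in pieces so large G-code files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(bufsize), b""):
            digest.update(piece)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files hash to the same MD5."""
    return file_md5(path_a) == file_md5(path_b)


# Demo with two throwaway files; the second differs in its last line.
with tempfile.TemporaryDirectory() as tmp:
    exported = os.path.join(tmp, "exported.gcode")
    from_usb = os.path.join(tmp, "from_usb.gcode")
    with open(exported, "wb") as f:
        f.write(b"G1 X10 Y10\n" * 1000)
    with open(from_usb, "wb") as f:
        f.write(b"G1 X10 Y10\n" * 999 + b"G1 X99 Y10\n")
    print(same_content(exported, exported), same_content(exported, from_usb))
```

MD5 is used only because the steps above use md5 sums; any hash from hashlib would serve equally well for spotting accidental corruption.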

antimix commented 10 months ago

Hi Michele, as a programmer, even though I have not had time to look through the source code, the issue seems related to a "rolling buffer", especially the garbage "from the future". Look for this pattern:

a pointer to a memory buffer variable (or to an array)
an index pointer based on the offset of the buffer
when bytes are received, they are added by the index pointer on the offset of the buffer.
unfortunately (check why), when the last chunk of bytes is received, it is written to the top of the buffer. This can be due to a memory segment overlap, or to incrementing and resetting the index pointer BEFORE writing the bytes into the buffer. This would explain the bytes "from the future": they are the final part of the chunks, written at the top of the buffer over the correct data, and then written to the file.

jstm88 commented 10 months ago

I've also encountered the same bug on my own MK4 running (non-alpha) firmware 4.7.1.

I have noticed (and others have commented) how slow gcode transfers via PrusaLink seem to be, so there's definitely a bottleneck somewhere. If a circular buffer is in use, a bug like this would be possible in two places: when the buffer fills, or when the buffer empties. In other words, writing too much data or reading too far ahead could both be a problem. And since we have seen cases where the out-of-place block appears to come from the future, I would suspect that the USB write is the bottleneck and data from the network overruns the current index.

I'd be on the lookout for:

An off-by-1 indexing error which allows the start/end of the buffer to overlap (and if this happens multiple times in a row, you'd see the "multiples of 32" issue)
A race condition between the network writing to the buffer and the USB code reading from the buffer (in this case, multiples could be caused by timing)
Possibly both?
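
To make the two failure modes concrete, here is a toy Python simulation. It is purely hypothetical: it does not model the firmware, and read_through_ring and its lead_at parameter are invented for this illustration. Chunks are just numbers, a writer stores chunk k in slot k % slots, and a reader takes chunk r from slot r % slots. If the reader runs ahead of the writer it sees stale data; if the writer laps the reader it sees data "from the future".

```python
def read_through_ring(n_chunks: int, slots: int, lead_at: dict[int, int]) -> list:
    """Simulate a reader consuming numbered chunks through a ring buffer.

    lead_at[r] says how far the writer is ahead when chunk r is read: the writer has
    stored chunks 0 .. r + lead - 1 by then. 1 (the default) is healthy, 0 means the
    reader got ahead of the writer, and more than `slots` means the writer lapped it.
    """
    ring = [None] * slots
    written = 0
    seen = []
    for r in range(n_chunks):
        target = min(r + lead_at.get(r, 1), n_chunks)
        while written < target:       # writer stores chunk k in slot k % slots
            ring[written % slots] = written
            written += 1
        seen.append(ring[r % slots])  # reader takes whatever is in its slot
    return seen


healthy = read_through_ring(32, 8, {})
stale = read_through_ring(32, 8, {20: 0})   # reader ahead of writer at chunk 20
future = read_through_ring(32, 8, {20: 9})  # writer one full lap ahead at chunk 20
print(healthy == list(range(32)), stale[20], future[20])  # prints True 12 28
```

In both broken runs the wrong chunk is displaced by exactly the ring size (8 slots in this demo), earlier in one case and later in the other, so the direction of the displacement in a real corrupted file hints at which side lost the race.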

I also think there needs to be some validation of the file after writing. A checksum would be the most logical approach. That clearly doesn't solve the problem, but it would make sure it was caught, and it would provide absolute certainty that a file was transferred correctly no matter what bugs may or may not exist.

antimix commented 10 months ago

In the meantime, I have just started to use my new MK4 with 4.7.2. To check whether this issue appears, I am using a FlashAir SD card instead of the USB dongle (that is, a USB card reader with a FlashAir SD inserted ;) ). Since the card is mapped as drive K: on my PC, I can immediately check with a compare app whether the transferred file on K: is the same as the one on my C: drive. So far, on 4.7.2, it has never failed.

jvasileff commented 10 months ago

Folks, apologies, there is an error with my prior analysis. I did not sort the output when generating md5s. To keep the md5s for the chunks in file order, the command:

find v1 -type f -exec md5 {} + > v1-md5s

should be

find v1 -type f -exec md5 {} + | sort > v1-md5s

For ease of analysis, split can also be adjusted to use numerical filenames:

split -d -a 10 -b 32 ../v1.gcode

The good news is that with this, the corruption is much more straightforward. For the errant chunks that also occur in the source file, the errant chunk occurs in the source file exactly 512 or 384 chunks earlier, at least for the bad chunks I analyzed. So there is no "time travel". Instead, stale data is being written. It's as if there is a race condition or invalid state on a 16 KB (512 * 32 bytes) buffer, where the file writer re-reads slightly older data.

There are also a limited number of invalid chunks that don't occur in the source document that may or may not indicate a second bug.

Here's a portion of a diff for this new, corrected analysis for the gcode 2.zip files:

--- v1-md5s     2023-08-15 11:54:53.998600412 -0400
+++ v2-md5s     2023-08-15 11:55:06.809257874 -0400
@@ -1166,22 +1166,22 @@
 MD5 (./x0000001165) = 7dcca393b45f7a4e507f11908c3ef554
 MD5 (./x0000001166) = 507bc37bbdd7edd77e9e7ad18a6ea1e6
 MD5 (./x0000001167) = 147d0690887a8876fc4d3fa95ea95eb1
-MD5 (./x0000001168) = be44c6fd054a4cc001faf5b6e8119131
-MD5 (./x0000001169) = df10e4c0f8b6e5bdf82229de819d110a
-MD5 (./x0000001170) = 5962dc85f9690e8d7ff0643c39fc02a7
-MD5 (./x0000001171) = 692d3593cc7a80020d3d6819f64778e2
-MD5 (./x0000001172) = 6697ef83d06af695819c2560e82f3a15
-MD5 (./x0000001173) = b056fb99bb4c9f8124d4931c1d26e1e4
-MD5 (./x0000001174) = 18d57ee990d53df959908d49f3255e16
-MD5 (./x0000001175) = 62e96e338b7b97f731556d1893ef98c0
-MD5 (./x0000001176) = b513792a428a2d5cfa4d718072bc5e71
-MD5 (./x0000001177) = 5fd9758e54f57a84b0a77b5bb4b1b025
-MD5 (./x0000001178) = 7544e188e2bc29c448ee3baa7d1f6ef7
-MD5 (./x0000001179) = 7f8245d257a21709e2b910fa924aa714
-MD5 (./x0000001180) = 9c5188cbd1be570412971a915af272a9
-MD5 (./x0000001181) = b0b3fbd6127fe50302266d68c29ae55f
-MD5 (./x0000001182) = 8b153042a67d9089acc13457761fd9b7
-MD5 (./x0000001183) = 1138824b3d3805fcace55be1de124f5e
+MD5 (./x0000001168) = 0b95690866a25a8827cab31847983b42     Same as chunk 656
+MD5 (./x0000001169) = 5f54e0382b210e480506a2c9880cc5f9     Same as chunk 657
+MD5 (./x0000001170) = 4dea27d6cfcabcf243430d6587ea7c8c     ...
+MD5 (./x0000001171) = 6b9a7da3102d09754ec743863ce60892     ...
+MD5 (./x0000001172) = 9876231de592c95acd0e4fc5dbc57224     ...
+MD5 (./x0000001173) = 2130c1f4077e1c93aa359220549ca678     ...
+MD5 (./x0000001174) = 2e1a377cd6eff7b007bbb86c8e1771a0     ...
+MD5 (./x0000001175) = 91c2480d8cc85e67d7ad55d653180253     ...
+MD5 (./x0000001176) = d303c53950cec87da6e41556d4c131b6     ...
+MD5 (./x0000001177) = 4a52e560ad6ae6d23c85de9f38102e25     ...
+MD5 (./x0000001178) = dd2efab2edb51b85c32122e6060f9901     ...
+MD5 (./x0000001179) = b95b7a9ca5055ed1d47ea6b5735efedf     ...
+MD5 (./x0000001180) = 2a25b4d3362a7ff1a95b77d60e7a76a0     ...
+MD5 (./x0000001181) = e1c5eb01c7ebec400c15b1703a38fc3c     ...
+MD5 (./x0000001182) = 89ddfbbf72afa0df1332b50aa3d7345c     ...
+MD5 (./x0000001183) = 709cfbbaa189e4f8af3665ff016d8145     Same as chunk 671
 MD5 (./x0000001184) = 695578208be168de26935e05618c2d4d
 MD5 (./x0000001185) = d749863b6081e46fa43ede15f4ba9a2b
 MD5 (./x0000001186) = 9f9ed60468961dba9b55c6c55b0c6257
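
Annotations like "Same as chunk 656" can be produced mechanically. The sketch below is one possible way to do it in Python, with invented names (split_chunks, displacements) and synthetic data that merely imitates the pattern in the diff above: 16 chunks starting at 1168 are overwritten with the chunks starting at 656. It counts, for every changed chunk, how many chunks back the same content last appeared in the good file.

```python
from collections import Counter


def split_chunks(data: bytes, size: int = 32) -> list[bytes]:
    """Cut data into consecutive fixed-size chunks, in file order."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def displacements(good: bytes, bad: bytes, size: int = 32) -> Counter:
    """Count, for every changed chunk, how many chunks back its content last appeared in `good`.

    Changed chunks whose content never appears earlier in `good` are counted under None.
    """
    good_chunks = split_chunks(good, size)
    bad_chunks = split_chunks(bad, size)
    positions: dict[bytes, list[int]] = {}
    for j, chunk in enumerate(good_chunks):
        positions.setdefault(chunk, []).append(j)  # indices are stored in ascending order
    histogram = Counter()
    for i, (g, b) in enumerate(zip(good_chunks, bad_chunks)):
        if g == b:
            continue
        earlier = [j for j in positions.get(b, []) if j < i]
        histogram[i - earlier[-1] if earlier else None] += 1
    return histogram


# Synthetic demo data: 2000 distinct 32-byte chunks, then imitate the diff above.
good = b"".join(b"%031d\n" % i for i in range(2000))
bad = bytearray(good)
bad[1168 * 32:1184 * 32] = good[656 * 32:672 * 32]
print(displacements(good, bytes(bad)), 512 * 32)  # prints Counter({512: 16}) 16384
```

A histogram dominated by one or two values, such as 512 and 384, is what points at a fixed-size buffer: 512 chunks of 32 bytes is 16384 bytes.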

jstm88 commented 10 months ago

Just as a suggestion to others, here's what I'm going to do while this is being worked out:

Export the gcode to a local file
Upload via the PrusaLink web interface
Download a copy from the web interface
Hash both files and confirm a match
Print

Of course, the download code may also have a similar flaw, so it will be interesting to see if this gives any false positives. But with this method, we can at least guarantee that there's no risk of damaging the printer, and we can still have the convenience of not having to shuttle USB drives back and forth (and put stress on the very tight USB port the MK4 has).

I would also like to see PrusaSlicer itself perform this check (i.e. download the file and compare hashes before starting the print). It could even be applied to OctoPrint connections as an extra safety check. Not that I've ever seen corruption with OctoPrint, but it never hurts to have extra verification...

abjugard commented 10 months ago

I haven't done the work to assess if I have experienced this issue myself, but from reading about it, @Prusa-Support if you're unsure where to start looking for problems, I'm gonna go ahead and guess this has to do with the frankly terrible network to USB file transfer code. Get a networking engineer to rewrite that for you and I bet the issue will disappear in the process.

vorner commented 10 months ago

Reading through the comments in here again, I'll re-check one thing.

@jrgiacone Are you 100% sure you've managed to get a corrupt transfer from Connect too?

My first thoughts at the time were pointing to certain newly introduced code (which I'll admit I'm not really proud of, it was written a bit in desperation). The further comments here would actually strengthen the feeling. Except the code is not being used at all in the Connect transfer, only in Link, so I've ignored that feeling.

jrgiacone commented 10 months ago

@vorner I believe I have; however, I've only tried Connect a few times, so I could have misremembered. The majority of the issues have come from using PrusaSlicer to upload directly to PrusaLink on the physical printer.

With regards to connect I could be remembering wrong, I stopped using it because it would take way too long to upload any files. So I would say I am not 100%

jstm88 commented 10 months ago

I wrote a quick (i.e. "bare minimum") Python script which uploads the file, then downloads it, hashes the download, and compares it to the original. The download and comparison happen in memory so there's no need to clutter up your downloads directory with copies. If you wish you can automate it to run the script whenever a file is dropped in a specific directory.

I wanted to make it delete the uploaded file if it's found to be corrupt, but even when it's not set to print on upload, the printer loads the latest file uploaded and that makes it give an error when attempting to delete it since it considers the file to be in use. So no matter what you need to walk over to the printer and cancel it. The script also lacks some proper robust error handling but it's "good enough". Feel free to adapt to your needs.

#!/usr/bin/env python3

import os
import sys
import json
import hashlib
import requests
import urllib.parse

PRINTER_URL="http://..."
API_KEY="..."

src_path = os.path.expanduser(sys.argv[1])
src_filename = os.path.basename(src_path)

with open(src_path, 'rb') as f:
    src_rawdata = f.read()

src_hash = hashlib.sha256()
src_hash.update(src_rawdata)

try:
    put_response = requests.put(
        url="{}/api/v1/files/usb/{}".format(PRINTER_URL, urllib.parse.quote(src_filename)),
        headers={"X-Api-Key": API_KEY, "Content-Type": "text/x.gcode", "Overwrite": "0"},
        data=src_rawdata)
except requests.exceptions.RequestException:
    print("HTTP Request failed")
    sys.exit(1)
if (put_response.status_code != 201):
    print("Upload error")
    sys.exit(1)

print("Successful Upload!")

put_rdict = json.loads(put_response.text)
dl_path = put_rdict.get('refs').get('download')
try:
    dl_response = requests.get(
        url="{}{}".format(PRINTER_URL, dl_path),
        headers={"X-Api-Key": API_KEY})
except requests.exceptions.RequestException:
    print("HTTP Request failed")
    print("File not validated")
    sys.exit(1)

if (dl_response.status_code != 200):
    print("Could not validate file")
    sys.exit(1)

dl_hash = hashlib.sha256()
dl_hash.update(dl_response.content)

print("SRC: {}".format(src_hash.hexdigest()))
print("DST: {}".format(dl_hash.hexdigest()))
print("")

if (src_hash.digest() == dl_hash.digest()):
    print("Files validated!")
else:
    print("UPLOADED FILE CORRUPTED")
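
For the "run the script whenever a file is dropped in a specific directory" idea, a simple polling loop is one option. This is a hedged sketch rather than part of the script above: poll_once and watch are invented names, and it assumes the upload script is installed as a command (for example under the name prusasend) that takes the G-code path as its only argument. A file is handed over only after its size has stayed the same across two polls, as a rough guard against picking up a file that is still being written.

```python
import os
import subprocess
import time


def poll_once(directory: str, sizes: dict[str, int], done: set[str]) -> list[str]:
    """Return .gcode paths whose size has not changed since the previous poll."""
    ready = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".gcode") or name in done:
            continue
        path = os.path.join(directory, name)
        size = os.path.getsize(path)
        if sizes.get(name) == size:
            done.add(name)      # stable across two polls: assume writing has finished
            ready.append(path)
        else:
            sizes[name] = size  # new or still growing: check again on the next poll
    return ready


def watch(directory: str, command: list[str], interval: float = 2.0) -> None:
    """Poll forever and run `command <path>` for each newly completed file."""
    sizes: dict[str, int] = {}
    done: set[str] = set()
    while True:
        for path in poll_once(directory, sizes, done):
            subprocess.run(command + [path], check=False)
        time.sleep(interval)
```

It could be started with something like watch(os.path.expanduser("~/gcode-outbox"), ["prusasend"]), where the directory name is just a placeholder.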

jvasileff commented 10 months ago

@jstm88 Have you had any failures? It seems that for those affected, the failures happen fairly often, but most people have no failures at all. It makes me wonder if there are different buddy board versions that run different code paths.

If corruption is reproduced with your script, it should be useful for troubleshooting as it could be tried in different environments.

jstm88 commented 10 months ago

@jvasileff I haven't noticed any more failures.

I actually ran several dozen prints on the MK4 after I built it without seeing any obvious problems. Those gcode files were all uploaded directly from PrusaSlicer, so the file on the printer would have been the only copy with nothing to validate against. It's possible there were corruptions but they weren't significant enough to show up in the prints.

I first noticed the problem on a relatively large one-off print and I saw a few of the external zits just near the top. I don't have the file any more but inspecting the print it definitely looks like they all happened towards the end of the file. The very next print I uploaded after that one had corruption throughout the entire file.

For what it's worth, I re-uploaded that specific file several times using my script and there was no corruption.

This actually has me wondering if the likelihood of corruption increases as the storage fills up. I'm using a 64GB Samsung drive so it would take a long time to fill it up, but I wonder if there might be some problem with higher block offsets, or potentially issues that happen as the allocation table grows. We'd need a lot more data to determine if that has anything to do with it. In other words, has this ever happened on an empty drive, and does it become more likely if there is more data on the drive? I can't say at this point. If we can reproduce it on an empty drive we could disprove the theory, at least.

rchiechi commented 10 months ago

FWIW all of the corruptions that I encountered have been from a 256 GB drive that is nowhere near full. I also had fewer problems with smaller (in file size / print time) prints, where the corruption is only noticeable as some small flaws in a couple of layers.

jvasileff commented 10 months ago

This actually has me wondering if the likelihood of corruption increases as the storage fills up. I'm using a 64GB Samsung drive so it would take a long time to fill it up, but I wonder if there might be some problem with higher block offsets, or potentially issues that happen as the allocation table grows.

Yeah, or if write sizes vary with fragmentation.

vorner commented 10 months ago

There seems to be a lot of speculation in here… so, let me share something from what we have.

There are not different code paths depending on board version regarding the networking code.

We have a working theory (very much unconfirmed yet). If it is correct, the bug is very timing dependent. This involves both the speed of USB (which does change with fragmentation, yes) and the „effective“ wifi speed. This seems to be the reason why some people can reproduce it often while others never see a corrupt file (we still haven't been able to reproduce despite bombarding printers with hundreds of transfers).

An interesting experiment could be to move the printer further away from the router, or put some interference between them, simply to see if the change of timing leads to the bug disappearing.

rchiechi commented 10 months ago

An interesting experiment could be to move the printer further away from the router, or put some interference between them, simply to see if the change of timing leads to the bug disappearing.

I got many corrupted files transferring over a 1Gb wired Ethernet switch. It is a 24 port Ubiquiti switch (not a cheap no-name) and the printer and computer are each connected by less than two meters of cat-6 cable. No WiFi involved.

Are you saying that if the network connection is too fast it might trigger the bug? If so, I can give it a go on the WiFi and connect the printer to an AP on the other side of the house.

jstm88 commented 10 months ago

I can confirm my corrupted files also occurred with the printer connected via a 1Gb wired connection.

The upload speeds are extremely slow. The printer is the bottleneck, although it's unclear if it's the networking code, USB code, or the USB hardware itself. Local network transfers with other file servers are as fast as expected (near gigabit) and this also applied to OctoPrint when I was using it.

rchiechi commented 10 months ago

The upload speeds are extremely slow. The printer is the bottleneck, although it's unclear if it's the networking code, USB code, or the USB hardware itself. Local network transfers with other file servers are as fast as expected (near gigabit) and this also applied to OctoPrint when I was using it.

Same for me: transfer speeds to the printer are far slower than they should be even for the 100Mb Ethernet interface on the printer. But I haven't tried OctoPrint, just PrusaSlicer and via Python requests.

vorner commented 10 months ago

No WiFi involved.

OK, then our working theory is not the right one :-|

vorner commented 10 months ago

Few more things I'd like to confirm:

Each corruption is „unique“ ‒ that is, if the same file is transferred twice and both cases are broken, they are broken at different places (or sometimes not broken).
This problem is specific for this version. The stable version (4.7.something) doesn't suffer from this bug. Have you also tried the newest one (5.0.0-RC, I assume it would still be present there)?
Would anyone be able & willing to get a PCAP from the transfer? If we had a pair of PCAP and the corresponding broken file, we could try correlating them ‒ see if there are some interesting things happening around the time of the broken part (duplicate packets, retransmissions, …).

As for the speed of the transfer, I don't think I should be saying much before we have anything to show about it, but we are working on that one.

rchiechi commented 10 months ago

@vorner Personally I have only had corrupted files on the 5.0.0-alpha4 when transferring from PrusaSlicer. I am now on the 5.0.0-RC and have been shuffling files to the printer by moving the thumb drive back-and-forth between the PC and the printer to avoid potential corruption. But I will try to find some time this weekend to reproduce the corruption. Do you want a packet capture of the whole transfer? If you have some tips for (I assume) tcpdump switches that would be great.

vorner commented 10 months ago

Thank you.

Well, I want the part that breaks the gcode (if there's some), but I assume just sending the whole transfer / all communication with the printer during that time is much easier to provide. It shouldn't include anything private (the authentication to the printer uses digest-md5, so it should not leak the password, but if you're worried about that, it's possible to generate a new one in the menu).

I usually use something like tcpdump -w printer.pcap host ip.of.the.printer or wireshark with the same filter (host ip.of.the.printer).

jstm88 commented 10 months ago

@vorner

This problem is specific for this version. The stable version (4.7.something) doesn't suffer from this bug. Have you also tried the newest one (5.0.0-RC, I assume it would still be present there)?

I've been running 4.7.1 the whole time and that's when the corruption occurred. I'll be updating to 4.7.2 soon (although the release notes don't indicate any networking/storage changes there).

antimix commented 10 months ago

I am on 4.7.2 (and the MK4 is connected via wireless) and I have never had corrupted files. So the issue should be specific to 5.0.

jstm88 commented 10 months ago

Packet capture of the corruption happening in real time. Firmware 4.7.1.

I used my Python upload script for this:

❯ prusasend ~/MK4\ GCODE/RR\ Mini\ Revolver\ v1_0.4n_0.07mm_PETG_MK4_24m.gcode
Successful Upload!
SRC: 79a37fe30848bb60a237d350eb5c35a0f4a32922be549d6f2ec7924f48d78ec9
DST: 229a717298f1e7798606958eca9e5733922b2dff435ee80a590b5bb8fd70fe2c

UPLOADED FILE CORRUPTED

PCAP and Before/After GCODE files (the one with the timestamp is the corrupted one) mk4-upload-corruption.pcap.zip gcodes.zip

In the PCAP, there are lots of retransmissions and many are marked "TCP Window Full" by Wireshark. Should be plenty to prove that's where the corruption is happening.
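For anyone who wants a rough retransmission count without opening Wireshark, here is a small sketch using only the Python standard library. It is a crude heuristic, not Wireshark's analysis: it assumes a classic pcap file (what `tcpdump -w` normally writes, not pcapng), Ethernet framing and IPv4, and it counts a data segment as a retransmission when the same source, destination, ports and sequence number have already been seen. `count_tcp_retransmissions` is a hypothetical name.

```python
import struct


def count_tcp_retransmissions(pcap_bytes: bytes):
    """Return (data_segments, repeated_segments) for a classic pcap capture."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"          # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"          # file written on a big-endian host
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    seen = set()
    total = repeated = 0
    offset = 24               # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        _, _, incl_len, _ = struct.unpack(endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue          # too short, or not IPv4
        ip = frame[14:]
        if ip[9] != 6:
            continue          # not TCP
        ihl = (ip[0] & 0x0F) * 4
        # Total length can be 0 with segmentation offload; fall back to frame size.
        ip_total = struct.unpack("!H", ip[2:4])[0] or len(ip)
        tcp = ip[ihl:ip_total]
        if len(tcp) < 20:
            continue
        sport, dport, seq = struct.unpack("!HHI", tcp[:8])
        payload_len = len(tcp) - (tcp[12] >> 4) * 4
        if payload_len <= 0:
            continue          # pure ACKs etc. carry no data
        total += 1
        key = (ip[12:16], ip[16:20], sport, dport, seq)
        if key in seen:
            repeated += 1
        seen.add(key)
    return total, repeated
```

Calling it on the bytes of a capture, for example `count_tcp_retransmissions(open("printer.pcap", "rb").read())`, gives a quick number to compare between a clean upload and a corrupted one.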

Also you can see about halfway through the PCAP, the file finishes uploading and then the file begins being downloaded by my script. That download process proceeds without any issues, and the resulting hash is accurate to the file as stored on the drive. That confirms that downloading shouldn't be introducing any further corruption.

I also tried several times in a row. Corrupted every single time, with different sha256 hashes each time.
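The SRC/DST comparison above can be reproduced without any network code, for example on a file copied back from the USB drive. This is a minimal sketch and not the actual prusasend.py; `sha256_of` and `verify_copy` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large gcode files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, copy_path):
    """Return (matches, source_hash, copy_hash) for the two files."""
    src = sha256_of(source_path)
    dst = sha256_of(copy_path)
    return src == dst, src, dst
```

A mismatch only says that the two files differ somewhere; finding where still needs a byte-level comparison.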

rchiechi commented 10 months ago

@jstm88 I spent all day yesterday trying to trigger the bug on the 5.0.0-RC firmware and couldn't do it even after dozens of uploads.

I am curious if you also see glitches in the web interface. On the 5.0.0-alpha4 firmware (the one that produced corrupted gcode for me) the list of files on the web interface (on Firefox, Chrome, Edge and Safari / Linux and MacOS) would constantly refresh such that I could not scroll or click on any items. That problem is gone on the 5.0.0-RC firmware and apparently so is the gcode corruption---or at least it has become so rare that I can no longer observe it.

mallek commented 10 months ago

I am also getting this same behavior. I am using WiFi, and the printer is within 3 meters of the WiFi router. I saw the same behavior on 5.0 RC and then downgraded to 4.7.2. I get this corruption on every upload via PrusaLink.

PC_dragon_nice_1_0.4n_0.2mm_PLA_MK4_9h14m.zip USB_dragon_nice_1_0.4n_0.2mm_PLA_MK4_9h14m.zip

I can do a PCAP if you would like, to help identify the error.

I posted on Reddit showing a video of what is happening in the print with the faulty gcode https://www.reddit.com/r/prusa3d/comments/15wiz17/random_movement_in_xy_axis_in_middle_of_print/

rchiechi commented 10 months ago

I hit the bug on 5.0.0-RC. Sorry I did not get a packet capture, but here is the source.gcode file and the broken dest.gcode file.

broken in 5-RC gcode.zip

edit: the 'source.gcode' file in that zip archive reliably triggered the bug three times in a row, so I was able to get a pcap.

jstm88 commented 10 months ago

So over the last few days I did some experiments, and I've found something very interesting. My key observation is that I have actually been unable to reproduce the problem using the original, Prusa-supplied USB drive.

The drive I have been using is a 64GB Samsung drive, FAT formatted. I tested the drive for integrity and it is perfectly fine. However, I tried filling both drives by repeatedly uploading files. The original drive would not show the problems even though it had a lot more data. That got me thinking, so what I did was do a bit-level clone of the 16GB Prusa-supplied drive onto the first 16GB of the Samsung drive, effectively copying everything I can about the formatting/layout of the original drive. Since then I have had no issues.

I have a couple theories:

The formatting on the 64GB drive may have some slight variation that the MK4 was not happy about. I haven't investigated this yet, but it's something to keep in mind. The FAT filesystem isn't particularly complex and I don't know what this may theoretically be. The formatting was done on a Mac, which does support formatting a 64GB drive as FAT. The documentation does say that Windows cannot do this, but it doesn't say it's not supported by the printer.
Leading to the other theory... that the printer simply does not support 64GB FAT volumes. Again, the Prusa documentation says that Windows and some OS X versions can't do it, but it suggests using a 3rd party tool, not that a >32GB drive can't be used.

I'll continue monitoring and see if the issue ever reappears with the reformatted "16GB" Samsung drive. It would be interesting to see if anyone has had these issues on a 16GB drive, and especially if anyone can confirm the issue on the Prusa-supplied drive.

Also: I updated my Python upload script to include statistics (so now you can be reminded just how slow the upload is!) and gave it a little better error handling. I also created a quick shell script to go along with it that selects a gcode file from a directory using fzf and sends it off, then archives the file only if it uploads successfully.

prusasend.py: https://gist.github.com/jstm88/f25cffcd0c5a22e564fe8fe20114d3ad
psend.sh: https://gist.github.com/jstm88/5cfc5726785e0a94554ddf19ce02293b

flyingkillerspacepixies commented 10 months ago

I'd say +1 for the theory that it's something to do with formatting or capacity of non-prusa flash. I didn't get file corruption but I got something else pretty nasty that might be related.

I never had any issues with the prusa supplied flash drive, and when I saw jstm88's post I figured I'd try an old 64GB Usb 3.0 sandisk drive I had lying around. I formatted it in arch linux using the most basic partition table (mbr table, 1 partition, full drive usage) and mkfs.fat -F32. Then I uploaded a gcode and started printing.

My print stopped halfway through, the nozzle cooled down but the bed remained heated and all the menus became REALLY slow as if there was a memory leak (I took a video, but I think you can probably get the picture when I say it took a full 35 seconds from when I clicked home for it to actually draw the home screen). The networking on the printer also died at this time. Every menu had the same issue, and it even seemed to struggle to draw the temperature updates for the bed.

The moment (not an exaggeration) I unplugged the flash drive everything went back to normal. All menus became perfectly responsive again, the networking worked, etc. Needless to say, I'm going to stick with the prusa flash drive for now.

(also I have a pcap of the whole incident and a video if prusa really wants one or both but it'll take effort to trim them down and I doubt either will be much help so I'm not going to go out of my way to do it just yet)

mallek commented 10 months ago

I would like to add that I too was using a 128GB Samsung flash drive, of which I formatted the first 64GB as FAT32.

rchiechi commented 10 months ago

I can't test the Prusa USB stick because it failed the first time I used it. I was trying to set up WiFi by creating a text file with credentials as per the instructions, but the text file became corrupted every time I plugged it back in to the printer. I tried formatting it, but it bricked, no longer recognized as storage. The printer also completely refused to print until I re-ran the initial calibration with a different USB stick plugged in to the printer.

Since then I've used a 256gb Samsung that I formatted with parted. It works perfectly if I copy files by hand. I only see corruption on network transfer, but, as noted, sporadically.

antimix commented 10 months ago

Very interesting discovery @jstm88! I have just realized that I am using the FlashAir SD without problems, but it is 16GB, so maybe 32GB or less works well. Nobody has mentioned it, but it may be important: when I copied the files from the Prusa USB to the FlashAir (using Beyond Compare, copying files, not cloning the image) I noticed that there were a lot of hidden files and folders, as if the USB were meant to be used on some sort of micro Linux system. So I decided to copy all those system and hidden files and folders to the SD as well, and since I had basically the whole structure from the Prusa USB, it has always worked with no issues.

Did you all also copy the hidden files and folders? If you just format the USB, that structure is missing. So it could be worth formatting the 64GB USB as FAT32, then copying the whole file structure, and seeing what happens. Maybe we can identify whether the issue is in the FAT format or in the missing file structure.

flyingkillerspacepixies commented 10 months ago

@antimix what hidden files are you talking about? all the hidden files on my drive are basically just indications that the people at prusa making the flash drives use apple computers and use opera as their web browser ;)

I removed most of them from my drive a while ago and didn't notice any issues. Here's my listing of the files/directories in the root:

.fseventsd
._Mini Sandy Buggy
.Spotlight-V100
.TemporaryItems
.Trashes
._DualColor_Keychain_0.4n_0.2mm_PLA_MK4_16m.gcode
._Filament_Guide_0.4n_0.2mm_PLA_MK4_1h38m.gcode
._Keychain_0.4n_0.2mm_PLA_MK4_16m.gcode
._MK4_firmware_4.6.4.bbf
._MK4 models authors.txt
._Robo_Alpaka_0.4n_0.2mm_PLA_MK4_7h46m.gcode
._Rocket_Engine_0.4n_0.2mm_PLA_MK4_13h17m.gcode
._Spatula_Printables_0.4n_0.2mm_PLA_MK4_45m.gcode

flyingkillerspacepixies commented 10 months ago

The formatting of my prusa flash drive (or at least the partition table) is pretty interesting now that I look at it. I'm not sure if this is just OSX being weird (since apparently that's what prusa uses in making these) or something else but:

It has the bootable flag enabled (what? why?)
It starts at sector 128, that's probably fine given the medium/constraints, but it's still sort of weird to see it using something which is less than 2048.

Device     Boot Start      End  Sectors  Size Id Type
/dev/sdb1  *      128 31948799 31948672 15.2G  c W95 FAT32 (LBA)

I don't have Windows or OSX on hand to test this behavior against, but Android formats drives without the bootable flag and puts the start sector at 2048 like I would expect. I've never seen the bootable flag applied on new flash memory.
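For comparing drives without fdisk, the same fields can be read straight from the first sector. A minimal sketch, assuming a classic MBR (not GPT) with 512-byte sectors; `parse_mbr_partitions` is a hypothetical helper name:

```python
import struct


def parse_mbr_partitions(sector0: bytes):
    """Decode the four primary partition entries of a classic MBR."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature, probably not an MBR")
    partitions = []
    for index in range(4):
        # The partition table starts at byte 446; each entry is 16 bytes.
        entry = sector0[446 + 16 * index:446 + 16 * (index + 1)]
        status, ptype = entry[0], entry[4]
        start, count = struct.unpack("<II", entry[8:16])  # LBA start, sector count
        if ptype == 0:
            continue  # unused slot
        partitions.append({
            "bootable": status == 0x80,
            "type": ptype,          # 0x0c is "W95 FAT32 (LBA)"
            "start": start,
            "sectors": count,
        })
    return partitions
```

On Linux the input would be the first 512 bytes of the whole device (for example `/dev/sdb`, read with sufficient privileges). For the drive above this should report a bootable entry of type 0x0c starting at sector 128 with 31948672 sectors.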

In other news, the strange behavior I described earlier is consistent and actually this time I caught it before the nozzle was turned off. It seems actually like the print just auto cancels after enough time has passed with it not going further. Before I had just caught it before the insane lag would allow it to turn off the bed I guess. Perhaps it's basically all just symptoms of memory corruption in the filesystem handling code?

I also forgot to mention during that last test I copied all the hidden files from the prusa drive over to the 64GB drive.

rchiechi commented 10 months ago

I tested a network upload using a gcode file that has reliably been triggering the bug. I used two USB sticks, both formatted MBR / FAT32.

Uploading to a 256GB thumb drive formatted using gparted results in a broken gcode file. Uploading to a 16 GB thumb drive formatted using the MacOS disk utility results in a successful upload!

Sorry to change two variables, but I'm in the middle of a long print and can't retry with the 16 GB stick formatted using gparted or vice versa at the moment. But the observation certainly supports the hypothesis that the corruption depends on the thumb drive.

EDIT: I reformatted the 256GB drive with MacOS Disk Utility and retried the upload. It was successful, but I again reformatted it with gparted, retried the upload and it was still successful. So, at least in my case, formatting the drive with Linux/gparted vs MacOS/Disk Utility and/or changing between 16GB and 256GB does not have an immediately reproducible effect on the bug.

jseyfert3 commented 10 months ago

Just chiming in that I also don't think it's due to any hidden files or folders. My Prusa drive failed a couple weeks ago. I've been using this really cheap keychain USB drive that I normally carry on my keychain. It's super slow, only 8 GB, and I don't think I've ever formatted it myself. I had no issues with it, but I only used PrusaLink once or twice. I've used PrusaConnect a lot more, and the majority of the time I just transfer the USB back and forth because the printer is right next to my computer and if I have a large G-code I don't want to wait for it to transfer. Unless my printer is already printing, then I'll use PrusaConnect to transfer newly sliced files as it prints since I can't unplug the USB drive.

I don't think I have a USB drive bigger than 32 GB or I'd try it and see if I too got corruption on a large drive but not a smaller drive. I do have a 2 TB SSD, but it's formatted with Linux Mint as a bootable OS and I'm not sure I want to bother moving partitions around just to use it as a bug test device.

jvasileff commented 10 months ago

FWIW, I've been using a macOS formatted SanDisk 128GB usb drive without any problems.

While anything is possible, I'm a bit suspicious that disk size matters, since 1) I assume well tested third party code is used to read and write the FAT32 filesystem, and 2) the corruption happens in small chunks, but my hunch is that overflow bugs in the filesystem code would produce more significant corruption.

That being said, one test might be to manually partition to force the start of the partition to be towards the end of the drive, to exercise the code that writes to the block device. Another test would be to partition normally, but then mostly fill the disk (on a pc) with large files, hoping to push further writes towards the end of the filesystem.
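The second test (mostly filling the disk on a PC) could be scripted roughly as follows. This is only a sketch: `write_filler_files` is a hypothetical helper, the file names are arbitrary, and the default file size is chosen to stay well under FAT32's 4 GiB per-file limit.

```python
import os


def write_filler_files(directory, total_bytes, file_size=256 * 1024 * 1024,
                       chunk_size=1024 * 1024):
    """Write filler files totalling total_bytes into directory; return their paths."""
    block = os.urandom(chunk_size)   # one random block, reused for speed
    paths = []
    remaining = total_bytes
    index = 0
    while remaining > 0:
        size = min(file_size, remaining)
        path = os.path.join(directory, f"filler_{index:04d}.bin")
        with open(path, "wb") as handle:
            written = 0
            while written < size:
                step = min(len(block), size - written)
                handle.write(block[:step])
                written += step
        paths.append(path)
        remaining -= size
        index += 1
    return paths
```

The amount to write can be derived from `shutil.disk_usage(mount_point).free` minus whatever free space should remain for the test uploads.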

Aside from size, it seems like the state of the filesystem and USB drive might matter, since that might affect timing and what happens in the code that reads from the network and writes to the drive. Testing with an old/cheap usb drive or SD card with a USB adapter might be useful, as would testing with a heavily fragmented filesystem.

Time permitting, I may try some of these things next week.

prusa3d / Prusa-Firmware-Buddy

[BUG] Prusa Connect / Link Gcode Corruption on OTA gcode upload #3156