openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/
Other
10.27k stars 2.1k forks source link

ztex: libusb error followed by Timeout when uploading firmware to ~12+ devices #2600

Open roycewilliams opened 7 years ago

roycewilliams commented 7 years ago

During the firmware upload that happens after initial power-up, if more than about 12 boards (it seems to vary) are present, a segmentation fault occurs:

SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded SN XXXXXXXXXX: firmware uploaded Segmentation fault

The workaround is to simply re-run john with the same parameters again. The upload picks up where it left off, uploading firmware to any remaining devices, and then proceeding to upload bitstreams and start the job.

This is speculation, but it's almost as though there's a problem with uploading firmware to more than a certain specific number of devices in a row, or after a certain amount of data is moved through the process, or a certain amount of processing or wallclock time, etc.

If I had a much larger cluster, my guess is that the second run would segfault after another 12 or so devices, and the workaround would need to be reapplied until the last group of devices falls under the segfault threshold, at which point work would begin.

This happens for both descrypt-ztex and bcrypt-ztex.

solardiz commented 7 years ago

@Apingis, this issue is yours to handle, thanks!

@magnumripper How can we assign issues to @Apingis? Perhaps you need to add him as a member or something? (I'm still not familiar with GitHub.)

claudioandre-br commented 7 years ago

To assign a task to someone, he or she has to be a Collaborator, which means to have push access to the repository. That said, maybe is it time to create the Openwall organization and have a more granular control?

magnumripper commented 7 years ago

From the looks of it, I invited him as collaborator long ago but he never responded. I guess he somehow missed it. I'll try to re-invite.

solardiz commented 6 years ago

Can we somehow simulate this problem with fewer devices? Maybe hack the code to forget that firmware has already been uploaded to a device, and to re-upload a few times? Otherwise chances are this issue won't be investigated nor fixed, since it appears that Royce is the only one here with a cluster large enough to trigger the issue.

roycewilliams commented 6 years ago

I can narrow this down if needed. IIRC it was around the 12-board mark, but that was a while ago and the code may have changed since then.

solardiz commented 6 years ago

Royce, the number 12 (or whatever) isn't as helpful as a gdb backtrace on a debugging build would be. Thanks.

solardiz commented 5 years ago

As I recall discussing this issue with @Apingis via e-mail, we couldn't see a bug in our code, and suspected it's a libusb issue. We need a gdb backtrace or the like to proceed to investigate this further.

Meanwhile, as I understand @roycewilliams is happy with the workaround (simply re-running the command, which is only needed the first time after powering-on the boards), and no one else has clusters of this size (e.g., I only used up to 4 boards at once).

roycewilliams commented 5 years ago

Apologies for the delay on this. Here is what I found. I have also enabled ZTEX_DEBUG=1.

Naively, perhaps it is an ARM-specific include issue? This might explain why it was not easy to recreate the issue, as most people may not be using a Raspberry Pi to drive their ZTEXes.

$ gdb ./john
GNU gdb (Raspbian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./john...done.

(gdb) run --progress-every=60 --format=bcrypt-ztex --wordlist=[/path/to/]rockyou.txt.7440 [hashlist]

[...]
ztex_scan_new_devices: SN 23 productId: 10.15
ztex_scan_new_devices: USB 221a 0100
ztex_scan_new_devices: SN 24 productId: 10.15
ztex_scan_new_devices: USB 221a 0100

Thread 1 "john" received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/arm/armv6/strlen.S:26
26      ../sysdeps/arm/armv6/strlen.S: No such file or directory.
(gdb) where
#0  strlen () at ../sysdeps/arm/armv6/strlen.S:26
#1  0x76a72690 in _IO_vfprintf_internal (s=0x7effcac8, format=0x4b0de4 "ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber): %s\n", ap=...) at vfprintf.c:1637
#2  0x76a730e8 in buffered_vfprintf (s=s@entry=0x76b6ccb0 <_IO_2_1_stderr_>, format=<optimized out>, args=...) at vfprintf.c:2325
#3  0x76a709cc in _IO_vfprintf_internal (s=0x76b6ccb0 <_IO_2_1_stderr_>, format=0x4b0de4 "ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber): %s\n", ap=...) at vfprintf.c:1293
#4  0x00323230 in ztex_error ()
#5  0x00323a50 in ztex_device_new ()
#6  0x00323c30 in ztex_scan_new_devices ()
#7  0x00324af0 in ztex_scan ()
#8  0x00324ddc in ztex_init_scan ()
#9  0x0031e234 in device_init_scan ()
#10 0x003218e4 in jtr_device_list_init ()
#11 0x0031eb8c in device_format_reset ()
#12 0x002a3eb0 in fmt_self_test_body (full_lvl=<optimized out>, db=0x4742050, salt_copy=0x11dd68bc, binary_copy=0x11dd68ac, format=0x528178 <fmt_ztex_bcrypt>) at formats.c:509
#13 fmt_self_test (format=0x528178 <fmt_ztex_bcrypt>, db=db@entry=0x4742050) at formats.c:1607
#14 0x002ad2bc in john_run () at john.c:1738
#15 main (argc=<optimized out>, argv=<optimized out>) at john.c:2117
(gdb) list
21      in ../sysdeps/arm/armv6/strlen.S
(gdb)
solardiz commented 5 years ago

Thank you for following up on this, Royce!

perhaps it is an ARM-specific include issue?

Perhaps not. Please try the below patch, which should change the segfault into an error message and will hopefully help us expose the real issue:

+++ b/src/ztex/ztex.c
@@ -28,7 +28,11 @@

 int ZTEX_DEBUG = 0;

-void ztex_error(const char *s, ...) {
+#ifdef __GNUC__
+__attribute__ ((format (printf, 1, 2)))
+#endif
+static void ztex_error(const char *s, ...)
+{
        va_list ap;
        va_start(ap, s);
        vfprintf(stderr, s, ap);
@@ -81,7 +85,7 @@ int ztex_device_new(libusb_device *usb_dev, struct ztex_device **ztex_dev)
        *ztex_dev = NULL;
        struct ztex_device *dev = malloc(sizeof(struct ztex_device));
        if (!dev) {
-               ztex_error("malloc: unable to allocate %d bytes\n", sizeof(struct ztex_device));
+               ztex_error("malloc: unable to allocate %u bytes\n", (unsigned int)sizeof(struct ztex_device));
                return -1;
        }

@@ -120,7 +124,7 @@ int ztex_device_new(libusb_device *usb_dev, struct ztex_device **ztex_dev)
        result = libusb_get_string_descriptor_ascii(dev->handle, desc.iSerialNumber,
                        (unsigned char *)dev->snString_orig, ZTEX_SNSTRING_LEN);
        if (result < 0) {
-               ztex_error("ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber): %s\n",
+               ztex_error("ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns %d (%s)\n",
                                result, libusb_error_name(result));
                ztex_device_delete(dev);
                return result;
@@ -133,7 +137,7 @@ int ztex_device_new(libusb_device *usb_dev, struct ztex_device **ztex_dev)
        result = libusb_get_string_descriptor_ascii(dev->handle, desc.iProduct,
                        (unsigned char *)dev->product_string, ZTEX_PRODUCT_STRING_LEN);
        if (result < 0) {
-               ztex_error("ztex_device_new: libusb_get_string_descriptor_ascii(iProduct): %s\n",
+               ztex_error("ztex_device_new: libusb_get_string_descriptor_ascii(iProduct) returns %d (%s)\n",
                                result, libusb_error_name(result));
                ztex_device_delete(dev);
                return result;
@@ -618,8 +622,8 @@ int ztex_upload_bitstream(struct ztex_device *dev, FILE *fp)
        int endpointHS = settings[0];
        int interfaceHS = settings[1];
        if (endpointHS <= 0 || interfaceHS < 0) {
-               ztex_error("SN %s: invalid HS Fpga Settings\n",
-                               dev->snString, result, libusb_error_name(result));
+               ztex_error("SN %s: invalid High-Speed FPGA settings %d, %d\n",
+                               dev->snString, endpointHS, interfaceHS);
                return -1;
        }

@@ -683,7 +687,7 @@ int ihx_load_data(struct ihx_data *ihx_data, FILE *fp)
        rewind(fp);
        char *file_data = malloc(file_size);
        if (!file_data) {
-               ztex_error("ihx_load_data: malloc(%d) failed\n", file_size);
+               ztex_error("ihx_load_data: malloc(%ld) failed\n", file_size);
                return -1;
        }

The added attribute exposed the following warnings, which further hunks of this patch fix:

ztex.c: In function 'ztex_device_new':
ztex.c:88:69: warning: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Wformat]
ztex.c:128:5: warning: format '%s' expects argument of type 'char *', but argument 2 has type 'int' [-Wformat]
ztex.c:128:5: warning: too many arguments for format [-Wformat-extra-args]
ztex.c:141:5: warning: format '%s' expects argument of type 'char *', but argument 2 has type 'int' [-Wformat]
ztex.c:141:5: warning: too many arguments for format [-Wformat-extra-args]
ztex.c: In function 'ztex_upload_bitstream':
ztex.c:626:5: warning: too many arguments for format [-Wformat-extra-args]
ztex.c: In function 'ihx_load_data':
ztex.c:690:3: warning: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Wformat]
roycewilliams commented 5 years ago

After the patch:

$ ./john --progress-every=60 --format=bcrypt-ztex --wordlist=[/path/to]/rockyou.txt.7440 hashlist
Using default input encoding: UTF-8
Loaded 721151 password hashes with 721151 different salts (bcrypt-ztex [Blowfish ZTEX])
Remaining 721149 password hashes with 721149 different salts
Cost 1 (iteration count) is 4096 for all loaded hashes
SN 25: firmware uploaded
SN 24: firmware uploaded
SN 23: firmware uploaded
SN 22: firmware uploaded
SN 21: firmware uploaded
SN 20: firmware uploaded
SN 19: firmware uploaded
SN 18: firmware uploaded
SN 17: firmware uploaded
SN 16: firmware uploaded
SN 15: firmware uploaded
SN 14: firmware uploaded
SN 13: firmware uploaded
SN 12: firmware uploaded
SN 11: firmware uploaded
SN 10: firmware uploaded
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
9 device(s) lost after firmware upload
SN 18: uploading bitstreams.. ok
SN 12: uploading bitstreams.. ok
SN 17: uploading bitstreams.. ok
SN 15: uploading bitstreams.. ok
SN 22: uploading bitstreams.. ok
SN 25: uploading bitstreams.. ok
SN 19: uploading bitstreams.. ok
ZTEX 18 bus:1 dev:121 Frequency:150 150 150 150
ZTEX 12 bus:1 dev:120 Frequency:150 150 150 150
ZTEX 17 bus:1 dev:119 Frequency:150 150 150 150
ZTEX 15 bus:1 dev:118 Frequency:150 150 150 150
ZTEX 22 bus:1 dev:117 Frequency:150 150 150 150
ZTEX 25 bus:1 dev:116 Frequency:150 150 150 150
ZTEX 19 bus:1 dev:115 Frequency:150 150 150 150
7 device(s) ZTEX 1.15y ready
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
Warning: Only 7440 candidates left, minimum 524288 needed for performance.
Warning: TargetSetting=5, processing hash with setting=12, expecting suboptimal performance or timeout, consider to adjust TargetSetting in john.conf
SN 10: uploading bitstreams.. ok
SN 13: uploading bitstreams.. ok
SN 20: uploading bitstreams.. ok
SN 23: uploading bitstreams.. ok
SN 11: uploading bitstreams.. ok
SN 16: uploading bitstreams.. ok
SN 14: uploading bitstreams.. ok
SN 21: uploading bitstreams.. ok
SN 24: uploading bitstreams.. ok
Found 9 device(s) ZTEX 1.15y
SN: 24 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:122
SN: 21 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:123
SN: 14 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:124
SN: 16 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:125
SN: 11 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:126
SN: 23 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:127
SN: 20 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:11
SN: 13 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:12
SN: 10 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:13
SN 18: Timeout.
SN 12: Timeout.
SN 17: Timeout.
SN 15: Timeout.
SN 22: Timeout.
SN 25: Timeout.
SN 19: Timeout.
Found 7 device(s) ZTEX 1.15y
SN: 19 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:115
SN: 25 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:116
SN: 22 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:117
SN: 15 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:118
SN: 17 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:119
SN: 12 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:120
SN: 18 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:121
0g 0:00:01:00 DONE (2019-05-16 16:24) 0g/s 0p/s 9711c/s 9711C/s 123456..carole
0g 0:00:02:00 DONE (2019-05-16 16:25) 0g/s 0p/s 11675c/s 11675C/s 123456..carole
0g 0:00:03:00 DONE (2019-05-16 16:26) 0g/s 0p/s 12335c/s 12335C/s 123456..carole
0g 0:00:04:00 DONE (2019-05-16 16:27) 0g/s 0p/s 12672c/s 12672C/s 123456..carole
solardiz commented 5 years ago

Thanks, Royce. Please retest after fixing this to match your hashes, so that the timeouts would be gone:

Warning: TargetSetting=5, processing hash with setting=12, expecting suboptimal performance or timeout, consider to adjust TargetSetting in john.conf

I think we're almost done here - we got errors from libusb like we expected, and there's probably little we can do about them. The devices are later rediscovered, and this annoyance will only be happening the first time you run JtR after devices power-on as they retain our firmware then.

solardiz commented 5 years ago

Unrelated: looks like your 7440 words were tuned as the lowest sensible for 15 boards (as 124*4*15), but you now have 16.

roycewilliams commented 5 years ago

That 16th board comes and goes, and I cannot yet predict when - so I assume worst case. :)

With your patch, users will be able to correctly diagnose that the issue is libusb and not some other issue, and the segfault is avoided, so I support your idea that this is approaching done.

Properly tuned for cost 12:

Using default input encoding: UTF-8
Loaded 721151 password hashes with 721151 different salts (bcrypt-ztex [Blowfish ZTEX])
Remaining 721149 password hashes with 721149 different salts
Cost 1 (iteration count) is 4096 for all loaded hashes
SN 25: firmware uploaded
SN 24: firmware uploaded
SN 23: firmware uploaded
SN 22: firmware uploaded
SN 21: firmware uploaded
SN 20: firmware uploaded
SN 19: firmware uploaded
SN 18: firmware uploaded
SN 16: firmware uploaded
SN 15: firmware uploaded
SN 14: firmware uploaded
SN 13: firmware uploaded
SN 12: firmware uploaded
SN 11: firmware uploaded
SN 10: firmware uploaded
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
8 device(s) lost after firmware upload
SN 24: uploading bitstreams.. ok
SN 19: uploading bitstreams.. ok
SN 16: uploading bitstreams.. ok
SN 12: uploading bitstreams.. ok
SN 22: uploading bitstreams.. ok
SN 15: uploading bitstreams.. ok
SN 25: uploading bitstreams.. ok
ZTEX 24 bus:1 dev:39 Frequency:150 150 150 150
ZTEX 19 bus:1 dev:38 Frequency:150 150 150 150
ZTEX 16 bus:1 dev:37 Frequency:150 150 150 150
ZTEX 12 bus:1 dev:36 Frequency:150 150 150 150
ZTEX 22 bus:1 dev:35 Frequency:150 150 150 150
ZTEX 15 bus:1 dev:34 Frequency:150 150 150 150
ZTEX 25 bus:1 dev:33 Frequency:150 150 150 150
7 device(s) ZTEX 1.15y ready
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Press 'q' or Ctrl-C to abort, almost any other key for status
SN 10: uploading bitstreams.. ok
SN 20: uploading bitstreams.. ok
SN 13: uploading bitstreams.. ok
SN 23: uploading bitstreams.. ok
SN 18: uploading bitstreams.. ok
SN 11: uploading bitstreams.. ok
SN 21: uploading bitstreams.. ok
SN 14: uploading bitstreams.. ok
Found 8 device(s) ZTEX 1.15y
SN: 14 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:40
SN: 21 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:41
SN: 11 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:42
SN: 18 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:43
SN: 23 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:44
SN: 13 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:45
SN: 20 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:46
SN: 10 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:47
SN 24: Timeout.
SN 19: Timeout.
SN 16: Timeout.
SN 12: Timeout.
SN 22: Timeout.
SN 15: Timeout.
SN 25: Timeout.
Found 7 device(s) ZTEX 1.15y
SN: 25 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:33
SN: 15 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:34
SN: 22 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:35
SN: 12 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:36
SN: 16 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:37
SN: 19 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:38
SN: 24 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:39
0g 0:00:01:00 93.22% (ETA: 02:26:32) 0g/s 0p/s 9474c/s 9474C/s 123456..bedroom
0g 0:00:02:00 93.22% (ETA: 02:27:36) 0g/s 0p/s 11154c/s 11154C/s 123456..bedroom
0g 0:00:03:00 93.22% (ETA: 02:28:42) 0g/s 0p/s 11727c/s 11727C/s 123456..bedroom
0g 0:00:04:00 93.22% (ETA: 02:29:46) 0g/s 0p/s 12012c/s 12012C/s 123456..bedroom
0g 0:00:05:00 93.22% (ETA: 02:30:50) 0g/s 0p/s 12179c/s 12179C/s 123456..bedroom
0g 0:00:06:00 93.22% (ETA: 02:31:55) 0g/s 0p/s 12285c/s 12285C/s 123456..bedroom
0g 0:00:07:00 93.22% (ETA: 02:32:58) 0g/s 0p/s 12363c/s 12363C/s 123456..bedroom
[.]
1g 0:00:08:00 93.22% (ETA: 02:34:03) 0.002081g/s 0p/s 12418c/s 12418C/s 123456..bedroom
1g 0:00:09:00 93.22% (ETA: 02:35:08) 0.001850g/s 0p/s 12465c/s 12465C/s 123456..bedroom
1g 0:00:10:00 93.22% (ETA: 02:36:11) 0.001666g/s 0p/s 12505c/s 12505C/s 123456..bedroom
1g 0:00:11:00 93.22% (ETA: 02:37:15) 0.001514g/s 0p/s 12538c/s 12538C/s 123456..bedroom
1g 0:00:12:00 93.22% (ETA: 02:38:21) 0.001388g/s 0p/s 12559c/s 12559C/s 123456..bedroom
[.]
2g 0:00:13:00 93.22% (ETA: 02:39:25) 0.002562g/s 0p/s 12578c/s 12578C/s 123456..bedroom
[.]
3g 0:00:14:00 93.22% (ETA: 02:40:29) 0.003570g/s 0p/s 12594c/s 12594C/s 123456..bedroom
[.]
4g 0:00:15:00 93.22% (ETA: 02:41:34) 0.004442g/s 0p/s 12610c/s 12610C/s 123456..bedroom
[.]
4g 0:00:16:00 93.22% (ETA: 02:42:38) 0.004164g/s 0p/s 12623c/s 12623C/s 123456..bedroom
4g 0:00:17:00 93.22% (ETA: 02:43:43) 0.003919g/s 0p/s 12635c/s 12635C/s 123456..bedroom
4g 0:00:18:00 93.22% (ETA: 02:44:47) 0.003701g/s 0p/s 12646c/s 12646C/s 123456..bedroom
[.]
5g 0:00:19:00 93.22% (ETA: 02:45:51) 0.004383g/s 0p/s 12657c/s 12657C/s 123456..bedroom
5g 0:00:20:00 93.22% (ETA: 02:46:55) 0.004166g/s 0p/s 12665c/s 12665C/s 123456..bedroom
[.]
6g 0:00:21:00 93.22% (ETA: 02:48:00) 0.004759g/s 0p/s 12675c/s 12675C/s 123456..bedroom
6g 0:00:22:00 93.22% (ETA: 02:49:04) 0.004543g/s 0p/s 12683c/s 12683C/s 123456..bedroom
[.]
7g 0:00:23:00 93.22% (ETA: 02:50:09) 0.005071g/s 0p/s 12691c/s 12691C/s 123456..bedroom
[.]
8g 0:00:24:00 93.22% (ETA: 02:51:13) 0.005553g/s 0p/s 12696c/s 12696C/s 123456..bedroom
[.]
9g 0:00:25:00 93.22% (ETA: 02:52:18) 0.005998g/s 0p/s 12700c/s 12700C/s 123456..bedroom
solardiz commented 5 years ago

Thanks, Royce. In your latest test, somehow 7 devices timeout and are rediscovered, but this looks like a system issue (libusb, udev, kernel, or/and hardware) and probably nothing we can reasonably deal with here. So I intend to apply the patch avoiding the segfault, and close this issue.

roycewilliams commented 5 years ago

If it's a system issue, it's only triggered by the difference between the previous behavior and the behavior of this patch.

Without this patch, I can consistently upload firmware to all devices (which then segfaults), and then when I re-run the command, I can consistently upload bitstreams to all devices. With this patch, I consistently get the behavior above.

roycewilliams commented 5 years ago

3982 does address the segfault:

$ ./john --progress-every=60 --format=bcrypt-ztex --wordlist=/var/local/hashbench/wordlists/rockyou.txt.7440 hashes.txt
Using default input encoding: UTF-8
Loaded 721151 password hashes with 721151 different salts (bcrypt-ztex [Blowfish ZTEX])
Remaining 721140 password hashes with 721140 different salts
Cost 1 (iteration count) is 4096 for all loaded hashes
SN 25: firmware uploaded
SN 24: firmware uploaded
SN 23: firmware uploaded
SN 22: firmware uploaded
SN 21: firmware uploaded
SN 20: firmware uploaded
SN 19: firmware uploaded
SN 18: firmware uploaded
SN 16: firmware uploaded
SN 15: firmware uploaded
SN 14: firmware uploaded
SN 13: firmware uploaded
SN 12: firmware uploaded
SN 11: firmware uploaded
SN 10: firmware uploaded
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
ztex_device_new: libusb_get_string_descriptor_ascii(iSerialNumber) returns -1 (LIBUSB_ERROR_IO)
7 device(s) lost after firmware upload
SN 18: uploading bitstreams.. ok
SN 21: uploading bitstreams.. ok
SN 16: uploading bitstreams.. ok
SN 12: uploading bitstreams.. ok
SN 15: uploading bitstreams.. ok
SN 25: uploading bitstreams.. ok
SN 19: uploading bitstreams.. ok
SN 22: uploading bitstreams.. ok
ZTEX 18 bus:1 dev:98 Frequency:150 150 150 150
ZTEX 21 bus:1 dev:97 Frequency:150 150 150 150
ZTEX 16 bus:1 dev:96 Frequency:150 150 150 150
ZTEX 12 bus:1 dev:95 Frequency:150 150 150 150
ZTEX 15 bus:1 dev:94 Frequency:150 150 150 150
ZTEX 25 bus:1 dev:93 Frequency:150 150 150 150
ZTEX 19 bus:1 dev:92 Frequency:150 150 150 150
ZTEX 22 bus:1 dev:91 Frequency:150 150 150 150
8 device(s) ZTEX 1.15y ready
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Press 'q' or Ctrl-C to abort, almost any other key for status
Warning: Only 7440 candidates left, minimum 7936 needed for performance.
SN 10: uploading bitstreams.. ok
SN 13: uploading bitstreams.. ok
SN 23: uploading bitstreams.. ok
SN 20: uploading bitstreams.. ok
SN 11: uploading bitstreams.. ok
SN 14: uploading bitstreams.. ok
SN 24: uploading bitstreams.. ok
Found 7 device(s) ZTEX 1.15y
SN: 24 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:99
SN: 14 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:100
SN: 11 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:101
SN: 20 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:102
SN: 23 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:103
SN: 13 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:104
SN: 10 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:105
SN 18: Timeout.
SN 21: Timeout.
SN 16: Timeout.
SN 12: Timeout.
SN 15: Timeout.
SN 25: Timeout.
SN 19: Timeout.
SN 22: Timeout.
Found 8 device(s) ZTEX 1.15y
SN: 22 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:91
SN: 19 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:92
SN: 25 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:93
SN: 15 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:94
SN: 12 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:95
SN: 16 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:96
SN: 21 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:97
SN: 18 productId: 10.15.0.0 "inouttraffic JtR 1.8.x" busnum:1 devnum:98
0g 0:00:01:00 DONE (2019-05-22 02:53) 0g/s 0p/s 10119c/s 10119C/s 123456..carole

Not segfaulting is an obvious win and resolves this issue. But the trade-off seems odd to me, when prior to these patches, the behavior that I noted immediately above presents the way that it does.

magnumripper commented 5 years ago

On an unrelated note, why does it say JtR 1.8.x here?

roycewilliams commented 5 years ago

I don't immediately see where you're seeing that. Here's the version that was my latest test above:

John the Ripper 1.9.0-jumbo-1+bleeding-8d9ff4ac3 2019-05-21 23:31:13 +0200 OMP [linux-gnueabihf 32-bit armv7l NEON AC]

Edit: ah, I see - in the 'inouttraffic' component. I assume that since it's entirely included in JtR that it's just cosmetic?

roycewilliams commented 5 years ago

Could this be something as simple as just needed a longer wait time between the firmware upload and the bitstream upload?

Maybe @Apingis could weigh in, both for the 1.8.x and for the uploading timing issue.

solardiz commented 5 years ago

why does it say JtR 1.8.x here?

This is somewhat tricky to edit, so I chose not to bother, nor to bother @Apingis with a request to make this edit prior to our release (especially given that he was mostly unavailable at the time). I imagine the easiest way to edit this string is in the .ihx file, without actually recompiling the firmware, and in the host code for what string to expect. Should we maybe use JtR-unrelated versioning scheme there, so that this doesn't become confusing each time we make a new major release?

the uploading timing issue.

This could indeed be timing, or it could be us no longer restarting the initialization from scratch like you would have with the segfault. We could workaround the issue by mimicking the old behavior more closely, just without the segfault, but this feels in some ways wrong. The easiest we can do is have the process print libusb's error message and exit, so that you'd restart. Would you be more comfortable with that than with the current segfault fix? A much more complex and error-prone approach is to mimick the same, including your manual program restart, but without actual manual work for you.

I also would like to hear any thoughts @Apingis might have on this.

roycewilliams commented 5 years ago

Re the potential timing issue, since the current workaround is even easier than the previous one, I leave it to your best judgement depending on other priorities. That being said, my (admittedly naive) theory is that there's something inherent in how the boards are designed that triggers this issue, rather than a USB reliability problem or something that is unique my particular boards or setup - especially since other software that uses these boards (such as the reference FWLoader) and the original Bitcoin-mining software doesn't trigger this symptom.

solardiz commented 5 years ago

My best judgement is to close this issue and be done with it. But since you're the only user affected (currently and as far as we're aware), I asked whether it'd be more convenient for you if we had the program exit upon encountering this error, so you'd manually restart and avoid the timeouts. Can you answer this directly?

theory is that there's something inherent in how the boards are designed that triggers this issue

Hardly, especially given that the issue doesn't show up with fewer boards, where the timings are tougher (less time passed between firmware upload and bitstream upload, and between bitstream upload and start of cracking).

What happens here is that we're stressing the USB subsystem. As far as the (software) USB subsystem is concerned, these devices disappear and reappear as slightly different ones when we replace the firmware. Having a lot of devices disappear and reappear in a short period of time is stressful for that subsystem.

other software that uses these boards (such as the reference FWLoader) and the original Bitcoin-mining software doesn't trigger this symptom.

They probably don't upload new firmware to all boards in such rapid succession, yet expect them to be available in the system again and usable almost right away. IIRC, you run FWLoader as a separate program, and then the miner as another separate program, so it's similar to the workaround you had been using with JtR back when it'd segfault the first time.

roycewilliams commented 5 years ago

I don't recall about the ZTEX reference btcminer, but others (like cgminer) handle the entire sequence.

Making a device disappear and then reappear sounds like it's inherent to how USB works. To mean, this means that it's not an USB error, but rather than JtR is basically trying too soon (and those attempts to try will never work, so why try that soon?) In other words, it sounds like a (minor, but nevertheless actual) design flaw to me.

But if addressing that issue isn't worth the cycles (and I'd totally understand if it wasn't), speaking personally, I'd prefer than John proceed, rather than exit.

But if so, I also think that this should be clearly documented in ZTEX that this symptom is inherent in how USB works and that the current behavior is something that JtR is choosing to do, rather than an actual USB problem. Otherwise, any (hypothentical) future users will be confused and will be trying to troubleshoot an issue that isn't an issue at all.

I would also suggest that a message just prior to the load - something like "initial LIBUSB errors during firmware upload for larger groups of boards are due to natural USB device resets (not true errors)"

roycewilliams commented 5 years ago

Naively, couldn't this be resolved in a way that is true to the nature of how USB works, with something super-simple like (pseudocode):

load_firmware()

if (ztex_count > 6) { echo "Waiting a few seconds for USB devices to settle ..." sleep 3 }

?

solardiz commented 5 years ago

Does cgminer upload firmware? I used it for testing of ZTEX boards before, and don't recall it doing that. I also don't see that in the GitHub repo you referenced now. I think it just assumes suitable firmware is already in there (perhaps coming from EEPROM).

JtR is basically trying too soon (and those attempts to try will never work, so why try that soon?)

Even if so (which isn't known for sure yet), how would JtR know? It only attempts to use devices that the USB subsystem already recognizes. Formally, it doesn't try to do anything that is bound to fail. We only know that things fail from your report.

We could add warnings and documentation, but we don't know whether the same symptoms would be seen on any other system than yours.

Please feel free to implement and test the workaround you propose, and let us know if it works. (I doubt that it will, but it might. There might be more involved which helps avoid the issue when you were doing a full restart like before.) If it does help, we could then implement this perhaps with the "board count threshold" and "delay in seconds" parameters as john.conf settings. Thanks!

solardiz commented 5 years ago

BTW, one way to avoid the problem would be to flash our firmware into EEPROM. Then JtR will skip firmware upload just like it does when you're using it for the second time and on. But since firmware update in EEPROM is a risky procedure (even if officially supported by ZTEX), we don't recommend it.

roycewilliams commented 5 years ago

Ah, you're right - it's been long enough now that I'd forgotten about that separation.

I'll try to add a simple test to see if it helps.