Does not work on 32-bit ARM (or maybe any 32-bit arch?) with raw binary data

romanrm commented 7 months ago

With a source code like this:

<?php
require("base85.class.php");

$encoded = base85::encode(md5("test", true));
echo $encoded, "\n";
?>

Proper output on amd64 looks like this:

$ php test.php 
$'/lH7Np6%b2,mG-7?1o

But on ARMv7 (32-bit) results in a garbled mess (also binary data in output):

$ php test.php 
�-7?1oNp6%

PHP version is the same on both, 7.4

scottchiefbaker commented 7 months ago

I suspect this has something to do with 64bit vs 32bit. Testing 32bit PHP is not something I can do easily. Are you able to run the unit tests? Do they all fail?

The ONLY thing in the code that should care about 32bit would be the pack() and unpack() calls, but that seems to already be tied to the 32 bit calls?

scottchiefbaker commented 7 months ago

I forgot I have an old Raspberry Pi 3 that I could test with. These tests were run on that Pi which is running PHP v7.4.33. If I take the above sample input and do some debugging I see this:

$input = hex2bin(098f6bcd4621d373cade4e832627b4f6);
Length: 16 bytes / Padding: 0

int(160394189)
int(1176621939)
int(-891400573)
int(640136438)

We're seeing a total of 16 bytes which we break up in 4x 32bit unsigned longs using unpack().

If I run the same code on my x86_64 Ryzen box I get:

$input = hex2bin(098f6bcd4621d373cade4e832627b4f6);
Length: 16 bytes / Padding: 0

int(160394189)
int(1176621939)
int(3403566723)
int(640136438)

Something is funky with unpack(N*) on 32bit ARM, but only for that third chunk? If I look at the unpack() docs it clearly says that N is an unsigned 32 bit long. If the docs are correct it should be impossible to get a negative number. I'm guessing this is a PHP bug of some kind. I suspect this library will work just fine if we can figure out what unpack() is doing wrong.

scottchiefbaker commented 7 months ago

I upgraded to Debian 12 which gives me PHP v8.2.7 and I'm getting the same error with the weird third chunk. Not sure what the next steps are. Either:

I'm misunderstanding unpack()
It's a PHP bug

romanrm commented 7 months ago

See this comment, seems to be related: https://www.php.net/manual/en/function.unpack.php#106041

scottchiefbaker commented 7 months ago

That gets us closer... I was able to refactor the code to not use unpack() with some trickery from that comment. Now I get the correct values (kind-of):

string(9) "160394189"
string(10) "1176621939"
string(10) "3403566723"
string(9) "640136438"

I need those as integers not strings, and when I try and convert that third byte with intval() it truncates it to PHP_INT_MAX because it's greater than 2**31.

int(160394189)
int(1176621939)
int(2147483647)
int(640136438)

Not to mention that we still have to do some 64 bit math with this number later to calculate the appropriate base85 character. Without a massive refactor I don't know if I can get this to work. We rely on 64bit math under the hood.

scottchiefbaker commented 7 months ago

With some help from ChatGPT I was able to figure out how to use gmp with PHP to do math with numbers larger than 2^32. Check out the latest code in the repo and see if it works for you. I'm passing all the units tests on 32bit now.

$ php tests/tests.php
Encode: Null                                                                    [  OK  ]
Encode: Four nulls                                                              [  OK  ]
Encode: Single space                                                            [  OK  ]
Encode: Four spaces = 'y'                                                       [  OK  ]
Encode: Null in middle                                                          [  OK  ]
Encode: Four null in middle                                                     [  OK  ]
Encode: Hello world                                                             [  OK  ]
Encode: Quick brown fox                                                         [  OK  ]
Encode: String #1                                                               [  OK  ]
Encode: String #2                                                               [  OK  ]
Encode: String #3                                                               [  OK  ]
Encode: String #4                                                               [  OK  ]
Encode: Unprintable chars                                                       [  OK  ]
Encode: Github issue #2                                                         [  OK  ]

All 14 tests passed

I'm not doing anything special in decode() yet. I suspect that is still broken on 32bit, but I haven't tested.

scottchiefbaker commented 7 months ago

I went ahead and added a bunch of decode() units tests and they all pass fine on 64 and 32bit. I guess the decode() code isn't affected. I think we're 100% now?

Please test and let me know.

romanrm commented 7 months ago

I encoded and decoded a PNG file, and it came out identical by checksum, so it works.

Bad news though: it is really really slow.

On a small ARMv7 system at 1 GHz (Allwinner A20) it took 5 minutes 17 seconds to encode 1 MB. For comparison, encoding the same test file using PHP's base64_encode, takes 0.25 seconds.

Decoding the base85 file took 3 seconds, so that part is fine.

scottchiefbaker commented 7 months ago

That doesn't really surprise me. I imagine doing a ton of intensive math with gmp on a 32bit processor adds a bunch of overhead to work around the limitations of the CPU. I don't know enough about things to optimize it much farther. If you think you can speed it up I'm open to PRs. The unit tests are pretty comprehensive now so it should be easy to test if things are working or not.

The world has moved on to 64bit CPUs. After diving into all this madness, I now see why :)

romanrm commented 7 months ago

Would it be possible to get back to the initial method as shown in https://github.com/scottchiefbaker/php-base85/issues/2#issuecomment-2038821556 and only use a GMP codepath if the number ends up being negative (indicating overflow)?

But overall I would agree it's not worth it to spend a ton of effort and add much complexity just to better support 32-bit systems. As is, even a slow fallback is fine, main thing is that it now gives correct results. Only if someone would see this as a cool hacking project or a challenge to solve it in a nice way. :)

scottchiefbaker commented 7 months ago

Oh that's an interesting idea... I just landed some code that only goes the slow/gmp way if it can't do the math in 32bit. Try the latest commit and see if it's faster.

romanrm commented 7 months ago

It now takes 12 seconds to encode, but the encoded result is different than before, and decoding it results in a differing PNG file, that's also broken and not viewable. Did your unit tests pass? Do they include encoding and verifying binary data, and not just ASCII (probably not)?

romanrm commented 7 months ago

Here's how it is different, for a general idea: 2024-04-06T122437Z-b85

scottchiefbaker commented 7 months ago

Interesting... all of my unit tests are passing. Clearly I'm missing something though. Take a look at the unit tests and see if you think I'm missing anything. I'm open to PRs for new unit tests.

$ php tests/tests.php
Encode: Null                                                                    [  OK  ]
Encode: Four nulls                                                              [  OK  ]
Encode: Single space                                                            [  OK  ]
Encode: Four spaces = 'y'                                                       [  OK  ]
Encode: Null in middle                                                          [  OK  ]
Encode: Four null in middle                                                     [  OK  ]
Encode: Hello world                                                             [  OK  ]
Encode: Quick brown fox                                                         [  OK  ]
Encode: String #1                                                               [  OK  ]
Encode: String #2                                                               [  OK  ]
Encode: String #3                                                               [  OK  ]
Encode: String #4                                                               [  OK  ]
Encode: Unprintable chars                                                       [  OK  ]
Encode: Github issue #2                                                         [  OK  ]
Decode: Null                                                                    [  OK  ]
Decode: Four nulls                                                              [  OK  ]
Decode: Single space                                                            [  OK  ]
Decode: Four spaces = 'y'                                                       [  OK  ]
Decode: Null in middle                                                          [  OK  ]
Decode: Four null in middle                                                     [  OK  ]
Decode: String #1                                                               [  OK  ]
Decode: String #2                                                               [  OK  ]
Decode: String #3                                                               [  OK  ]
Decode: String #4                                                               [  OK  ]
Decode: Unprintable chars                                                       [  OK  ]

All 25 tests passed

scottchiefbaker commented 7 months ago

What a weird bug... I spent 45 minutes tracking it down but I think I got it with bcc9583. See if that works for you. I also added unit tests for this scenario so it won't catch us again in the future. If this isn't it then I will need a small file that recreates this problem I can test with. Smaller the better.

scottchiefbaker commented 7 months ago

On my Raspberry Pi 3 using a 1MB file as a test:

Encode time: 3.6 seconds
Decode time: 1.2 seconds

Even on a 32bit platform the algorithm takes the "fast" path 81.6% of the time. The "slow" path is only taken when the integer chunk is greater than 2^31.

romanrm commented 7 months ago

Now it works correctly for me on that 1M PNG file.

Takes 13 seconds to encode, 3 seconds to decode.

Thanks!

romanrm commented 7 months ago

I suggest to add a check if we are running on a 32-bit system, and then verify that the GMP PHP extension is installed (function_exists(gmp_init)) and throw an error outright about that if not. Because otherwise it might work without any error for simple cases and the user might deploy it thinking everything is fine, but will later fail in more complex ones.

scottchiefbaker commented 7 months ago

That's not a bad idea... How should it look? Just a simple print() statement? I landed something in 21c5098 as a test.

romanrm commented 7 months ago

I would put it not into the encode, but right into the PHP script itself, even outside class declaration. So any script which includes the PHP file, will run the check at include-time once, and fail to continue if it's not met. Also no need to check for "first" then.

Better print it to STDERR, and terminate execution. Using trigger_error("Message", E_USER_ERROR) can achieve both.

Also do not expect to be running in a web context, there's a ton of utility in PHP to do "shell scripts" with a better language, not necessarily web apps. So I suggest to omit any HTML formatting.

scottchiefbaker / php-base85

Does not work on 32-bit ARM (or maybe any 32-bit arch?) with raw binary data #2