phpv8 / v8js

V8 Javascript Engine for PHP — This PHP extension embeds the Google V8 Javascript Engine
http://pecl.php.net/package/v8js
MIT License
1.84k stars 200 forks source link

Binary data converted? #454

Open theboredfish opened 4 years ago

theboredfish commented 4 years ago

Hey

Not sure whether this is a bug or a hole in knowledge...

I have a PHP function invoked from V8Js which returns raw binary data. The idea being that a user writing JS scripts in their sandbox can use this data with another V8Js-invoked PHP function to send it to an API.

The issue is the file appears to undergo a bit of conversion and i'm not sure of the best one-size-fits-all approach that'll work whether the file is binary or not. Ideally, no conversion at all would be best :)

We'll call the function called from JS PHP.get_file() for simplicity and assume that it does a basic file_get_contents() in PHP, returned directly.

The only thing that worked in this case after randomly trying stuff was, instead of returning the value directly, doing a quick conversion first:

function get_file(string $filename): string
{
   .... stuff ....
   $content = file_get_contents($filename);
   return mb_convert_encoding($content, 'UTF-8', 'ASCII');
}

which seems a little odd, given that PHP tells me that the result of the mb_convert_encoding is 46013 but the variable in JS it's directly assigned to now reports it as 32404....the correct number. mb_detect_encoding at this stage gives me nothing, so i hard-coded "ASCII" for testing.

Until i pass the binary data back into another PHP function from my JS. Now it's back to 46013 and a quick mb_convert_encoding the other way (ASCII from UTF-8) gives me back my original 32404 file. mb_detect_encoding before that gives me UTF-8

Encodings give me headaches on good days, so apologies if that was a long-winded way to explain a simple thing. I'm hoping that there is a way this can be resolved WITHOUT the JS code writer having to change their stuff.

Is this a bug? Or something that's easily solvable? Hoping that there's:

chrisbckr commented 1 year ago

V8JS_ZSTR uses v8::String::NewFromUtf8, so I think that is the problem. And explains why using mb_convert_encoding from ASCII to UTF-8 works. I think that we need to find a way to detect the value encoding and use NewFromOneByte instead. Maybe we can use mb_detect_encoding from PHP? But this approach will require mbstring module to be active.

chrisbckr commented 1 year ago

Hmmm found something interesting... zend_string.h => ZSTR_IS_VALID_UTF8() I'll test this!

redbullmarky commented 1 year ago

@chrisbckr This one is an old one but still a thorn in my side 😂 It's possibly a little down to my lack of knowledge around JS, too. Sounds like there's something in what you've found!! Very promising.

I'm just not sure whether it's my approach that's wrong, or whether there is something already suited to doing this.

chrisbckr commented 1 year ago

Need more tests (and improvements), but... v8js_convert.cc -> zval_to_v8js

        case IS_STRING:
            value_str = Z_STR_P(value);
            if (ZSTR_LEN(value_str) > std::numeric_limits<int>::max()) {
                zend_throw_exception(php_ce_v8js_exception,
                    "String exceeds maximum string length", 0);
                break;
            }

            zval fname, retval;
            zval params[2];
            ZVAL_STRING(&fname, "mb_check_encoding");
            ZVAL_COPY_VALUE(&params[0], value);
            ZVAL_STRING(&params[1], "UTF-8");

            if ((SUCCESS == call_user_function(CG(function_table), NULL, &fname,
                                    &retval, 2, params)) && (Z_TYPE(retval) == IS_TRUE)) {
                jsValue = V8JS_ZSTR(value_str);
            }
            else {
                jsValue = v8::String::NewFromOneByte(isolate, (unsigned char *)ZSTR_VAL(value_str), v8::NewStringType::kNormal, ZSTR_LEN(value_str)).ToLocalChecked();
            }

            break;

And at the end of Makefile.frag

# add mbstring extension, if available
ifneq (,$(realpath $(EXTENSION_DIR)/mbstring.so))
PHP_TEST_SHARED_EXTENSIONS+=-d extension=$(EXTENSION_DIR)/mbstring.so
endif

@stesie do you know any side effect of using NewFromOneByte instead of NewFromUtf8 in this case?

chrisbckr commented 1 year ago

And maybe we can enable this checking with a flag or something like this.

chrisbckr commented 1 year ago

And maybe we can enable this checking with a flag or something like this.

redbullmarky commented 1 year ago

And maybe we can enable this checking with a flag or something like this.

It feels like this is probably the safest option if the PR is a go, but I’ll be quite happy to give this one a good test! (The OP is me from my other account)

redbullmarky commented 1 year ago

i've been playing around with this test script, along with a PNG

<?php

class Foo extends \V8Js
{
    private static $file = __DIR__ . '/test.png';

    public function read(): string
    {
        return file_get_contents(self::$file);
    }

    public function write(string $content)
    {
        $oldContent = file_get_contents(self::$file);
        echo ($oldContent === $content) ? "Files are the same!" : "Files are different :(";
        echo "\n" . strlen($oldContent) . ' vs ' . strlen($content) . "\n"; // 6339 vs 9362 :(
        echo mb_detect_encoding($content) . "\n"; // utf-8 :(
    }
}

$v = new Foo();
$result = $v->executeString('
    const content = PHP.read();
    print(typeof content + "\n"); // outputs: string
        print(content.length + "\n"); // same as what the actual filesize is :)
    PHP.write(content);
');

Things work great on read() !!

Just struggling with the changes needed around v8js_to_zval to be able to send the thing back. Encoding is detected as UTF-8, so I guess the same preservation treatment for these kind of things needs doing on the way BACK to PHP?

chrisbckr commented 1 year ago

Good point. I will investigate this.

chrisbckr commented 1 year ago

Test this @redbullmarky ! https://github.com/chrisbckr/v8js/tree/php8-binary_data

redbullmarky commented 1 year ago

@chrisbckr ahh man, turns out what I was trying to do wasn’t far off ! Sure, will have a play in a little bit when I get a little time!

redbullmarky commented 1 year ago

Do you know if it’s recursive? e.g. if I have a PHP array and perhaps one string is regular (like a filename) and the other is binary (like the content), does it iterate into the array calling these conversions?

Edit: Had a quick play, and it certainly does handle strings inside of arrays etc, too!

My crude test (I used force array flag to keep it as an array going both ways, else it's a V8Object):

<?php

class Foo extends \V8Js
{
    private static $file = __DIR__ . '/test.png';

    public function read(): string
    {
        echo "File: " . self::$file . "\n";
        return file_get_contents(self::$file);
    }

    public function readArray(): array
    {
        return [
            'file' => self::$file,
            'content' => $this->read()
        ];
    }

    public function write(string $content)
    {
        $oldContent = file_get_contents(self::$file);
        echo ($oldContent === $content) ? "Files are the same!" : "Files are different :(";
        echo "\n" . strlen($oldContent) . ' vs ' . strlen($content) . "\n";
        echo mb_detect_encoding($content) . "\n"; // utf-8 :(
    }

    public function writeArray($array)
    {
        $this->write($array['content']);
        echo (strlen($array['file']) === strlen(self::$file) ? "YES!" : "NO") . "\n";
    }

    public function outLen(string $str)
    {
        echo "str is " . strlen($str) . "\n";
    }
}

$v = new Foo();
$result = $v->executeString('
    const content = PHP.read();
    PHP.write(content);

    const arr = PHP.readArray();
    PHP.writeArray(arr);

    PHP.outLen(arr.file);
    PHP.outLen(arr.content);
', null, V8Js::FLAG_FORCE_ARRAY);
chrisbckr commented 1 year ago

Just added a flag V8Js::FLAG_CHECK_BINARY_STRING to enable this feature(?) hehe

redbullmarky commented 1 year ago

Awesome! One thing i picked up on testing (as mentioned above, in having to use the FORCE_ARRAY flag - and probably another topic/issue but more of a passing thought for now hehe) - is it possible that when an array (associative or otherwise) is passed FROM php to JS, it can also maintain its type when passed back? e.g. if i do this without force array flag:

class Foo extends V8Js
{
   public function getArray()
   {
      return ['foo' => 'bar'];
   }

  public function setArray(array $array)
  {
  }
}
$v = new Foo();
$v->executeString('
   const arr = PHP.getArray();
   PHP.setArray(arr); 
');

Then you get:

PHP Fatal error:  Uncaught TypeError: Foo::setArray(): Argument #1 ($array) must be of type array, V8Object given. 

I get why returning regular objects is useful, and it probably opens up a can of worms as to WHEN you might want a V8Object or an array (like, what if you constructed an "associative array" in JS, and passed it - how would you know whether to cast it to an array for PHP, or keep it as a V8Object? etc)

redbullmarky commented 1 year ago

Otherwise - I love that this one could potentially solve a lot of headaches here! Awesome stuff :)

stesie commented 1 year ago

Hej, sorry for the kinda late reply, since you've already started coding. While I like the straight-forward solution and agree that it solves the raised issue, ... I still wonder if it's the right way to do it or the idiomatic solution. If I get the current implementation correctly it still wouldn't (easily) be possible to read an UTF-16 file (string data) and convert it to a JS string. Since the implementation falls back to single byte strings (and you can't make it invoke the fromUtf16 method).

To generalize further ...

In plain PHP you usually handle binary data in strings, since PHP doesn't care about the encoding at all. Which however has its limitations, since e.g. \strrev or \strlen just don't care about encoding and possibly lead to wrong/unexpected results. (and hence the mbstring extension that has function variants that handle encodings (most commonly utf-8 nowadays) well.

Contrary JavaScript land: strings "usually" are used for "real strings" only. Binary data is idiomatically handled via Buffer (node) or ArrayBuffer (ES/language itself).

So far php-v8js bridges PHP strings to JS strings just fine as long as you stick to the de facto "standard" of encoding strings with utf-8. And this has been fine for quite a while.

Hence from an idiomatic standpoint I think it'd be better that we'd allow to export a PHP string (i.e. zend_string) to V8 as a Uint8Array. We could even make it an external Uint8Array so data wouldn't have to be copied in memory from PHP to V8 (and likely vice versa) -- so as a benefit it'd also be suitable to pass around larger data sets efficiently.

And in contiunuation of the UTF-16 example above, if we'd provide the plain ArrayBuffer to JS land, a v8js user could simply do something like new TextDecoder('utf-16').decode(PHP.arrayBufferInstance) to decode it to a JS string (as they see fit)

From PHP side this could look like this:

$v8 = new V8Js();

$binaryBlob = \file_get_contents('blob.bin');
$temp = new V8Uint8Array($binaryBlob);  // this wraps the string
// (and ref-counts it, so it remains if user does "$binaryBlob = null")
// ... and it's also a marker for V8Js itself

$v8->arrayBufferInstance  = $temp; // make ArrayBuffer available to JS land (`zval_to_v8js` detects it)

... that said, I've just tried passing back a Uint8Array from JS to PHP and it crashes (on my MacBook, haven't tried with Linux):

php > $v8 = new V8Js();
php > $a = $v8->executeString(' (new Uint8Array([65,66,67])); ');
Segmentation fault: 11

Aside from that, and contrary to the new V8Uint8Array approach outlined above (which would involve providing more specialized V8Js/PHP classes) ... and for sake of consistency would involve converting passed-back JS Uint8Array to the php-v8js specific variant ... it might be easier to just go with plain old \V8Object that's around for a long time.

We could offer a V8Js::makeUint8Array(string $blob) method, that may be used to wrap any string as a JS-side Uint8Array and track it via \V8Object on PHP side. So you could do something like this:

$v8 = new V8Js();

$binaryBlob = \file_get_contents('blob.bin');
$v8->arrayBufferInstance = $v8->makeUint8Array($binaryBlob);

$v8->executeString('var_dump(PHP.arrayBufferInstance instanceof Uint8Array)');  // true

Let me know what you think of this alternative approach. Looking forward to your replies.

redbullmarky commented 1 year ago

... that said, I've just tried passing back a Uint8Array from JS to PHP and it crashes (on my MacBook, haven't tried with Linux):

I had the same when playing with ArrayObject, Uint8Array, etc on Linux. Passing them to PHP segfaults.

Anyways - if the coded approach isn't "JS-like" then any solutions would be welcome! I the thing I'd need to work out would be how to handle a PHP V8Js extension function potentially returning different things with different prototypes. e.g. if I had a function called readFile on the V8Js instance, then:

  1. When it reads a plain text file, it comes back as a string - so all of the JS string prototype functions are available.
  2. When it reads binary data, it would come back as something else. So you couldn't handle it the same way if you're looping a list of files, for example.

In the solution @chrisbckr produced, it was possible to get a binary (or other) string, do whatever with it, and pass the binary strings back to PHP completely in-tact to a function typehinted as a string. I don't know what the difference would be from his NewFromOneByte approach and a Uint8Array would be. Perhaps they're just very different things and as such both could be supported, even if via a V8Js function rather than automatically? This is where my knowledge lacks a fair bit 😂

I think though, if it getting the length (in bytes/chars) worked the same for both, and both were able to be passed to a V8Js typehinted function that accepted strings, e.g.

$v8js->writeFile = function (string $id, string $content) {
   // write the file to disk
}

Then perhaps 99% of our issues are solved. The current workaround I have is to return a PHP object that has a toString() method, and implements most of the common prototype functions of JS String object via call() and __get() - so it somewhat behaves identically to a JS string as far as the JS user is concerned. Any operations performed on it (substr, length, startsWith, etc etc) are all done via these PHP methods and therefore give me the results I need as they're performed directly in PHP with no conversion. And any PHP functions i call from JS and pass this "string" to as a parameter will automatically cast it to a real string. It works well, but it's certainly not ideal and has its drawbacks...

chrisbckr commented 1 year ago

@stesie I agree. My proposed solution keeps what PHP do with binary strings: it's not reliable and can (and probably will) give headaches some time in the future. Just need mb_check_encoding return a false positive or false negative.

Indeed using Uint8Array will be the correct way to guarantee that the binary data will be 100% accurate all the time. And the control (and responsability) will be with the user, if something goes wrong, he/she knows where the problem is and how to fix. And having Uint8Array as a tools, this will be easy and reliable.

Personally I don't care to have a little more lines of code and assure that my data will not corrupt.

But I just didn't get how to use V8Object for this? Will create a property to set that it has a Uint8Array data? And maybe use this in the future for other cases? Or a new class V8Uint8Array that will extend V8Object? Because if we need to pass a Uint8Array to PHP we can use V8UintArray or something like V8Object::isUint8Array() (and maybe __tostring() to retrieve the binary on PHP side).

chrisbckr commented 1 year ago

And in fact PHP has the flag IS_STR_VALID_UTF8 but... I didn't found any string that has this flag set hahaha It's basically useless 😬

chrisbckr commented 1 year ago

@redbullmarky If V8Js had a "makeUint8Array" you could create a wrapper (maybe with __get) to return Uint8Array for some functions like

class V8Env extends \V8Js {
    public function getUint8Array($fn, ...$args) {
    return $this->makeUint8Array($fn(...$args));
    }
}

$v8 = new V8Env;
$v8->file_get_contents = fn(...$args) => $v8->getUint8Array('file_get_contents', ...$args);
$v8->executeString('var_dump(PHP.file_get_contents("binaryFile.bin"));');

or

class V8Env extends \V8Js {
    public function makeUint8Array($bla) {
    // just to simulate, since we really don't have this method yet
    return $bla;
    }
}

class Uint8Magic {

    // here you can list what methods you will wrap to return Uint8Array
    private $methodsWrapper = [
    'file_get_contents'
    ];
    private $v8;

    public function __construct($v8) {
        $this->v8 = $v8;
    foreach ($this->methodsWrapper as $method) {
        $v8->{$method} = $this->{$method};
    }
    }

    public function __get($name) {
    return fn(...$args) => $this->v8->makeUint8Array($name(...$args));
    }
}

$v8 = new V8Env;
$uint = new Uint8Magic($v8);
$v8->executeString('var_dump(PHP.file_get_contents("binFile.bin"));');
redbullmarky commented 1 year ago

Definitely a viable solution! Like I say, as long as there is a mechanism to read, manipulate and write binary contents across the PHP/JS boundary, all is good! Much of the inner workings can easily be abstracted away from the user and just..work!

stesie commented 1 year ago

But I just didn't get how to use V8Object for this?

I'm not sure if it's possible. It's just an idea I didn't want to rule out ...

Before thinking of how to pass Uint8Array (and others) from PHP to JS we (maybe?) should have a look at the JS -> PHP way first. I haven't yet found the time to investigate why it immediately crashes if you return a JS-side created Uint8Array to PHP.

It feels a bit strange to me, I would have expected that php-v8js just wraps it as a \V8Object and you can happily pass it around in PHP and return it back to JS. Even so the content may well be inaccessible.

So first thing I'd strive for is keeping it from crashing and passing an instance from JS to PHP and back (without PHP being able to access the content).

Then we could consider how PHP may get access to those TypedArray instances. Likely you can then already invoke TypedArray::at from PHP and access single elements, ... but you cannot get the whole buffer as a string. And in PHP it'd be "natural" to use a (binary blob) string. So I'm not (yet) sure if it should just provide access to the "backing store" (read ArrayBuffer), so no matter if it's an Uint8Array or a Int16Array, ... just may just get the data as a "binary blob".

To provide such a mechanism we might have a new V8ArrayBuffer (extending V8Object) that has a getData method. Maybe we'd want to provide __toString, however I'm unsure regarding that since JS' TypedArray already implements toString (however differently).

If the above works we can think of V8Js::makeUint8Array or V8Js::makeArrayBuffer or whatever ... In terms of (pseudo) c++ code I was thinking of

zval *return_value

v8::Local<ArrayBuffer> buf = V8::ArrayBuffer::New(ctx->isolate, ZSTR_PTR(bla), ZSTR_LEN(bla));
v8js_v8object_create(return_value, jsValue, flags, isolate);

So if you

$v8 = new V8Js();

$blob = file_get_contents('blob.bin');
$buf = $v8->makeArrayBuffer( $blob );

... you'll have a V8Object instance that has the newly created ArrayBuffer "in it". And you can then export that to JS-side as usual (either by providing it to a function (as an argument) or exporting it on the PHP object itself, i.e. $v8->buf = $buf

So JS side can then create a TypedArray view on it

$v8->executeString('
  const u8 = new Uint8Array( PHP.buf );
  const result = u8.find/slice/replace( ... ) // do whatever
  result; // pass back result, possible a UInt8Array (or an ArrayBuffer, or ...)
');

The more I think of it I feel like leaving all TypedArray instances opaque to PHP, i.e. pass them around with V8Object but not providing extra features (beside being able to invoke stuff from TypedArray.prototype). Instead handle the underlying ArrayBuffer "specially" in that sense, that we provide means to convert zend_string to ArrayBuffer and back. This way we could read a file with PHP, offer the resulting zend_string as an ArrayBuffer instance to JS. JS code can decide if it's binary data (likely use a UInt8Array as view) or whatever special representation that best fits with (UInt16Array or maybe Int32Array)

Hope that makes my thoughts a little clearer. Feel free to ask further questions if anything remains unclear. Or come up with different approaches, code, ... :-)

Upcoming week will be pretty busy. I won't be able to code anything before next weekend. But likely will find some time to possibly discuss this further

redbullmarky commented 1 year ago

...I haven't yet found the time to investigate why it immediately crashes if you return a JS-side created Uint8Array to PHP.

It feels a bit strange to me, I would have expected that php-v8js just wraps it as a \V8Object and you can happily pass it around in PHP and return it back to JS. Even so the content may well be inaccessible.

So first thing I'd strive for is keeping it from crashing and passing an instance from JS to PHP and back (without PHP being able to access the content).

It seems this is where the segfault occurs, though at this point I don't have many other details:

https://github.com/phpv8/v8js/blob/php8/v8js_object_export.cc#L149

Z_ADDREF_P(&fci.params[i]);
chrisbckr commented 1 year ago

Yes, just tested with the same code that @stesie tested.

$v8 = new V8Js();
echo "1" . PHP_EOL;
$a = $v8->executeString(' (new Uint8Array([65,66,67])); ');
var_dump($a);

In this case it uses it calls v8js_to_zval in v8js_convert.cc but it has basically the same code that you found. Ant the problem is at this validation: (self->InternalFieldCount() == 2) First run (yes, I just inserted a lot of php_printf hahahaha):

$ ./run-utf.sh uint8.php
1
IsObject!
InternalFieldCount = 2
InternalFieldCount setp 1
InternalFieldCount setp 2
InternalFieldCount setp 3
InternalFieldCount setp 4
Falha de segmentação

Then I changed the validation to == 3 and run again...

$ ./run-utf.sh uint8.php
1
IsObject!
InternalFieldCount = 2
Something else....
v8js_object_create....
success....
object(V8Object)#2 (3) {
  ["0"]=>
  int(65)
  ["1"]=>
  int(66)
  ["2"]=>
  int(67)
}

old school debugging... remembering when I was 11 using MS-DOS 6.22 and playing with QuickBasic haha Simple but (kind of) effective.

stesie commented 1 year ago

What the heck!?

#ifndef V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT
// The number of required internal fields can be defined by embedder.
#define V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT 2
#endif
chrisbckr commented 1 year ago

What the heck!?

#ifndef V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT
// The number of required internal fields can be defined by embedder.
#define V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT 2
#endif

https://www.mail-archive.com/v8-users@googlegroups.com/msg07066.html 😬

stesie commented 1 year ago

So for the moment it should suffice if we add && !jsValue->IsArrayBufferView() && !jsValue->IsArrayBuffer() checks.

And then let's see how to the actual ArrayBuffer <-> PHP integration