Open theboredfish opened 4 years ago
V8JS_ZSTR uses v8::String::NewFromUtf8, so I think that is the problem. And explains why using mb_convert_encoding from ASCII to UTF-8 works. I think that we need to find a way to detect the value encoding and use NewFromOneByte instead. Maybe we can use mb_detect_encoding from PHP? But this approach will require mbstring module to be active.
Hmmm found something interesting... zend_string.h => ZSTR_IS_VALID_UTF8() I'll test this!
@chrisbckr This one is an old one but still a thorn in my side 😂 It's possibly a little down to my lack of knowledge around JS, too. Sounds like there's something in what you've found!! Very promising.
I'm just not sure whether it's my approach that's wrong, or whether there is something already suited to doing this.
Need more tests (and improvements), but... v8js_convert.cc -> zval_to_v8js
case IS_STRING:
value_str = Z_STR_P(value);
if (ZSTR_LEN(value_str) > std::numeric_limits<int>::max()) {
zend_throw_exception(php_ce_v8js_exception,
"String exceeds maximum string length", 0);
break;
}
zval fname, retval;
zval params[2];
ZVAL_STRING(&fname, "mb_check_encoding");
ZVAL_COPY_VALUE(¶ms[0], value);
ZVAL_STRING(¶ms[1], "UTF-8");
if ((SUCCESS == call_user_function(CG(function_table), NULL, &fname,
&retval, 2, params)) && (Z_TYPE(retval) == IS_TRUE)) {
jsValue = V8JS_ZSTR(value_str);
}
else {
jsValue = v8::String::NewFromOneByte(isolate, (unsigned char *)ZSTR_VAL(value_str), v8::NewStringType::kNormal, ZSTR_LEN(value_str)).ToLocalChecked();
}
break;
And at the end of Makefile.frag
# add mbstring extension, if available
ifneq (,$(realpath $(EXTENSION_DIR)/mbstring.so))
PHP_TEST_SHARED_EXTENSIONS+=-d extension=$(EXTENSION_DIR)/mbstring.so
endif
@stesie do you know any side effect of using NewFromOneByte instead of NewFromUtf8 in this case?
And maybe we can enable this checking with a flag or something like this.
And maybe we can enable this checking with a flag or something like this.
And maybe we can enable this checking with a flag or something like this.
It feels like this is probably the safest option if the PR is a go, but I’ll be quite happy to give this one a good test! (The OP is me from my other account)
i've been playing around with this test script, along with a PNG
<?php
class Foo extends \V8Js
{
private static $file = __DIR__ . '/test.png';
public function read(): string
{
return file_get_contents(self::$file);
}
public function write(string $content)
{
$oldContent = file_get_contents(self::$file);
echo ($oldContent === $content) ? "Files are the same!" : "Files are different :(";
echo "\n" . strlen($oldContent) . ' vs ' . strlen($content) . "\n"; // 6339 vs 9362 :(
echo mb_detect_encoding($content) . "\n"; // utf-8 :(
}
}
$v = new Foo();
$result = $v->executeString('
const content = PHP.read();
print(typeof content + "\n"); // outputs: string
print(content.length + "\n"); // same as what the actual filesize is :)
PHP.write(content);
');
Things work great on read() !!
Just struggling with the changes needed around v8js_to_zval
to be able to send the thing back. Encoding is detected as UTF-8, so I guess the same preservation treatment for these kind of things needs doing on the way BACK to PHP?
Good point. I will investigate this.
Test this @redbullmarky ! https://github.com/chrisbckr/v8js/tree/php8-binary_data
@chrisbckr ahh man, turns out what I was trying to do wasn’t far off ! Sure, will have a play in a little bit when I get a little time!
Do you know if it’s recursive? e.g. if I have a PHP array and perhaps one string is regular (like a filename) and the other is binary (like the content), does it iterate into the array calling these conversions?
Edit: Had a quick play, and it certainly does handle strings inside of arrays etc, too!
My crude test (I used force array flag to keep it as an array going both ways, else it's a V8Object):
<?php
class Foo extends \V8Js
{
private static $file = __DIR__ . '/test.png';
public function read(): string
{
echo "File: " . self::$file . "\n";
return file_get_contents(self::$file);
}
public function readArray(): array
{
return [
'file' => self::$file,
'content' => $this->read()
];
}
public function write(string $content)
{
$oldContent = file_get_contents(self::$file);
echo ($oldContent === $content) ? "Files are the same!" : "Files are different :(";
echo "\n" . strlen($oldContent) . ' vs ' . strlen($content) . "\n";
echo mb_detect_encoding($content) . "\n"; // utf-8 :(
}
public function writeArray($array)
{
$this->write($array['content']);
echo (strlen($array['file']) === strlen(self::$file) ? "YES!" : "NO") . "\n";
}
public function outLen(string $str)
{
echo "str is " . strlen($str) . "\n";
}
}
$v = new Foo();
$result = $v->executeString('
const content = PHP.read();
PHP.write(content);
const arr = PHP.readArray();
PHP.writeArray(arr);
PHP.outLen(arr.file);
PHP.outLen(arr.content);
', null, V8Js::FLAG_FORCE_ARRAY);
Just added a flag V8Js::FLAG_CHECK_BINARY_STRING to enable this feature(?) hehe
Awesome! One thing i picked up on testing (as mentioned above, in having to use the FORCE_ARRAY flag - and probably another topic/issue but more of a passing thought for now hehe) - is it possible that when an array (associative or otherwise) is passed FROM php to JS, it can also maintain its type when passed back? e.g. if i do this without force array flag:
class Foo extends V8Js
{
public function getArray()
{
return ['foo' => 'bar'];
}
public function setArray(array $array)
{
}
}
$v = new Foo();
$v->executeString('
const arr = PHP.getArray();
PHP.setArray(arr);
');
Then you get:
PHP Fatal error: Uncaught TypeError: Foo::setArray(): Argument #1 ($array) must be of type array, V8Object given.
I get why returning regular objects is useful, and it probably opens up a can of worms as to WHEN you might want a V8Object or an array (like, what if you constructed an "associative array" in JS, and passed it - how would you know whether to cast it to an array for PHP, or keep it as a V8Object? etc)
Otherwise - I love that this one could potentially solve a lot of headaches here! Awesome stuff :)
Hej, sorry for the kinda late reply, since you've already started coding. While I like the straight-forward solution and agree that it solves the raised issue, ... I still wonder if it's the right way to do it or the idiomatic solution. If I get the current implementation correctly it still wouldn't (easily) be possible to read an UTF-16 file (string data) and convert it to a JS string. Since the implementation falls back to single byte strings (and you can't make it invoke the fromUtf16 method).
To generalize further ...
In plain PHP you usually handle binary data in strings, since PHP doesn't care about the encoding at all. Which however has its limitations, since e.g. \strrev
or \strlen
just don't care about encoding and possibly lead to wrong/unexpected results. (and hence the mbstring extension that has function variants that handle encodings (most commonly utf-8 nowadays) well.
Contrary JavaScript land: strings "usually" are used for "real strings" only. Binary data is idiomatically handled via Buffer
(node) or ArrayBuffer
(ES/language itself).
So far php-v8js bridges PHP strings to JS strings just fine as long as you stick to the de facto "standard" of encoding strings with utf-8. And this has been fine for quite a while.
Hence from an idiomatic standpoint I think it'd be better that we'd allow to export a PHP string (i.e. zend_string) to V8 as a Uint8Array. We could even make it an external Uint8Array so data wouldn't have to be copied in memory from PHP to V8 (and likely vice versa) -- so as a benefit it'd also be suitable to pass around larger data sets efficiently.
And in contiunuation of the UTF-16 example above, if we'd provide the plain ArrayBuffer to JS land, a v8js user could simply do something like new TextDecoder('utf-16').decode(PHP.arrayBufferInstance)
to decode it to a JS string (as they see fit)
From PHP side this could look like this:
$v8 = new V8Js();
$binaryBlob = \file_get_contents('blob.bin');
$temp = new V8Uint8Array($binaryBlob); // this wraps the string
// (and ref-counts it, so it remains if user does "$binaryBlob = null")
// ... and it's also a marker for V8Js itself
$v8->arrayBufferInstance = $temp; // make ArrayBuffer available to JS land (`zval_to_v8js` detects it)
... that said, I've just tried passing back a Uint8Array from JS to PHP and it crashes (on my MacBook, haven't tried with Linux):
php > $v8 = new V8Js();
php > $a = $v8->executeString(' (new Uint8Array([65,66,67])); ');
Segmentation fault: 11
Aside from that, and contrary to the new V8Uint8Array
approach outlined above (which would involve providing more specialized V8Js/PHP classes) ... and for sake of consistency would involve converting passed-back JS Uint8Array
to the php-v8js specific variant ... it might be easier to just go with plain old \V8Object
that's around for a long time.
We could offer a V8Js::makeUint8Array(string $blob)
method, that may be used to wrap any string as a JS-side Uint8Array
and track it via \V8Object
on PHP side. So you could do something like this:
$v8 = new V8Js();
$binaryBlob = \file_get_contents('blob.bin');
$v8->arrayBufferInstance = $v8->makeUint8Array($binaryBlob);
$v8->executeString('var_dump(PHP.arrayBufferInstance instanceof Uint8Array)'); // true
Let me know what you think of this alternative approach. Looking forward to your replies.
... that said, I've just tried passing back a Uint8Array from JS to PHP and it crashes (on my MacBook, haven't tried with Linux):
I had the same when playing with ArrayObject, Uint8Array, etc on Linux. Passing them to PHP segfaults.
Anyways - if the coded approach isn't "JS-like" then any solutions would be welcome! I the thing I'd need to work out would be how to handle a PHP V8Js extension function potentially returning different things with different prototypes. e.g. if I had a function called readFile
on the V8Js instance, then:
In the solution @chrisbckr produced, it was possible to get a binary (or other) string, do whatever with it, and pass the binary strings back to PHP completely in-tact to a function typehinted as a string. I don't know what the difference would be from his NewFromOneByte
approach and a Uint8Array
would be. Perhaps they're just very different things and as such both could be supported, even if via a V8Js function rather than automatically? This is where my knowledge lacks a fair bit 😂
I think though, if it getting the length
(in bytes/chars) worked the same for both, and both were able to be passed to a V8Js typehinted function that accepted strings, e.g.
$v8js->writeFile = function (string $id, string $content) {
// write the file to disk
}
Then perhaps 99% of our issues are solved. The current workaround I have is to return a PHP object that has a toString() method, and implements most of the common prototype functions of JS String object via call() and __get() - so it somewhat behaves identically to a JS string as far as the JS user is concerned. Any operations performed on it (substr, length, startsWith, etc etc) are all done via these PHP methods and therefore give me the results I need as they're performed directly in PHP with no conversion. And any PHP functions i call from JS and pass this "string" to as a parameter will automatically cast it to a real string. It works well, but it's certainly not ideal and has its drawbacks...
@stesie I agree. My proposed solution keeps what PHP do with binary strings: it's not reliable and can (and probably will) give headaches some time in the future. Just need mb_check_encoding
return a false positive or false negative.
Indeed using Uint8Array will be the correct way to guarantee that the binary data will be 100% accurate all the time. And the control (and responsability) will be with the user, if something goes wrong, he/she knows where the problem is and how to fix. And having Uint8Array as a tools, this will be easy and reliable.
Personally I don't care to have a little more lines of code and assure that my data will not corrupt.
But I just didn't get how to use V8Object for this? Will create a property to set that it has a Uint8Array data? And maybe use this in the future for other cases? Or a new class V8Uint8Array that will extend V8Object? Because if we need to pass a Uint8Array to PHP we can use V8UintArray or something like V8Object::isUint8Array() (and maybe __tostring() to retrieve the binary on PHP side).
And in fact PHP has the flag IS_STR_VALID_UTF8
but... I didn't found any string that has this flag set hahaha
It's basically useless 😬
@redbullmarky If V8Js had a "makeUint8Array" you could create a wrapper (maybe with __get) to return Uint8Array for some functions like
class V8Env extends \V8Js {
public function getUint8Array($fn, ...$args) {
return $this->makeUint8Array($fn(...$args));
}
}
$v8 = new V8Env;
$v8->file_get_contents = fn(...$args) => $v8->getUint8Array('file_get_contents', ...$args);
$v8->executeString('var_dump(PHP.file_get_contents("binaryFile.bin"));');
or
class V8Env extends \V8Js {
public function makeUint8Array($bla) {
// just to simulate, since we really don't have this method yet
return $bla;
}
}
class Uint8Magic {
// here you can list what methods you will wrap to return Uint8Array
private $methodsWrapper = [
'file_get_contents'
];
private $v8;
public function __construct($v8) {
$this->v8 = $v8;
foreach ($this->methodsWrapper as $method) {
$v8->{$method} = $this->{$method};
}
}
public function __get($name) {
return fn(...$args) => $this->v8->makeUint8Array($name(...$args));
}
}
$v8 = new V8Env;
$uint = new Uint8Magic($v8);
$v8->executeString('var_dump(PHP.file_get_contents("binFile.bin"));');
Definitely a viable solution! Like I say, as long as there is a mechanism to read, manipulate and write binary contents across the PHP/JS boundary, all is good! Much of the inner workings can easily be abstracted away from the user and just..work!
But I just didn't get how to use V8Object for this?
I'm not sure if it's possible. It's just an idea I didn't want to rule out ...
Before thinking of how to pass Uint8Array (and others) from PHP to JS we (maybe?) should have a look at the JS -> PHP way first. I haven't yet found the time to investigate why it immediately crashes if you return a JS-side created Uint8Array
to PHP.
It feels a bit strange to me, I would have expected that php-v8js just wraps it as a \V8Object
and you can happily pass it around in PHP and return it back to JS. Even so the content may well be inaccessible.
So first thing I'd strive for is keeping it from crashing and passing an instance from JS to PHP and back (without PHP being able to access the content).
Then we could consider how PHP may get access to those TypedArray instances. Likely you can then already invoke TypedArray::at
from PHP and access single elements, ... but you cannot get the whole buffer as a string. And in PHP it'd be "natural" to use a (binary blob) string. So I'm not (yet) sure if it should just provide access to the "backing store" (read ArrayBuffer), so no matter if it's an Uint8Array or a Int16Array, ... just may just get the data as a "binary blob".
To provide such a mechanism we might have a new V8ArrayBuffer (extending V8Object) that has a getData
method. Maybe we'd want to provide __toString
, however I'm unsure regarding that since JS' TypedArray already implements toString
(however differently).
If the above works we can think of V8Js::makeUint8Array
or V8Js::makeArrayBuffer
or whatever ...
In terms of (pseudo) c++ code I was thinking of
zval *return_value
v8::Local<ArrayBuffer> buf = V8::ArrayBuffer::New(ctx->isolate, ZSTR_PTR(bla), ZSTR_LEN(bla));
v8js_v8object_create(return_value, jsValue, flags, isolate);
So if you
$v8 = new V8Js();
$blob = file_get_contents('blob.bin');
$buf = $v8->makeArrayBuffer( $blob );
... you'll have a V8Object
instance that has the newly created ArrayBuffer "in it".
And you can then export that to JS-side as usual (either by providing it to a function (as an argument) or exporting it on the PHP object itself, i.e. $v8->buf = $buf
So JS side can then create a TypedArray view on it
$v8->executeString('
const u8 = new Uint8Array( PHP.buf );
const result = u8.find/slice/replace( ... ) // do whatever
result; // pass back result, possible a UInt8Array (or an ArrayBuffer, or ...)
');
The more I think of it I feel like leaving all TypedArray instances opaque to PHP, i.e. pass them around with V8Object
but not providing extra features (beside being able to invoke stuff from TypedArray.prototype).
Instead handle the underlying ArrayBuffer
"specially" in that sense, that we provide means to convert zend_string to ArrayBuffer and back. This way we could read a file with PHP, offer the resulting zend_string as an ArrayBuffer instance to JS. JS code can decide if it's binary data (likely use a UInt8Array
as view) or whatever special representation that best fits with (UInt16Array
or maybe Int32Array
)
Hope that makes my thoughts a little clearer. Feel free to ask further questions if anything remains unclear. Or come up with different approaches, code, ... :-)
Upcoming week will be pretty busy. I won't be able to code anything before next weekend. But likely will find some time to possibly discuss this further
...I haven't yet found the time to investigate why it immediately crashes if you return a JS-side created
Uint8Array
to PHP.It feels a bit strange to me, I would have expected that php-v8js just wraps it as a
\V8Object
and you can happily pass it around in PHP and return it back to JS. Even so the content may well be inaccessible.So first thing I'd strive for is keeping it from crashing and passing an instance from JS to PHP and back (without PHP being able to access the content).
It seems this is where the segfault occurs, though at this point I don't have many other details:
https://github.com/phpv8/v8js/blob/php8/v8js_object_export.cc#L149
Z_ADDREF_P(&fci.params[i]);
Yes, just tested with the same code that @stesie tested.
$v8 = new V8Js();
echo "1" . PHP_EOL;
$a = $v8->executeString(' (new Uint8Array([65,66,67])); ');
var_dump($a);
In this case it uses it calls v8js_to_zval
in v8js_convert.cc
but it has basically the same code that you found.
Ant the problem is at this validation: (self->InternalFieldCount() == 2)
First run (yes, I just inserted a lot of php_printf hahahaha):
$ ./run-utf.sh uint8.php
1
IsObject!
InternalFieldCount = 2
InternalFieldCount setp 1
InternalFieldCount setp 2
InternalFieldCount setp 3
InternalFieldCount setp 4
Falha de segmentação
Then I changed the validation to == 3
and run again...
$ ./run-utf.sh uint8.php
1
IsObject!
InternalFieldCount = 2
Something else....
v8js_object_create....
success....
object(V8Object)#2 (3) {
["0"]=>
int(65)
["1"]=>
int(66)
["2"]=>
int(67)
}
old school debugging... remembering when I was 11 using MS-DOS 6.22 and playing with QuickBasic haha Simple but (kind of) effective.
What the heck!?
#ifndef V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT
// The number of required internal fields can be defined by embedder.
#define V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT 2
#endif
What the heck!?
#ifndef V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT // The number of required internal fields can be defined by embedder. #define V8_ARRAY_BUFFER_VIEW_INTERNAL_FIELD_COUNT 2 #endif
https://www.mail-archive.com/v8-users@googlegroups.com/msg07066.html 😬
So for the moment it should suffice if we add && !jsValue->IsArrayBufferView() && !jsValue->IsArrayBuffer()
checks.
And then let's see how to the actual ArrayBuffer <-> PHP integration
Hey
Not sure whether this is a bug or a hole in knowledge...
I have a PHP function invoked from V8Js which returns raw binary data. The idea being that a user writing JS scripts in their sandbox can use this data with another V8Js-invoked PHP function to send it to an API.
The issue is the file appears to undergo a bit of conversion and i'm not sure of the best one-size-fits-all approach that'll work whether the file is binary or not. Ideally, no conversion at all would be best :)
We'll call the function called from JS
PHP.get_file()
for simplicity and assume that it does a basic file_get_contents() in PHP, returned directly.The only thing that worked in this case after randomly trying stuff was, instead of returning the value directly, doing a quick conversion first:
which seems a little odd, given that PHP tells me that the result of the mb_convert_encoding is 46013 but the variable in JS it's directly assigned to now reports it as 32404....the correct number. mb_detect_encoding at this stage gives me nothing, so i hard-coded "ASCII" for testing.
Until i pass the binary data back into another PHP function from my JS. Now it's back to 46013 and a quick mb_convert_encoding the other way (ASCII from UTF-8) gives me back my original 32404 file. mb_detect_encoding before that gives me UTF-8
Encodings give me headaches on good days, so apologies if that was a long-winded way to explain a simple thing. I'm hoping that there is a way this can be resolved WITHOUT the JS code writer having to change their stuff.
Is this a bug? Or something that's easily solvable? Hoping that there's: