nesk commented 6 years ago

The socket layer was written with little knowledge about socket management, it would highly benefit from a rewrite. First, those issues must be fixed:

[ ] Payloads sent through sockets are manually splitted in multiple chunks. This was done because the socket_read() method requires to specify the maximum number of bytes to read. For an unknown reason, I chose to implement a system based on chunks instead of calling socket_read() multiple times (until it returns an empty string).
[ ] The payload is encoded in base64 before being sent through the socket, and is decoded right after. This was a naive solution to handle non-ascii characters in the payloads and shouldn't be left as is because it can increase the size of the payload (depending on the data stored in it).
[ ] When multiple chunks are being received, a delay is applied before reading the next chunk. This was added because, on slow environments, the PHP process could try to read the socket stream before the Node process has time to send the next chunk.

Also, the following features could be added for extra security:

[ ] In PHP, we could use streams to store the data retrieved from the socket (to avoid exhausted memory in case of heavy payloads, like base64 images). However, it might not be useful since decoding the JSON payload doesn't handle streams, unless we use a package for this (and this might not be a good idea).
[ ] Check the integrity of the payload by using a cyclic redundancy check (use the hash functions in PHP in PHP, and the crc package for Node).

jonnywilliamson commented 6 years ago

Just some quick random thoughts at the moment as I haven't had time to delve into this deeply.

There are a lot of socket functions that all seem very similar on first glance but are subtlety different.

socket_read
socket_recv
socket_recvfrom
socket_recvmsg

Are some that spring to mind of the top of my head. Also I know that there is a lot of variables at work between detecting the end of the data sent/read by the socket and if the socket has actually closed (http://php.net/manual/en/function.socket-read.php#121426 and others). Does the clue/php-socket-raw library deal with this easily for you?

Reading through your thoughts on the data being sent to streams and being able to decode large amounts of it, I can see what you mean.

Am I right in saying thought, at the moment there's no protection in a large amount of data being sent from node to rialto, and then it trying to json_decode it....so this problem still exists right now too.

I was wondering if there was any sensible way in which you could pass in a small payload of text only, that could easily be json_decoded, but it served as only a manifest/index to describe further binary streams that might be received.

Let me explain more.

First data payload is just describing what is going to be received:

{
  "type": screenshot,
  "size": "12345678",
  "stream": {
    "id": "screenshot1",
  },
}

This is small and easily decoded/encoded with no worry of memory issues. Then the next data load is a binary dump that's actually put into a stream, hence can be as large as it wants to be. This wouldn't need to be "decoded" and hence bypass memory issues.

You would have to decide if it was acceptable to pass a raw stream to a user etc, but the underlying ability to deal with it would have been handled. To be honest I haven't fully thought this idea through but just throwing some ideas out there. Shoot down as necessary.

Finally, I did stumble on a few other libraries that might be of use:

https://github.com/pcrov/JsonReader (https://github.com/pcrov/JsonReader/wiki explaining between push and pull parsers)

Also Guzzle PSR-7 has a large section on streams which I have dabbled in before too. Maybe of use? https://github.com/guzzle/psr7#stream-implementation

nesk commented 6 years ago

I was actually thinking about dropping clue/php-socket-raw, I don't think it does that much for Rialto and it prevents from really understanding how to use sockets in PHP.

Your idea for the 2-steps payload (meta first, data next) is interesting. However, I don't think it could be used that way. Rialto is implementation-agnostic, this means the screenshot type makes sense only to the PuPHPeteer implementation, but not to Rialto.

Instead of thinking about screenshots, we could think about strings, since PuPHPeteer screenshots are base64 images and, therefore, strings. This means we would initiate a 2-steps payload for all strings. However, this 2-steps payload would be useful only for top-level strings, but we could perfectly have heavy strings in a JSON object, and then the 2-steps payload will not be used (because it would be difficult to identify).

Even if Rialto could recognize and use a screenshot type, the issue would be the same: the screenshot({encoding: 'base64'}) method should return a string, not a stream, the memory exhaustion would occur anyway by converting the stream to a string. I don't want to return a type different from the one specified in Puppeteer's documentation.

Actually, the memory exhaustion doesn't seem to be an issue for anyone today, and it is possible to bypass the issue by saving the screenshot to a file and read it later in PHP. I think we can focus on improving the socket layer before thinking about streams.

nesk commented 4 years ago

Wild idea: maybe sockets are just something really boring and hard to use and we can just replace them with an HTTP server for Node and an HTTP client for PHP. Communication will be way easier and reliable and I don't think we will lose performance.

yolo

cc @spekulatius

spekulatius commented 4 years ago

My experience with sockets much much smaller than with HTTP. So I guess I got a natural bias. It would open new doors for different forms of interfaces too.

I wonder if people are okay with an application spinning up a HTTP server within itself (even if restricted)? I could imagine some systems being locked down to open up ports too.

spekulatius commented 4 years ago

Note: A test has been commented out until this is resolved. See https://github.com/rialto-php/puphpeteer/pull/112

rialto-php / rialto

Improve the socket layer #12

yolo