hhvm / hsl-experimental

Experimental features for the Hack Standard Library
MIT License
23 stars 10 forks source link

[ Bug ] IO\BufferedReader::readUntilAsync() returns incorrect results when using a multibyte suffix under certain conditions #166

Closed lexidor closed 3 years ago

lexidor commented 3 years ago

Describe the bug IO\BufferedReader::readUntilAsync() scans for your suffix in chunks. If your suffix is larger than a byte and happens to align with the _IO\DEFAULT_READ_BUFFER_SIZE, the internal Str\contains() call returns false. This causes incorrect results.

Standalone code, or other way to reproduce the problem

This example uses a temp file instead of a MemoryHandle, because MemoryHandle uses Math\INT64_MAX as chunk size instead of the _IO\DEFAULT_READ_BUFFER_SIZE. A demonstration with a 8EiB string is unfeasible :smile:.

use namespace HH\Lib\{File, IO, Str};
<<__EntryPoint>>
async function main_with_hidden_autoloader(): Awaitable<void> {
  $buf_size = 8192;
  invariant(
    \HH\Lib\_Private\_IO\DEFAULT_READ_BUFFER_SIZE === $buf_size,
    'Default read buffer size has changed',
  );

  $minus3 = Str\repeat("\0", $buf_size - 3);
  using ($file = File\temporary_file()) {
    $handle = $file->getHandle();
    // When reading this string, the $chunk contains `ab<nul>...<nul>\r`
    // The second round around the while loop $chunk contains `\nsecond line`
    // The Str\contains() call does not find the `\r\n`, since it is split exactly between two chunks.
    // The third round around the loop reads an empty string into `buf`.
    // The ->readUntilAsync() call returns a Hack null value, even though the input contains `\r\n`.
    $contents = 'ab'.$minus3."\r\nsecond line";
    await $handle->writeAllAsync($contents);
    $handle->seek(0);
    $br = new IO\BufferedReader($file->getHandle());
    // NULL
    \var_dump(await $br->readUntilAsync("\r\n"));

    Str\replace(await $br->readAllAsync(), $minus3, '<nul>...<nul>')
      |> \addcslashes($$, "\r\n")
      // string(30) "ab<nul>...<nul>\r\nsecond line"
      |> \var_dump($$);
  }

  using ($file = File\temporary_file()) {
    $handle = $file->getHandle();
    // We read the leading <nul>s.
    // $this->buffer now contains `ab\r`
    // ->readUntilAsync() checks if `ab\r` contains `\r\n`, which it does not.
    // After that, we read `\nsecond line` into $chunk.
    // The Str\contains() call does not find the `\r\n`, since the first byte is in `$this->buffer` and not in `$chunk`.
    // The third round around the loop reads an empty string.
    // The ->readUntilAsync() call returns a Hack null value, even though the input contains `\r\n`.
    $contents = $minus3.'ab'."\r\nsecond line";
    await $handle->writeAllAsync($contents);
    $handle->seek(0);
    $br = new IO\BufferedReader($file->getHandle());
    await $br->readAsync($buf_size - 3);
    // NULL
    \var_dump(await $br->readUntilAsync("\r\n"));
    // string(17) "ab\r\nsecond line"
    await $br->readAllAsync() |> \addcslashes($$, "\r\n") |> \var_dump($$);
  }
}

Expected behavior

Calling ->readUntilAsync() should return "ab<nul>...<nul>" in the first using block and "ab" in the second.

Actual behavior

NULL
string(30) "ab<nul>...<nul>\r\nsecond line"
NULL
string(17) "ab\r\nsecond line"

Environment hsl-experimental version: v4.66.0 hhvm version: 4.85.0

Additional context

In order to solve this problem, we could accumulate into a buffer and use Str\search()'s third argument (offset) to prevent doing duplicate work.

This did get me thinking about multiple concurrent calls to the methods of IO\BufferedReader. They behave unpredictably since someone might start stealing your bytes while you are awaiting the filesystem. So you would get kbytes 0..8 and another caller would steal kbytes 9..16 from in between. You'd then append 17..24 to 0..8 and end up with missing bytes in your local buffer. How are we going to ensure that calling multiple async methods at once doesn't add more timing dependent races than were inherent in our $this->handle?

fredemmott commented 3 years ago

How are we going to ensure that calling multiple async methods at once doesn't add more timing dependent races than were inherent in our $this->handle?

I'm not convinced there's more but having multiple 'concurrent' callers interact with the same FD around other operations is unsafe, even if async IO ops aren't being used. If an FD needs to be shared between callers in the same process, some kind of application-level locking or queuing needs to be built. This needs to be application-level as there are often ordering constraints between multiple IO operations, e.g. when dealing with network protocols that do not support multiplexing, or pretty much involving seek

fredemmott commented 3 years ago

Failing unit test:

diff --git a/tests/io/BufferedReaderTest.php b/tests/io/BufferedReaderTest.php
index d08acc8..d1e0607 100644
--- a/tests/io/BufferedReaderTest.php
+++ b/tests/io/BufferedReaderTest.php
@@ -8,7 +8,8 @@
  *
  */

-use namespace HH\Lib\{IO, OS, Vec};
+use namespace HH\Lib\{IO, OS, Str, Vec};
+use namespace HH\Lib\_Private\_IO;

 use function Facebook\FBExpect\expect; // @oss-enable
 use type Facebook\HackTest\HackTest; // @oss-enable
@@ -107,6 +108,28 @@ final class BufferedReaderTest extends HackTest {
     expect(await $r->readUntilAsync("FOO"))->toEqual("cd");
   }

+  public async function testReadUntilBufferBoundary(): Awaitable<void> {
+    // Intent is to test the case when the separator starts in one chunk, and
+    // ends in another, i.e.:
+    // - Str\length($padding) < chunk size
+    // - Str\length($padding.$separator) > chunk size
+    $padding = Str\repeat('a', _IO\DEFAULT_READ_BUFFER_SIZE - 1);
+    $separator = 'bc';
+
+    list($r, $w) = IO\pipe();
+    concurrent {
+      await async {
+        await $w->writeAllAsync($padding.$separator.'junk');
+        $w->close();
+      };
+      await async {
+        $br = new IO\BufferedReader($r);
+        expect(await $br->readUntilAsync($separator))->toEqual($padding);
+        $r->close();
+      };
+    }
+  }
+
   public async function testReadLineVsReadUntil(): Awaitable<void> {
     $r = new IO\BufferedReader(new IO\MemoryHandle("ab\ncd"));
     expect(await $r->readLineAsync())->toEqual('ab');

If I replace the - 1 with -2, the test passes