nodejs / performance

Node.js team focusing on performance
MIT License

Buffer.byteLength #52

Open ronag opened 1 year ago

ronag commented 1 year ago

Could simdutf + V8 fast calls do anything for Buffer.byteLength, which can currently be a bottleneck, e.g. for HTTP servers writing strings to the response object?

ronag commented 1 year ago

This greatly affects the performance of e.g. ServerResponse.write(string).

lemire commented 1 year ago

So Buffer.byteLength(string, 'utf8'), which is what Buffer.byteLength(string) defaults to, seeks to return the number of bytes that the string would use if it were serialized as UTF-8.

This ends up calling String::Utf8Length in v8 (api.cc), credit to @anonrig for the analysis:

int String::Utf8Length(Isolate* v8_isolate) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  int utf8_length = 0;
  if (flat.IsOneByte()) {
    for (uint8_t c : flat.ToOneByteVector()) {
      utf8_length += c >> 7;
    }
    utf8_length += length;
  } else {
    int last_character = unibrow::Utf16::kNoPreviousCharacter;
    for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
    }
  }
  return utf8_length;
}

Let us focus on...

 int last_character = unibrow::Utf16::kNoPreviousCharacter;
 for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
 }

The ToUC16Vector() call returns a view over the whole of the data as a vector... and then the loop goes through it, character by character, with a tiny state machine, to compute the size. I think GetFlatContent does no further copies or allocations; it is just a safety routine.

So you should be able to replace...

 int last_character = unibrow::Utf16::kNoPreviousCharacter;
 for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
 }

with the accelerated version...

auto v = flat.ToUC16Vector();
utf8_length = simdutf::utf8_length_from_utf16(v.start(), v.length());
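For reference, here is a self-contained scalar sketch of the quantity both versions compute: the number of UTF-8 bytes needed to encode a UTF-16 sequence. This is my simplified model, not V8's unibrow code and not simdutf's vectorized kernel (the function name merely mimics simdutf's naming):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch: UTF-8 byte length of a UTF-16 code unit sequence.
// A valid surrogate pair encodes to 4 bytes; everything else follows
// the usual 1/2/3-byte UTF-8 thresholds.
size_t utf8_length_from_utf16(const uint16_t* data, size_t len) {
  size_t bytes = 0;
  for (size_t i = 0; i < len; i++) {
    uint16_t c = data[i];
    if (c < 0x80) {
      bytes += 1;  // ASCII
    } else if (c < 0x800) {
      bytes += 2;  // two-byte sequence
    } else if (c >= 0xD800 && c < 0xDC00 && i + 1 < len &&
               data[i + 1] >= 0xDC00 && data[i + 1] < 0xE000) {
      bytes += 4;  // valid surrogate pair: one 4-byte sequence
      i++;
    } else {
      bytes += 3;  // other BMP character (or lone surrogate)
    }
  }
  return bytes;
}
```

The per-character data dependency on the previous code unit (to detect pairs) is exactly what makes the scalar loop slow and what a SIMD routine sidesteps.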

In a future release of simdutf, we could just do...

int String::Utf8Length(Isolate* v8_isolate) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  if (flat.IsOneByte()) {
    auto v = flat.ToOneByteVector();
    return simdutf::utf8_length_from_latin(v.start(), v.length());
  }
  auto v = flat.ToUC16Vector();
  return simdutf::utf8_length_from_utf16(v.start(), v.length());
}

It looks easy enough, but the code I am modifying is in V8, not in Node.

If you don't want to modify v8, and I trust that you do not, then you need to intercept all of the str->Utf8Length(args.GetIsolate()) calls...

You have some fun code in Node, like this...

    const int32_t length = str->Utf8Length(args.GetIsolate()) + 1;
    char* buf = new char[length];
    str->WriteUtf8(args.GetIsolate(), buf, length);
    delete[] buf;

Of course, simdutf can do the WriteUtf8 faster too! Again, it is the same story: this calls into V8... and if you don't want to modify V8... well...
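To make the WriteUtf8 side concrete, here is a hedged scalar sketch of the UTF-16-to-UTF-8 transcoding step (a reference loop of my own, not V8's WriteUtf8 and not simdutf's SIMD converter); loops like this are what a vectorized converter replaces:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch: transcode UTF-16 code units to UTF-8, returning the
// number of bytes written. The caller must size `out` appropriately
// (e.g. via a length pass such as the one discussed above).
size_t encode_utf16_to_utf8(const uint16_t* in, size_t len, char* out) {
  char* p = out;
  for (size_t i = 0; i < len; i++) {
    uint32_t c = in[i];
    // Combine a valid surrogate pair into one code point.
    if (c >= 0xD800 && c < 0xDC00 && i + 1 < len &&
        in[i + 1] >= 0xDC00 && in[i + 1] < 0xE000) {
      c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      i++;
    }
    if (c < 0x80) {
      *p++ = static_cast<char>(c);
    } else if (c < 0x800) {
      *p++ = static_cast<char>(0xC0 | (c >> 6));
      *p++ = static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      // Note: a lone surrogate falls through here and is written as a
      // 3-byte sequence; V8 instead substitutes U+FFFD (also 3 bytes).
      *p++ = static_cast<char>(0xE0 | (c >> 12));
      *p++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      *p++ = static_cast<char>(0x80 | (c & 0x3F));
    } else {
      *p++ = static_cast<char>(0xF0 | (c >> 18));
      *p++ = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      *p++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      *p++ = static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return static_cast<size_t>(p - out);
}
```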

The question is therefore entirely a software engineering question. We know how to write the code, the simdutf library is already in node. So how does one do it?

TimothyGu commented 1 year ago

Has anyone explored integrating simdutf (or some equivalent library) into V8?

Also, you can effectively emulate Utf8Length through the V8 API by using some combination of v8::String::IsOneByte, v8::String::Write, and v8::String::WriteOneByte. However, Write/WriteOneByte does an extra memcpy whereas i::String::FlatContent is just a view, so I suppose the question is whether the benefit from simdutf will outweigh the memcpy.

ronag commented 1 year ago

I think doing anything in v8 will be difficult.

BridgeAR commented 1 year ago

We have had a very productive collaboration with the @nodejs/v8 team before. It is of course a bit different to code there, but I would favor trying that first.

As such, let's discuss the possibility of integrating simdutf into V8, as everyone would profit from that and it would make things easier for embedders such as Node.js.

ronag commented 1 year ago

@BridgeAR Sure! Who can champion that?

lemire commented 1 year ago

so I suppose the question is whether the benefit from simdutf will outweigh the memcpy.

A memcpy is cheap, but if you need to do extra memory allocations just to scan the data, it is a non-starter.

joyeecheung commented 1 year ago

I think the problem is that flattening the string (notice the String::Flatten() call before DisallowGarbageCollection in String::Utf8Length()) can, and is very likely to, incur a heap allocation on the V8 side. A very common case is Buffer.byteLength('hello' + 'world', 'utf8'): V8 needs to flatten the result of 'hello' + 'world', and that results in the allocation of a new string. If the string is not flattened, the C++ side simply cannot read it. To get around the problem we could try to copy all the code points into an array buffer in JS and pass that into the fast call, but I am not sure whether that would already kill any performance gain.
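A toy model of why this is so (my illustration, not V8's actual string representation): concatenation just creates a tree node in O(1), and producing the contiguous bytes that C++ consumers need requires building a new flat buffer.

```cpp
#include <memory>
#include <string>
#include <utility>

// Toy cons-string: either a flat leaf or a concatenation of two children.
struct ConsString {
  std::string flat;                        // used when children are null
  std::shared_ptr<ConsString> left, right;

  static std::shared_ptr<ConsString> Leaf(std::string s) {
    auto n = std::make_shared<ConsString>();
    n->flat = std::move(s);
    return n;
  }
  // O(1): no content is copied at concatenation time.
  static std::shared_ptr<ConsString> Concat(std::shared_ptr<ConsString> l,
                                            std::shared_ptr<ConsString> r) {
    auto n = std::make_shared<ConsString>();
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
  }
  bool IsFlat() const { return left == nullptr; }
  // Flattening allocates a fresh contiguous buffer -- this allocation is
  // what makes a GC-sensitive fast path for cons strings tricky.
  std::string Flatten() const {
    if (IsFlat()) return flat;
    return left->Flatten() + right->Flatten();
  }
};
```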

joyeecheung commented 1 year ago

Another way to solve it could be to add a new V8 API that returns whether the string is already flattened, and have the fast API call bail out if it gets a false. Then in JS land we can use some heuristics (like indexing into the string) to flatten it before invoking the fast call. If that heuristic fails, we would still fall back to the slow branch.

lemire commented 1 year ago

If the string is not flattened

Right. So the function above, already in V8 (Utf8Length), can cause a call to SlowGetFlatContent. But that can be remedied by writing the code differently in JavaScript land. If you can accelerate the flat case, you can later explain to your users how to benefit. It will always be possible to write JavaScript that gets suboptimal performance... the engine can only do so much for you.

To get around the problem we could try to copy all the code points to an array

Isn't that what SlowGetFlatContent does already (conceptually)?

ronag commented 1 year ago

Maybe a stupid question... but why flatten it at all? Why not just walk the string rope and sum the size of each part?

joyeecheung commented 1 year ago

Sorry, I should've been clearer: https://github.com/nodejs/performance/issues/52#issuecomment-1426003906 was about using V8 fast API calls for the method, not about switching from Utf8Length to simdutf (and that's why heap allocation matters).

joyeecheung commented 1 year ago

Maybe a stupid question... but why flatten it at all? Why not just walk the string rope and sum the size of each part?

This is more of a V8 question. My guess is that when someone calls Utf8Length() it's very likely that they are going to actually read the content of that string later (which means V8 would need to flatten it to write into a contiguous block of memory anyway), so it might as well flatten it now. Also, concatenated strings aren't the only case where a string needs flattening (for now there are also repeated strings and sliced strings that would strictly need some processing to turn into a flat string, and it's not certain that the list won't grow in the future). EDIT: not true, only cons strings strictly need flattening now, though that might still change in the future.

joyeecheung commented 1 year ago

And regarding replacing unibrow with simdutf, there is also a (pretty old) plan to switch unibrow to ICU altogether: https://bugs.chromium.org/p/v8/issues/detail?id=5500

For non-Chromium embedders like Node.js, though, the easier upstream request might be to just extend the String API a bit so that whatever is used to calculate the size of the result, or to do the actual conversion, becomes a parameter to a new API that we can customize. So, for example, this is more likely to be accepted:

enum class ContentType {
  kLatin1,
  kUTF16LE,
};
using ByteLengthCalculator = int (*)(ContentType, const void* start, size_t size);

int String::Utf8Length(Isolate* v8_isolate, ByteLengthCalculator calculator) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  if (flat.IsOneByte()) {
    auto one_byte_vector = flat.ToOneByteVector();
    return calculator(ContentType::kLatin1,
                      one_byte_vector.begin(), one_byte_vector.size());
  }
  auto two_byte_vector = flat.ToUC16Vector();
  return calculator(ContentType::kUTF16LE,
                    two_byte_vector.begin(), two_byte_vector.size());
}

String::WriteUtf8 can be done in a similar fashion. That avoids the problem of changing a dependency of V8, which would be more difficult (convincing V8 to add a new dependency that is only going to be used in a handful of embedder APIs is going to be a tough sell, while updating all the unibrow usage in V8 internals would also need more buy-in).
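As a sketch of the embedder side of that proposal (hypothetical code of my own; the scalar loops are stand-ins for the simdutf routines Node would actually plug in):

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the ContentType/ByteLengthCalculator shapes proposed above.
enum class ContentType { kLatin1, kUTF16LE };
using ByteLengthCalculator = int (*)(ContentType, const void*, size_t);

// Hypothetical calculator Node could pass to the extended Utf8Length API.
int NodeUtf8ByteLength(ContentType type, const void* start, size_t size) {
  int bytes = 0;
  if (type == ContentType::kLatin1) {
    const uint8_t* p = static_cast<const uint8_t*>(start);
    bytes = static_cast<int>(size);  // one UTF-8 byte per character...
    for (size_t i = 0; i < size; i++) {
      bytes += p[i] >> 7;            // ...plus one more for code points >= 0x80
    }
  } else {
    const uint16_t* p = static_cast<const uint16_t*>(start);
    for (size_t i = 0; i < size; i++) {
      uint16_t c = p[i];
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (c >= 0xD800 && c < 0xDC00 && i + 1 < size &&
                 p[i + 1] >= 0xDC00 && p[i + 1] < 0xE000) {
        bytes += 4;                  // valid surrogate pair
        i++;
      } else {
        bytes += 3;
      }
    }
  }
  return bytes;
}
```

With such an API, Node would call something like str->Utf8Length(isolate, NodeUtf8ByteLength) and swap the scalar loops for vectorized simdutf calls, without V8 itself taking on simdutf as a dependency.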

ronag commented 1 year ago

  if (flat.IsOneByte()) {
    auto one_byte_vector = flat.ToOneByteVector();
    return calculator(ContentType::kLatin1,
                      one_byte_vector.begin(), one_byte_vector.size());
  }

Is there any point with Latin-1 strings? Isn't it always just one_byte_vector.size()?

joyeecheung commented 1 year ago

The original implementation counts any character > 127 as 2 bytes; my guess is that's part of the one-byte string contract (which can store a character needing two UTF-8 bytes in a single uint8_t to save space). But we'd need a proper review to find out. I can submit a CL to test the water - this seems fairly straightforward, so I guess an issue isn't strictly necessary, though it might help us get a better design if we have some kind of doc explaining what's needed (e.g. if we also want to avoid flattening).

lemire commented 1 year ago

The original implementation counts any character > 127 as 2 bytes; my guess is that's part of the one-byte string contract (which can store a character needing two UTF-8 bytes in a single uint8_t to save space).

I think that one byte is always Latin-1 (ISO-8859-1). In that case, all character values greater than or equal to 128 map to two bytes in UTF-8. That's true also of windows-1252, ISO-8859-2 and ISO-8859-9 (and maybe others).

I think that all of the ISO-8859 encodings are extensions of ASCII, so values below 128 are just ASCII (one byte in UTF-8).

There are several one-byte European encodings, like ISO-8859-5, ISO-8859-7, ISO-8859-10 and ISO-8859-11, in which some characters translate into UTF-8 sequences of more than 2 bytes. Obviously, non-European one-byte encodings do that too.
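The Latin-1 rule above is easy to check against V8's one-byte branch: the UTF-8 length of a Latin-1 buffer is its length plus the number of bytes at or above 0x80. A scalar sketch of my own (the vectorized simdutf routine requested in this thread would compute the same quantity):

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of a utf8_length_from_latin1-style routine: Latin-1 code
// points below 0x80 take 1 UTF-8 byte, the rest take 2.
size_t utf8_length_from_latin1(const uint8_t* data, size_t len) {
  size_t bytes = len;           // at least one byte per character
  for (size_t i = 0; i < len; i++) {
    bytes += data[i] >> 7;      // one extra byte for code points >= 0x80
  }
  return bytes;
}
```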

lemire commented 1 year ago

For non-Chromium embedders like Node.js though the easier upstream request might be to just extend the String API a bit

Your proposal is more useful than just computing bytes... Once you have access to the underlying bytes, you can copy them, transcode them, filter them...

joyeecheung commented 1 year ago

Actually, there is another way to implement a fast path for sequential one-byte strings - just use the V8 fast API calls for these (the fast path can only be hit if the string is already a one-byte string that does not require flattening). My local benchmark shows a ~30% performance improvement for something like Buffer.byteLength('hello brendan!!!', 'utf8') (the input is from the existing file under benchmark/ 😅), though I think this case might be a bit too edge-y to have an impact on real-world usage.

mcollina commented 1 year ago

The fundamental issue is not in byteLength; it's in how V8 handles JS strings in memory. Strings are represented as a tree (so concatenation is fast), but counting the length or linearizing the string for further operations is costly.

Consider that we count the byte length of a string twice for HTTP responses (once for the headers, once to size the allocation before the actual libuv write).

If we didn't flatten the actual string, we would have to walk the tree 3 times.

Improving this was why we developed the flatstr module many years ago. The need for it went away once counting the byte length of a string started flattening it.

lemire commented 1 year ago

So no fast call possible when the string is non-ASCII?

lemire commented 1 year ago

once counting the byte length of a string started flattening it.

Right, so the flattening on access is deliberate.

joyeecheung commented 1 year ago

So no fast call possible when the string is non-ASCII?

It's concatenation that's causing the issue; the other kinds of strings that V8 currently has can be turned into a sequential one one way or another. A cons string needs to be flattened in some way before it can be exposed to the fast call, since it's unlikely that V8 would want to expose the internal structure for that.

joyeecheung commented 1 year ago

@lemire By the way, does simdutf implement anything that does what the hypothetical utf8_length_from_latin does in https://github.com/nodejs/performance/issues/52#issuecomment-1424858958 (or a conversion method for a similar purpose)? It might be nicer to use that in https://github.com/nodejs/node/pull/46616, although what it has now is also straightforward enough for Latin-1 anyway.

lemire commented 1 year ago

@joyeecheung We will be adding support for Latin-1, as it has been requested by various other users of the library: not only length computations but also transcoding. It is not present today. A SIMD-based utf8_length_from_latin could be coded (using the various ISAs we support) in about an hour, with most of the time spent on testing. Fast transcoding is less trivial. The AVX-512 version will be great fun.