Use Nan::Utf8String over String::Utf8Value

springmeyer commented 7 years ago

A common way to get data from a v8::String to pass to C++ is:

std::string str(*v8::String::Utf8Value(info[0].As<String>());

While concise (by v8/c++ terms) this can be improved. Instead of v8::String::Utf8Value we can use Nan::Utf8String and we should validate the input before trying to create a std::string by:

ensuring the info[0] actually is a string
ensuring that if it is a string, then the length is > 0 or the string data is null (more on this below)

The Nan::Utf8String api both validates strings internally and is fast. It has an optimized conversion routine similar to node core. And it works across all recent node versions.

So we should convert all usage of String::Utf8Value to Nan::Utf8String. Like the String::Utf8Value interface, the Nan::Utf8String API exposes is the ability to check the length of the string before accessing in the value: this is a very good idea for safety and robustness. We don't want to risk passing a null string around, which might create unpredictable behavior or a crash like in https://github.com/mapbox/node-fontnik/issues/124.

To keep 100% test coverage, after adding lines that check the line, we will need to add a test that tries to pass an empty string, and will need to assert reasonable behavior in this case. So, let's move to code like this:

  Nan::Utf8String utf8_value(info[0]);
  int len = utf8_value.length();
  if (len <= 0) {
     return Nan::ThrowTypeError("arg must be a non-empty string");
  }
  /// work with string data here....
  // e.g. convert to a std::string
  std::string string_copy(*utf8_value, len);

An additional (albeit minor unless the string is massive) performance win of getting the len of the string is that we can pass it to the std::string constructor. Doing std::string string_copy(*utf8_value); would work, but providing the length allows the std::string constructor to avoid calculating the length internally and should be faster since it skips an operation. The std::string class has a bewildering number of different constructors. This one we are using that accepts a const char * as the first arg and an int as the second argument is the (4) from http://en.cppreference.com/w/cpp/string/basic_string/basic_string.

/cc @GretaCB

GretaCB commented 7 years ago

Awesome, thanks for this writeup @springmeyer .

Qvestion

Currently the skel does this...

std::string name = *v8::String::Utf8Value(info[0]->ToString());

Does the implementation in your comment above differ in any way re: allocation, since it seems as if Nan::Utf8String utf8_value(info[0]); is allocating on the stack? But I suppose this is trivial since it's using a pointer to pass into the C++ string instantiation?

~~Also curious where exactly allocation is happening. At std::string name = ?~~

My real question is...is Nan::Utf8String utf8_value(info[0]); allocating memory?

GretaCB commented 7 years ago

Also trying to figure out why ToString() is no longer necessary in your example. Because of Nan?

springmeyer commented 7 years ago

Currently the skel does this...

Sorry, made a copy/paste error. You are right - I've fixed the post above to note what we are currently doing and should avoid.

Also trying to figure out why ToString() is no longer necessary in your example. Because of Nan?

Really insightful! That is exactly the kind of question that will help you be cautious and write secure and robust code. In this case Nan has all the smarts needed internally to validate, in a fast way, that the string is valid, see https://github.com/nodejs/nan/blob/7291008d755f658343bcb61f456cb0a6744a6235/nan.h#L911. So if any of those fail then the length will be 0.

springmeyer commented 7 years ago

My real question is...is Nan::Utf8String utf8_value(info[0]); allocating memory?

GREAT question. You are thinking like a performance engineer!!!. So the answer is maybe. This is why NanUtf8String is so fast. Because it uses a kind of SSO (discussed at the bottom of https://github.com/mapbox/cpp/blob/master/glossary.md#instantiate). What that means is that it uses stack allocation for short strings and only falls back to dynamic allocation for strings longer than 1024 bytes. You can see this in action at https://github.com/nodejs/nan/blob/7291008d755f658343bcb61f456cb0a6744a6235/nan.h#L915-L918

GretaCB commented 7 years ago

Sorry, made a copy/paste error. You are right

Ah, wasnt saying you were wrong. It tied into my initial comment which was digging into the comparison between the old way and the Nan way (the strike-through above), but then I altered my qvestion.

The reason I asked about whether or not Nan::Utf8String utf8_value(info[0]); is allocating memory is because I was pondering if it was possible to bypass this allocation and just set std::string var to some pointer of info[0]...but I'm realizing now that's probably impossible since std::string can only take a value it understands, which is the v8/nan type. And we would want to check length anyway.

springmeyer commented 7 years ago

but then I altered my qvestion.

👍

The reason I asked about whether or not Nan::Utf8String utf8_value(info[0]); is allocating memory is because I was pondering if it was possible to bypass this allocation and just set std::string var to some pointer of info[0]...but I'm realizing now that's probably impossible since std::string can only take a value it understands, which is the v8/nan type. And we would want to check length anyway.

Right, std::string needs to own the data it contains. It cannot point to memory managed elsewhere. So I think the only way to create a std::string that bypass's allocation is to create a std::string by moving another std::string. That basically moves ownership of the memory from one variable to the next.

While we are thinking about bypassing allocation, another alternative would be to pass the pointer that references a V8 object directly into the threadpool (instead of converting to a std::string). This is the ideal. I've not tried this with Nan::Utf8String yet but it is commonly done with node::Buffer objects. Instead of converting the node::Buffer to a std::string or some other kind of C++ object to hold the bytes, we pass the buffer object directly to the threadpool and access its internal memory directly, without copies. This is done over at https://github.com/mapnik/node-mapnik/blob/9f0d98c685e87f7aef8df24fb1e3fc1112726f3b/src/mapnik_image.cpp#L3452-L3454. That 1) creates variables to point at the buffer data pointer (without copying anything but the pointer) and 2) marks the JS buffer object as persistent to keep it alive while working with its data in the threadpool.

GretaCB commented 7 years ago

Nan string PR merged at https://github.com/mapbox/node-cpp-skel/pull/55

@springmeyer Good to close this and carry future improvement convo per your last comment to separate ticket?

springmeyer commented 7 years ago

Good to close 👍 As far as last convo about moving to a node::Buffer over std::string, my feelins that would be a decent lift, so perhaps something to keep in mind, but not critical to ticket.

mapbox / node-cpp-skel

Use Nan::Utf8String over String::Utf8Value #54

Qvestion