nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.74k stars 29.12k forks source link

buffer.indexOf is incorrect in utf16le encoding for odd byteOffset #26448

Open hkjackey opened 5 years ago

hkjackey commented 5 years ago

Version: v11.10.1 Platform: Windows 10 (64-bits) Subsystem: buffer File Encoding: UTF8

Please consider the following 2 lines of code: let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le'); console.log(buf.indexOf('\u6881', 1, 'utf16le'));

In this example, the expected correct output should be 4. However, the result is 0.

abhinavsagar commented 5 years ago

It would be much better if you showed exactly what the error log displays.

mkopa commented 5 years ago

@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2. Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2. If you put offset equal 1 you have 1 div 2 mul 2 == 0.

addaleax commented 5 years ago

Currently, indexOf with UTF16-LE is odd anyway in another respect as well, namely that we only search for the needle at even indices:

> Buffer.from('00aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
-1
> Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
2

I’m not sure if this is “correct”, because some people might expect this?

/cc @nodejs/buffer

RReverser commented 5 years ago

I think it's correct as we wouldn't really want to break codepoint boundaries (although need to recheck if same holds for UTF-8 or any other encoding).

seishun commented 5 years ago

You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

I suggest closing as expected behavior.

RReverser commented 5 years ago

You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

No, I do mean a codepoint (1-4 bytes in case of UTF-8).

seishun commented 5 years ago

In UTF-16 a codepoint can be represented by one or two 16-bit code units. I don't think Buffer.indexOf has any checks for that.

RReverser commented 5 years ago

I don't think Buffer.indexOf has any checks for that.

Hmm that's another good question (although personally I'd prefer it to check if it doesn't).

hkjackey commented 5 years ago

@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2. Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2. If you put offset equal 1 you have 1 div 2 mul 2 == 0.

@mkopa Thanks for the reply.

I agree that utf16le encoding is always two bytes, but buffer is not. "buffer" can be intended to store a mix of many things. For example, the beginning 7 bytes of a buffer is used for some other things, then starting from the 7th byte is used to store utf16le string. In this case, an odd byteOffset is needed.

I regarded the current behavior as a bug because according to the NodeJS specification: https://nodejs.org/api/buffer.html#buffer_buf_indexof_value_byteoffset_encoding it does not mention that odd byteOffset is not allowed.

I would accept the current behavior not a bug if the NodeJS specification mentioned that.

hkjackey commented 5 years ago

Hi everybody! Below I want to persuade that returning -1 for odd byteOffset in utf16le can be really weird.

Please consider the following 5 lines of code: let buf = Buffer.alloc(7); let str = '\u6881\u6882\u6881'; console.log(buf.write(str, 1, 'utf16le')); //6 (successfully written) console.log(buf); //<Buffer 00 81 68 82 68 81 68> console.log(buf.indexOf(str, 1, 'utf16le')); //-1 (NOT found)

I write a string at odd byteOffset 1 but finding it using indexOf at the same position give me -1 which means "NOT found". If anyone still regard the above behavior valid, then I have nothing to say.

seishun commented 5 years ago

That might be a breaking change - if the UTF-16 data starts at an even offset, indexOf might find a code unit composed of an odd byte and an even byte that happen to match the input value. For instance, this test breaks.

However, the behavior in the issue description is clearly a bug - the offset should be rounded up to an even value, which is a simple fix. Thoughts?

addaleax commented 5 years ago

the offset should be rounded up to an even value, which is a simple fix. Thoughts?

I would disagree – I think buffer.indexOf(str, offset, enc) should work the same way that buffer.slice(offset).indexOf(str, enc) works (with an adjusted return value).

addaleax commented 5 years ago

@seishun Would that work, just using .slice() if the offset is odd and then adjusting the return value? That way you don’t run into that issue you described with an odd byte and an even byte that happen to match, right?

seishun commented 5 years ago

If the offset is odd, buffer.slice(offset).indexOf(str, enc) will only consider byte pairs that are odd and even in the original buffer. For example, Buffer.from('0000aaaa', 'hex').slice(1).indexOf('\uaaaa', 'utf16le') currently fails, but Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 1, 'utf16le') succeeds.

addaleax commented 5 years ago

@seishun Yeah, I think that’s the behaviour that I would expect – by specifying an odd offset to begin with rather than an even one, I’m saying that I want to look at odd + even pairs rather than even + odd ones.

seishun commented 5 years ago

That seems unintuitive. Plus it wouldn't fix the original issue - indexOf would return -1 rather than the desired 4 or the current 0.

BridgeAR commented 5 years ago

@seishun it would AFAIK return 2, not -1? I agree with @addaleax and would also expect slice(1) to be used in this case.

hkjackey commented 5 years ago

Thank everybody for handling this issue!

seishun commented 5 years ago

@BridgeAR Have you tried it?

let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));
addaleax commented 5 years ago

@seishun Yeah, I think that’s (another) bug then. :/

seishun commented 5 years ago

@addaleax So what would be the expected behaviour?

addaleax commented 5 years ago

@seishun I would say in your example both methods should return -1. I think it shouldn’t matter whether a Buffer was created through .slice(), or allocated directly.

i.e., if buf is a Buffer, then buf.indexOf(...) and Buffer.from(buf).indexOf(...) should return the same results.

hkjackey commented 5 years ago

@BridgeAR Have you tried it?

let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));

@addaleax So what would be the expected behaviour?

I suggest returning 4 for the 1st case (without slice) and 3 for the 2nd case (with slice) after the fix.

seishun commented 5 years ago

So we have a conflict of opinions.

@addaleax (and @BridgeAR?) thinks indexOf with 'utf16le' should look only at odd-even byte pairs if the offset is odd, and only at even-odd pairs otherwise. @hkjackey thinks it should look at both odd-even and even-odd byte pairs regardless of the offset.

How should we proceed?

addaleax commented 5 years ago

I’m personally also okay with what @hkjackey’s suggestion – that’s most consistent with what .indexOf() does for other kinds of arguments.

seishun commented 5 years ago

That brings us back to my comment. If you think it's fine then I can just delete that test and my PR is practically ready.

hkjackey commented 5 years ago

@addaleax Thanks for your support! Also thank @seishun for fixing this bug!