Open hkjackey opened 5 years ago
It would be much better if you showed exactly what the error log displays.
@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2. Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2. If you put offset equal 1 you have 1 div 2 mul 2 == 0.
Currently, indexOf with UTF16-LE is odd anyway in another respect as well, namely that we only search for the needle at even indices:
> Buffer.from('00aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
-1
> Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
2
I’m not sure if this is “correct”, because some people might expect this?
/cc @nodejs/buffer
I think it's correct as we wouldn't really want to break codepoint boundaries (although need to recheck if same holds for UTF-8 or any other encoding).
You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.
I suggest closing as expected behavior.
You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.
No, I do mean a codepoint (1-4 bytes in case of UTF-8).
In UTF-16 a codepoint can be represented by one or two 16-bit code units. I don't think Buffer.indexOf has any checks for that.
I don't think Buffer.indexOf has any checks for that.
Hmm that's another good question (although personally I'd prefer it to check if it doesn't).
@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2. Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2. If you put offset equal 1 you have 1 div 2 mul 2 == 0.
@mkopa Thanks for the reply.
I agree that utf16le encoding is always two bytes, but buffer is not. "buffer" can be intended to store a mix of many things. For example, the beginning 7 bytes of a buffer is used for some other things, then starting from the 7th byte is used to store utf16le string. In this case, an odd byteOffset is needed.
I regarded the current behavior as a bug because according to the NodeJS specification: https://nodejs.org/api/buffer.html#buffer_buf_indexof_value_byteoffset_encoding it does not mention that odd byteOffset is not allowed.
I would accept the current behavior not a bug if the NodeJS specification mentioned that.
Hi everybody! Below I want to persuade that returning -1 for odd byteOffset in utf16le can be really weird.
Please consider the following 5 lines of code: let buf = Buffer.alloc(7); let str = '\u6881\u6882\u6881'; console.log(buf.write(str, 1, 'utf16le')); //6 (successfully written) console.log(buf); //<Buffer 00 81 68 82 68 81 68> console.log(buf.indexOf(str, 1, 'utf16le')); //-1 (NOT found)
I write a string at odd byteOffset 1 but finding it using indexOf at the same position give me -1 which means "NOT found". If anyone still regard the above behavior valid, then I have nothing to say.
That might be a breaking change - if the UTF-16 data starts at an even offset, indexOf
might find a code unit composed of an odd byte and an even byte that happen to match the input value. For instance, this test breaks.
However, the behavior in the issue description is clearly a bug - the offset should be rounded up to an even value, which is a simple fix. Thoughts?
the offset should be rounded up to an even value, which is a simple fix. Thoughts?
I would disagree – I think buffer.indexOf(str, offset, enc)
should work the same way that buffer.slice(offset).indexOf(str, enc)
works (with an adjusted return value).
@seishun Would that work, just using .slice()
if the offset is odd and then adjusting the return value? That way you don’t run into that issue you described with an odd byte and an even byte that happen to match, right?
If the offset is odd, buffer.slice(offset).indexOf(str, enc)
will only consider byte pairs that are odd and even in the original buffer. For example, Buffer.from('0000aaaa', 'hex').slice(1).indexOf('\uaaaa', 'utf16le')
currently fails, but Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 1, 'utf16le')
succeeds.
@seishun Yeah, I think that’s the behaviour that I would expect – by specifying an odd offset to begin with rather than an even one, I’m saying that I want to look at odd + even pairs rather than even + odd ones.
That seems unintuitive. Plus it wouldn't fix the original issue - indexOf
would return -1
rather than the desired 4
or the current 0
.
@seishun it would AFAIK return 2
, not -1
? I agree with @addaleax and would also expect slice(1)
to be used in this case.
Thank everybody for handling this issue!
@BridgeAR Have you tried it?
let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));
@seishun Yeah, I think that’s (another) bug then. :/
@addaleax So what would be the expected behaviour?
@seishun I would say in your example both methods should return -1. I think it shouldn’t matter whether a Buffer
was created through .slice()
, or allocated directly.
i.e., if buf
is a Buffer, then buf.indexOf(...)
and Buffer.from(buf).indexOf(...)
should return the same results.
@BridgeAR Have you tried it?
let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le'); console.log(buf.indexOf('\u6881', 1, 'utf16le')); console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));
@addaleax So what would be the expected behaviour?
I suggest returning 4 for the 1st case (without slice) and 3 for the 2nd case (with slice) after the fix.
So we have a conflict of opinions.
@addaleax (and @BridgeAR?) thinks indexOf
with 'utf16le' should look only at odd-even byte pairs if the offset is odd, and only at even-odd pairs otherwise.
@hkjackey thinks it should look at both odd-even and even-odd byte pairs regardless of the offset.
How should we proceed?
I’m personally also okay with what @hkjackey’s suggestion – that’s most consistent with what .indexOf()
does for other kinds of arguments.
That brings us back to my comment. If you think it's fine then I can just delete that test and my PR is practically ready.
@addaleax Thanks for your support! Also thank @seishun for fixing this bug!
Version: v11.10.1 Platform: Windows 10 (64-bits) Subsystem: buffer File Encoding: UTF8
Please consider the following 2 lines of code: let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le'); console.log(buf.indexOf('\u6881', 1, 'utf16le'));
In this example, the expected correct output should be 4. However, the result is 0.