mscdex / busboy

A streaming parser for HTML form data for node.js
MIT License

Problem on utf8 filename #340

Closed by wesleimarinho 11 months ago

wesleimarinho commented 1 year ago

I have set defParamCharset to utf8, but I'm getting different results for some users.

When sending (for example, a file named Declaração.pdf):

ç - Hex: E7

Sometimes the server receives the same filename, but sometimes it receives ç as two code points (Hex 63 327). They look the same, but are different characters.

The same thing happens with ã (Hex E3), which is sometimes received as two code points (Hex 61 303).

Both users used the same browser (Chrome) and version (113), on the same operating system (Windows 11 Portuguese), sending the same file.

The hex codes were obtained by using https://www.rapidtables.com/convert/number/ascii-to-hex.html.
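The hex values reported above are consistent with Unicode's composed (NFC) and decomposed (NFD) forms of the same letter. A minimal sketch of the difference, using only the built-in `String.prototype.normalize`; the `codePoints` helper is hypothetical and exists only for illustration:

```javascript
// Hypothetical helper: list a string's code points as uppercase hex.
function codePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16).toUpperCase()).join(' ');
}

const composed = '\u00E7';                    // ç as a single code point
const decomposed = composed.normalize('NFD'); // c followed by a combining cedilla

console.log(codePoints(composed));    // E7
console.log(codePoints(decomposed));  // 63 327
console.log(composed === decomposed); // false, although both render as ç
console.log(composed === decomposed.normalize('NFC')); // true
```

Both strings render identically, which is why the difference only shows up when the code points or bytes are inspected.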

The server runs busboy 1.6.0 on Node.js 16 and CentOS 7; the system locale is en_US.UTF-8.

Anyone got any idea of what may be causing the issue?

mscdex commented 1 year ago

Have you captured a copy of the raw binary request data when it happens and compared the data to see if the difference in character bytes exists there (or is that what you've already done)? If so, there's nothing busboy can do about that as it can only work with what it's given.

wesleimarinho commented 1 year ago

@mscdex What is the recommended way to do that? Should I intercept the request in Node.js, or capture it in the sending tool?

mscdex commented 1 year ago

Either should work, but on the node side you could just do req.pipe(fs.createWriteStream('/tmp/foo')); instead of req.pipe(busboy); Then just use a hex editor (e.g. xxd on Linux) to look at the contents of the raw request and see what the bytes are for the filename in question.
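The capture step suggested above could be sketched roughly as follows. This is only an illustration under assumptions: `startCaptureServer` and `filenameHex` are hypothetical helper names, the port and output path are up to the caller, and the filename lookup simply searches for the first `filename="` in the captured body rather than parsing the multipart format properly.

```javascript
const fs = require('node:fs');
const http = require('node:http');

// Hypothetical helper: a throwaway server that writes the raw request body
// to a file instead of piping it into busboy.
function startCaptureServer(port, outPath) {
  return http.createServer((req, res) => {
    req.pipe(fs.createWriteStream(outPath)).on('finish', () => {
      res.end('captured\n');
    });
  }).listen(port);
}

// Hypothetical helper: return the raw bytes of the first filename="..."
// parameter in a captured body as space-separated hex, or null if absent.
function filenameHex(raw) {
  const marker = Buffer.from('filename="');
  const start = raw.indexOf(marker);
  if (start === -1) return null;
  const from = start + marker.length;
  const end = raw.indexOf(0x22, from); // 0x22 is the closing double quote
  if (end === -1) return null;
  const hex = raw.subarray(from, end).toString('hex');
  return (hex.match(/../g) || []).join(' ');
}
```

After a capture, something like `filenameHex(fs.readFileSync('/tmp/foo'))` shows the bytes the client actually sent. In UTF-8, a composed ç appears as `c3 a7`, while the decomposed form appears as `63 cc a7`.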

wesleimarinho commented 11 months ago

@mscdex I finally found out what happened. Filenames on macOS have this peculiarity: https://stackoverflow.com/questions/6153345/different-utf8-encoding-in-filenames-os-x. Is there any way to configure busboy to always return filenames in NFC normalized form instead of fully decomposed form?

wesleimarinho commented 11 months ago

For the record, I've solved it on my end with:

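// file.name is the uploaded filename as a JS string.
const filenameBuffer = Buffer.from(file.name, 'utf-8');
// normalize('NFC') composes sequences such as c + U+0327 into a single ç.
const normalizedFilename = filenameBuffer.toString().normalize('NFC');
const encodedFilenameUTF8 = Buffer.from(normalizedFilename).toString('utf-8');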

Therefore, this issue can be closed.
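For readers landing here later: the thread does not mention a busboy option for this, so normalizing in application code seems to be the practical route. For well-formed strings, the Buffer round-trips in the snippet above should not change the result, so a call to `normalize('NFC')` alone appears to be enough. A minimal sketch, assuming busboy 1.x's `file` event signature `(name, stream, info)`; `withNfcFilename` is a hypothetical helper name, not part of busboy:

```javascript
// Hypothetical helper: wrap a busboy-style 'file' listener so that the
// wrapped handler always receives the filename in NFC form.
function withNfcFilename(handler) {
  return (name, stream, info) => {
    const filename = typeof info.filename === 'string'
      ? info.filename.normalize('NFC')
      : info.filename; // leave a missing filename untouched
    handler(name, stream, { ...info, filename });
  };
}

// The Buffer round-trip and the direct call agree for a decomposed name.
const decomposed = 'Declarac\u0327a\u0303o.pdf';
const viaBuffers = Buffer.from(
  Buffer.from(decomposed, 'utf-8').toString().normalize('NFC')
).toString('utf-8');
console.log(viaBuffers === decomposed.normalize('NFC')); // true

// Usage with an existing busboy instance `bb` (assumed, not created here):
// bb.on('file', withNfcFilename((name, stream, info) => { /* ... */ }));
```

Normalizing at the point where the filename enters the application keeps later comparisons and storage consistent, whichever operating system the upload came from.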