mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.55k stars 1.15k forks

Bugs with UTF-32 parsing #551

Open zioth opened 6 years ago

zioth commented 6 years ago

Someone gave me a UTF-32 CSV exported from Excel. The export included lots of embedded nulls, which are normal in UTF-32, but PapaParse didn't strip them out. The file also had a URL with a leading newline. That value was quoted in the export, but PapaParse didn't strip the quotes.

The same string (with the quote-newline-url-quote pattern) parses fine in ASCII. It only fails in UTF-32, whether or not I set the "encoding" option.
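To illustrate where the embedded nulls come from, here is a minimal sketch. It assumes the file is little-endian UTF-32 and is being read one byte per character. Neither assumption is confirmed in this thread, and `encodeUtf32le` and `misreadAsSingleByte` are hypothetical helper names, not part of PapaParse.

```javascript
// Encode a string as UTF-32LE bytes: 4 bytes per code point, no BOM.
function encodeUtf32le(text) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  const bytes = new Uint8Array(codePoints.length * 4);
  const view = new DataView(bytes.buffer);
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, true));
  return bytes;
}

// Misread the bytes one byte per character, the way a single-byte
// decoder would (assumption: this is roughly what happens to the file).
function misreadAsSingleByte(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

const sampleBytes = encodeUtf32le('"a"');
const misread = misreadAsSingleByte(sampleBytes);

// Every ASCII character is now followed by three NUL characters.
console.log(JSON.stringify(misread));
console.log(misread.length);
```

If this is what happens, a quote that follows a delimiter is no longer the first character of its field, because the delimiter's three trailing NULs come before it. That could be why the quotes are kept, but it is only a guess without a sample file.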

zioth commented 6 years ago

Correction: The file was exported by MacOS Numbers, not Excel.

dboskovic commented 6 years ago

@zioth would you be able to share an example file here with a few problem lines and personal information removed?

shawn-eary commented 2 years ago

Does JavaScript support characters that are more than 2 bytes? https://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string

In order for PapaParse to handle characters past U+FFFF, I think it would have to manually consider all four individual bytes of each UTF32 character.
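For what it's worth, JavaScript strings are sequences of UTF-16 code units, so a character past U+FFFF is stored as a surrogate pair (two code units), and `String.fromCodePoint` builds that pair from a full code point. Below is a rough sketch of reading all four bytes of each UTF-32 character before parsing. `decodeUtf32` is a hypothetical helper, not something PapaParse provides, and the BOM handling is an assumption about how the file might be laid out.

```javascript
// Decode UTF-32 bytes into a JavaScript string, 4 bytes per code point.
// Hypothetical helper for illustration only.
function decodeUtf32(bytes, littleEndian = true) {
  if (bytes.length % 4 !== 0) {
    throw new RangeError("UTF-32 input length must be a multiple of 4");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = 0;

  // If a byte order mark (U+FEFF) is present, use it to pick the
  // endianness and skip it. Otherwise fall back to the littleEndian argument.
  if (bytes.length >= 4) {
    if (view.getUint32(0, true) === 0xfeff) {
      littleEndian = true;
      offset = 4;
    } else if (view.getUint32(0, false) === 0xfeff) {
      littleEndian = false;
      offset = 4;
    }
  }

  let text = "";
  for (; offset < bytes.length; offset += 4) {
    // String.fromCodePoint throws a RangeError for values above 0x10FFFF.
    text += String.fromCodePoint(view.getUint32(offset, littleEndian));
  }
  return text;
}

// UTF-32LE with BOM: "a" followed by U+1F600 (outside the BMP).
const input = new Uint8Array([
  0xff, 0xfe, 0x00, 0x00, // BOM
  0x61, 0x00, 0x00, 0x00, // "a"
  0x00, 0xf6, 0x01, 0x00, // U+1F600
]);
const decoded = decodeUtf32(input);
console.log(decoded.length); // 3: one code unit for "a", two for U+1F600
```

The decoded string could then be passed to `Papa.parse` as ordinary text. For large files this would probably need to be done in chunks rather than with one string concatenation loop.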