ldubos / external-sorting

An external sorting algorithm implementation for NodeJS 💾
MIT License
15 stars 1 forks source link

Minor data corruption probably involving multibyte UTF8 boundaries #11

Closed wholebuzz closed 2 years ago

wholebuzz commented 2 years ago

Test case:

async function run() { const esortArgs = { input: fs.createReadStream('test.jsonl'), output: fs.createWriteStream('output.jsonl'), tempDir: __dirname, deserializer: JSON.parse, serializer: JSON.stringify, maxHeap: 100, } await esort.default(esortArgs).asc((x) => x.id) }

run()

- run and compare output to input

node test.js diff test.jsonl output.jsonl

< {"id":312,"date":"2021-10-15T12:00:25.000Z","guid":"https://www.thesixthaxis.com/?p=387637","link":"https://www. thesixthaxis.com/2021/10/15/what-we-played-520-scarlet-nexus-far-cry-6-and-metroid-dread/","feed":"http://www.thes ixthaxis.com/feed/","props":{"title":"What We Played #520 – Scarlet Nexus, Far Cry 6 and Metroid Dread","summary": "Is it spooky game time yet?"},"tags":{"topic":"ENTERTAINMENT","locale":"en-US","topics":["Community News","Featur ed Stories","Homepage Header","What We Played"],"category":"technology/gaming","classifiedCategory":"technology/ga ming"}}

{"id":312,"date":"2021-10-15T12:00:25.000Z","guid":"https://www.thesixthaxis.com/?p=387637","link":"https://www. thesixthaxis.com/2021/10/15/what-we-played-520-scarlet-nexus-far-cry-6-and-metroid-dread/","feed":"http://www.thes ixthaxis.com/feed/","props":{"title":"What We Played #520 ��� Scarlet Nexus, Far Cry 6 and Metroid Dread","summary ":"Is it spooky game time yet?"},"tags":{"topic":"ENTERTAINMENT","locale":"en-US","topics":["Community News","Feat ured Stories","Homepage Header","What We Played"],"category":"technology/gaming","classifiedCategory":"technology/ gaming"}}

Note the difference before Scarlet Nexus.

The issue doesn't seems to occur if only the row with id 312 is sorted, so it seems likely some Buffer accumulation issue? A possible work around may be to pipe the input to a line parser?

ldubos commented 2 years ago

Hi, I just fixed the issue the patch is available via an npm update, thanks for the report, I will close the ticket after your confirmation 🙂

wholebuzz commented 2 years ago

Awesome. Confirmed. Thanks for quick update!