The CSVLoader is in many ways the bread and butter of table loading and needs to be really solid.
The current CSVLoader is based on a fork of papaparse, which is hard to maintain because of its dated and convoluted code style.
Options
Make a cleanup pass on the papaparse code (convert to TypeScript, etc.)
Fork another, cleaner CSV loader
Find another CSV loader that supports the async iterator model and import it as a dependency rather than forking it, so that we benefit from open source fixes (see the streaming issue below)
Write a custom CSV loader, ideally using the state machine parser we use in JSONLoader etc. However, the state machine approach is likely complicated due to the "fluid" nature of CSV syntax.
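To make the last option more concrete, here is a minimal sketch of a character-level state machine for CSV. It is not the JSONLoader parser and not papaparse code; the name `parseCSVText` and the four states are assumptions made up for illustration. It only handles a fixed delimiter, double-quoted fields and doubled quotes as escapes.

```typescript
// Hypothetical sketch of a state machine CSV parser (not loaders.gl or papaparse code).
type State = 'fieldStart' | 'unquoted' | 'quoted' | 'quoteInQuoted';

function parseCSVText(text: string, delimiter: string = ','): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = '';
  let state: State = 'fieldStart';

  const endField = (): void => {
    row.push(field);
    field = '';
    state = 'fieldStart';
  };
  const endRow = (): void => {
    endField();
    rows.push(row);
    row = [];
  };

  for (let i = 0; i < text.length; i++) {
    const char = text.charAt(i);
    switch (state) {
      case 'fieldStart':
        if (char === '"') state = 'quoted';
        else if (char === delimiter) endField();
        else if (char === '\n') endRow();
        else if (char !== '\r') {
          field += char;
          state = 'unquoted';
        }
        break;
      case 'unquoted':
        if (char === delimiter) endField();
        else if (char === '\n') endRow();
        else if (char !== '\r') field += char;
        break;
      case 'quoted':
        // Delimiters and newlines are plain data inside quotes.
        if (char === '"') state = 'quoteInQuoted';
        else field += char;
        break;
      case 'quoteInQuoted':
        // A quote seen inside a quoted field: either an escaped quote ("") or the closing quote.
        if (char === '"') {
          field += '"';
          state = 'quoted';
        } else if (char === delimiter) endField();
        else if (char === '\n') endRow();
        else if (char !== '\r') {
          field += char;
          state = 'unquoted';
        }
        break;
    }
  }
  // Flush a final row that has no trailing newline.
  if (state !== 'fieldStart' || field !== '' || row.length > 0) endRow();
  return rows;
}
```

Even this small version needs a lookahead-style state (`quoteInQuoted`) to tell an escaped quote from a closing one. Delimiter detection, header handling and type inference are left out, and a streaming variant would presumably also have to keep `state` and `field` alive across chunk boundaries.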
Problem: Streaming
We do want support for streaming parsing.
When we initially surveyed the landscape in mid-2019, the existing open source CSV loaders that supported streaming usually did so via Node streams (push model).
However, the loaders.gl parseInBatches architecture is AsyncIterator based (pull model), which is arguably more composable and modern, and also aligns with Apache Arrow.
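As a rough illustration of the pull model, the sketch below groups lines from an async iterable of text chunks into batches. `makeLineBatches` and its `batchSize` parameter are hypothetical names, not part of the loaders.gl API, and the line splitting is deliberately naive (it ignores quoted newlines and `\r`).

```typescript
// Hypothetical helper: pulls text chunks on demand and yields batches of lines.
async function* makeLineBatches(
  chunks: AsyncIterable<string> | Iterable<string>,
  batchSize: number = 2
): AsyncGenerator<string[]> {
  let remainder = '';
  let batch: string[] = [];
  for await (const chunk of chunks) {
    const lines = (remainder + chunk).split('\n');
    // The last element may be an incomplete line; carry it over to the next chunk.
    remainder = lines.pop() ?? '';
    for (const line of lines) {
      batch.push(line);
      if (batch.length >= batchSize) {
        yield batch;
        batch = [];
      }
    }
  }
  // Flush whatever is left once the input is exhausted.
  if (remainder.length > 0) batch.push(remainder);
  if (batch.length > 0) yield batch;
}
```

A consumer drives this with `for await (const batch of makeLineBatches(chunks))`. No chunk is read until the consumer asks for the next batch, and breaking out of the loop stops reading, which is what makes such stages easy to chain.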
Converting Node streams to an AsyncIterator is fairly complex and typically involves forking and modifying the code, so we would still end up with a fork unless we can upstream the AsyncIterator changes.
There is a branch that adds a generic stream-to-AsyncIterator adapter - this could be useful, but it ran into subtle issues and would likely require careful testing before landing.
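For reference, the core of a push-to-pull adapter can be sketched as a small queue. This is not the code from the branch mentioned above; `PushToPullQueue` is a hypothetical name, and the sketch deliberately omits backpressure and error propagation, which is likely where the subtle issues live.

```typescript
// Hypothetical sketch of a push-to-pull adapter (not the code from the branch).
class PushToPullQueue<T> implements AsyncIterable<T> {
  private values: T[] = [];
  private waiting: Array<(result: IteratorResult<T>) => void> = [];
  private closed = false;

  // Called by the push-side producer, e.g. from a stream 'data' handler.
  push(value: T): void {
    if (this.closed) throw new Error('queue is closed');
    const resolve = this.waiting.shift();
    if (resolve) resolve({value, done: false});
    else this.values.push(value); // No consumer waiting: buffer (unbounded, no backpressure).
  }

  // Called by the producer when the source ends, e.g. from an 'end' handler.
  close(): void {
    this.closed = true;
    for (const resolve of this.waiting) resolve({value: undefined, done: true});
    this.waiting = [];
  }

  [Symbol.asyncIterator](): AsyncIterator<T> {
    return {
      next: (): Promise<IteratorResult<T>> => {
        // Drain buffered values first, even if the queue has been closed.
        if (this.values.length > 0) {
          return Promise.resolve({value: this.values.shift() as T, done: false});
        }
        if (this.closed) return Promise.resolve({value: undefined, done: true});
        // Nothing buffered yet: park the consumer until the next push() or close().
        return new Promise((resolve) => this.waiting.push(resolve));
      }
    };
  }
}
```

The producer calls `push()` and `close()` from its event handlers, while the consumer iterates with `for await`. Because the buffer is unbounded, a fast producer paired with a slow consumer can still accumulate data in memory, so a production adapter would need to pause the source.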
Problem: Performance / Big Data
We do want a performant parser. It is not clear which parser is fastest, but we have some indications that papaparse is not competitive.
We also ran into issues with large datasets: the substring operation in Chrome retains the original string, leading to excessive memory consumption.