tototoshi / scala-csv

CSV Reader/Writer for Scala

Using multiple delimiters #321

Open Olgierd-Jankovski opened 3 months ago

Olgierd-Jankovski commented 3 months ago

Hi @tototoshi! It's really great that you created this parser and keep fixing and improving its functionality!

At my company we have run into an issue: multiple clients send us CSV files containing data in the same predefined format, but with different delimiters, and we are able to assign only one delimiter (the default is a comma).

As a result, we are unable to parse our clients' imported documents, since any of them may use , or # or ; as the delimiter/separator character. The data is parsed incorrectly or, even worse, parsing crashes, e.g. when trying to parse the following line:

"hello world";123,321

I have seen an example of overriding the default delimiter with a custom character:

implicit object MyFormat extends DefaultCSVFormat {
  override val delimiter = '#'
}

However, as far as I understand, the parser does not support multiple delimiters, e.g.:

// Hypothetical: DefaultCSVFormat has no delimiterHashset member today
implicit object MyFormat extends DefaultCSVFormat {
  override val delimiterHashset = HashSet(';', ',', '#')
}
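To make the request concrete, here is a minimal, library-independent sketch of what "any of several delimiters" could mean for a single line. `MultiDelimiterSplit` and `splitLine` are hypothetical names, not part of scala-csv, and the sketch assumes double-quoted fields with no escaped quotes and no multi-line fields.

```scala
object MultiDelimiterSplit {

  /** Splits one line on any character in `delimiters`,
    * ignoring delimiter characters that appear inside double quotes.
    * Simplified on purpose: no escaped quotes, no multi-line fields.
    */
  def splitLine(line: String, delimiters: Set[Char], quote: Char = '"'): List[String] = {
    val fields = List.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false

    line.foreach { c =>
      if (c == quote) {
        // Toggle the quoted state and drop the quote character itself.
        inQuotes = !inQuotes
      } else if (!inQuotes && delimiters.contains(c)) {
        // Any configured delimiter outside quotes ends the current field.
        fields += current.toString
        current.clear()
      } else {
        current.append(c)
      }
    }

    fields += current.toString // last field has no trailing delimiter
    fields.result()
  }
}
```

With the delimiter set `;`, `,` and `#`, the line from this issue splits into the three fields `hello world`, `123` and `321`. Note that this treats every candidate as a separator in every file, so an unquoted comma inside a field of a `;`-delimited file would also be split.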

I would appreciate it if we could discuss possible solutions to this issue! Thank you for your time! Looking forward to your reply!

Best Regards, Olgierd Jankovski

tototoshi commented 3 months ago

Hi @Olgierd-Jankovski

It seems that supporting multiple delimiters in the parser would be challenging. Delimiters are treated in a special way, and allowing for multiple ones would require significant changes to the parser’s implementation, likely affecting its performance and potentially impacting other functionalities like CSV writing.

I believe that a format with multiple delimiters might differ from the standard CSV format, which is why I tend to think that support for such a feature may not be necessary for a general-purpose CSV library. However, I recognize that this is just my perspective, and there may be more situations where such formats are commonly used.

If there are CSV libraries that support multiple delimiters, I would be very interested in learning more about them as a reference.

Olgierd-Jankovski commented 3 months ago

Thank you for your response!

True, supporting multiple delimiters at the same time could be challenging and, even worse, could become a performance bottleneck. An alternative came to my mind: automatically detecting the delimiter (assuming it is unknown, but is certainly one of the symbols , ; # |). Once the delimiter is detected, the only thing left is to run the current parsing flow.

Of course, several problems arise. How do we decide that the delimiter has been detected (e.g. a file containing x rows was parsed successfully, and every row contains the same number of elements)? Should we scan the entire file or only a chunk of it for delimiter detection?
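One possible answer to both questions, sketched without any library dependency: sample only the first few non-empty lines and accept a candidate only if it occurs the same non-zero number of times in every sampled line. `DelimiterSniffer`, `detect`, and the default `sampleSize` of 20 are hypothetical choices for illustration, and the quote handling is simplified (no escaped quotes, no multi-line fields).

```scala
object DelimiterSniffer {

  // Candidate delimiters taken from the discussion above.
  val Candidates: List[Char] = List(',', ';', '#', '|')

  /** Counts occurrences of `delimiter` outside double-quoted sections. */
  def countOutsideQuotes(line: String, delimiter: Char, quote: Char = '"'): Int = {
    var inQuotes = false
    var count = 0
    line.foreach { c =>
      if (c == quote) {
        inQuotes = !inQuotes
      } else if (!inQuotes && c == delimiter) {
        count += 1
      }
    }
    count
  }

  /** Returns a candidate that occurs the same non-zero number of times
    * in every sampled line, preferring the highest per-line count.
    * Returns None if no candidate is consistent across the sample.
    */
  def detect(lines: Seq[String], sampleSize: Int = 20): Option[Char] = {
    val sample = lines.filter(_.trim.nonEmpty).take(sampleSize)
    if (sample.isEmpty) None
    else {
      val consistent = Candidates.flatMap { d =>
        sample.map(countOutsideQuotes(_, d)).distinct match {
          case Seq(n) if n > 0 => Some(d -> n) // same count on every line
          case _               => None
        }
      }
      if (consistent.isEmpty) None else Some(consistent.maxBy(_._2)._1)
    }
  }
}
```

The detected character could then be used as the `delimiter` of a `DefaultCSVFormat`, as in the override shown earlier in this issue. A single line such as `"hello world";123,321` remains ambiguous under this heuristic, because both `;` and `,` occur exactly once; more sampled rows are needed to tell them apart.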

Implementing this could indeed prove challenging, but I believe the feature could make the parser stand out. As for examples of where and how this feature exists, so far I have come across only a few: https://github.com/nietras/Sep (written in C#) and https://github.com/uniVocity/univocity-parsers (Java).

However, even though they support delimiter detection, it is still unclear to me whether the parse results are valid and whether they expect some condition to be satisfied, e.g. that the first row of the file contains a fixed number of headers. I have not dug deeply into the implementations.

Thank you for your time! Best Regards,

Olgierd Jankovski