meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com

Add support for CSV encoding for BATCH messaging #1584

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

As noted elsewhere, this is relatively trivial to implement engineering-wise but non-trivial config-wise/spec-wise, specifically because there isn't, and never has been, an exact "CSV spec".

"CSV" is really not a single spec

Rather, CSV should be considered a family of related formats, which may or may not be comma delimited. The Wikipedia article (unintentionally) does a pretty good job of highlighting the challenges here:

From https://en.wikipedia.org/wiki/Comma-separated_values (emphasis mine):

The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content while others surround the field with quotation marks, which yet again creates the need for escaping if quotation marks are present in the data.

_The term "CSV" also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons.[2]_ These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing.

_Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange._ Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales in order to use the comma as the decimal separator and, possibly, the period as a decimal grouping character.

Commonly unsupported options

Examples of classic challenges when using CSV files as an interchange format:

  1. Ambivalence towards column headers.
    • Notoriously, Snowflake doesn't read file headers - it only lets you specify a number of header rows to skip, and it expects all fields to be provided in ordinal order matching the file spec.
  2. Newlines triggering "next record" regardless of escaping/quoting.
    • Many CSV parsers do not correctly parse newlines that appear within field data. A config-level option for taps may be required to replace newline characters with another string, such as ';' or '' (empty string). Sometimes it is preferable to replace the newline character with '\n', and then the dbt layer or another post-processing step can replace the string '\n' with an actual newline character.
  3. Unsupported character encodings.
    • This occurs most often when dealing with cross-OS or cross-region interchange. To my knowledge, this is best mitigated by explicitly requiring encoding to be utf-8.

Not re-inventing the wheel

As has been helpful in other contexts, I'll propose we start with an already existing spec for config.

Details

According to ChatGPT, these are the most popular ways to process CSV files with Python:

> 1. csv.reader(): This is a built-in Python library that provides a fast and efficient way to read CSV files. It returns an iterator that can be used to iterate over the rows in the CSV file. This is the most popular method of reading CSV files in Python.
> 2. pandas.read_csv(): This is a popular method of reading CSV files in Python. It uses the pandas library, which provides high-performance, easy-to-use data structures and data analysis tools. It returns a DataFrame object that can be used to manipulate the data.
> 3. numpy.loadtxt(): This is another popular method of reading CSV files in Python. It uses the numpy library, which provides a powerful array computing and linear algebra toolset. It returns an array object that can be used to manipulate the data.
> 4. csv.DictReader(): This is similar to csv.reader(), but it returns a dictionary object for each row, with the keys being the column headers and the values being the row data. This can be useful if you need to access specific columns of the data.

Essentially, this would point us towards one of three libraries: `csv`, `pandas`, or `numpy`. The `csv` library is the only one shipped with the standard library. The `pandas` and `numpy` libraries would each require an additional import (and each with their own dependencies).

If we prioritize serial and incremental reading/writing, I think `csv.reader()` or `csv.DictReader()` makes good sense, with `csv.reader()` likely having better performance, and `csv.DictReader()` having the advantage of pre-processing the dataset into `dict` objects, which we need to do anyway. If we prioritize flexibility or configuration, the `pandas` config options may be more robust and expressive overall. The `pandas` library also has pre-built dialects for `Excel` and other common formats. What I don't know (personally) is whether `pandas`'s prioritization of in-memory analytics would introduce any penalty versus other serial `one-record-in-one-record-out` methods, which may have a lower RAM footprint. Another disadvantage of using `pandas` is that we would then need to ship potentially _all_ SDK taps and targets with the `pandas` library in order to have true universal interop.
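
For illustration, here is a minimal sketch of the serial, low-memory style of reading that `csv.DictReader()` enables; the file name and usage are hypothetical.

```python
import csv


def iter_csv_records(path: str):
    """Yield one dict per CSV row, streaming rather than loading the file into memory."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row  # keys come from the header row


# Hypothetical usage:
# for record in iter_csv_records("users.csv"):
#     process(record)
```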

UPDATE: Proposal using polars as config base here: https://github.com/meltano/sdk/issues/1584#issuecomment-1499263096

Distinction of config dialect vs actual implementation details.

We can optionally decouple the configuration dialect from how the files actually get processed - although pragmatically speaking, it is easier and more stable if we keep these aligned, at least for the default SDK implementations.

It's also worth calling out that, whatever configuration dialect we choose, native processors will need a translation step. So, if we choose the Pandas config dialect, for instance, which accepts an "Excel" dialect, any source or target receiving this instruction will have to translate this config detail into a set of config parameters that the connected system can understand.

Raising an exception for unsupported dialect options

We likely would need to introduce a new exception class which would be used specifically for the purpose of raising and catching cases where config options are unexpected or unhandled. The raising of this exception might not fail the entire sync operation, but rather this could trigger base-level SDK-driven processors...

Graceful failover to SDK processing in case of unsupported dialects

Of course, if a connected system like Snowflake or Redshift cannot natively understand the dialect options that the user provides, we may have to send these processing directives to the SDK-backed native processors. In this way, we can guarantee that any CSV file sent within our range of expected config options will be successfully processed, even if it cannot be understood natively by the source or target system.
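
Pulling the last two points together, here is a rough sketch of the exception-plus-failover flow. The class and function names are illustrative only, not existing SDK APIs.

```python
class UnsupportedBatchDialectError(Exception):
    """Raised when a connected system cannot natively handle the requested CSV dialect."""


def native_load(path: str, dialect: dict) -> None:
    # Hypothetical native bulk-load path; rejects anything but a plain comma dialect.
    if dialect.get("delimiter", ",") != ",":
        raise UnsupportedBatchDialectError(f"Unsupported dialect: {dialect!r}")
    print(f"Natively loading {path}")


def sdk_fallback_load(path: str, dialect: dict) -> None:
    # Stand-in for the slower, SDK-driven record-by-record processing.
    print(f"Falling back to SDK processing for {path}")


def load_batch_file(path: str, dialect: dict) -> None:
    try:
        native_load(path, dialect)
    except UnsupportedBatchDialectError:
        sdk_fallback_load(path, dialect)
```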

While this graceful failover is nice from a usability perspective, it may produce unexpected effects performance-wise. If the user expects a CSV to be natively processed by target-snowflake, for instance, but the specific dialect is not natively supported by Snowflake (or the dev's Snowflake implementation), then records will be loaded much slower than the end-user might otherwise expect. Which leads us to the next point, regarding an end-user option to fail when native processing is expected but not actually available...

Config option to restrict to native batch-processing only

To mitigate the above, it might be worthwhile to introduce something like a "batch_config": {"require_native_processor": true} config that gives users the ability to fail a pipeline if the config dialect cannot be natively processed by the upstream or downstream system. While this is especially needed for CSV, which has an almost infinite cross-product of dialect options, it also could apply to JSONL and parquet parsing - wherein the tap or target could abort rather than go through the SDK-backed conversion of records to files, which will always be slower than a native bulk export or bulk import operation.
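
As a sketch only (expressed as a Python dict, since the final config shape is undecided), the option might sit alongside the existing batch settings like this. The "require_native_processor" key is the option proposed above and does not exist in the SDK today.

```python
# Hypothetical target config for the proposed fail-fast behavior.
target_config = {
    "batch_config": {
        "encoding": {"format": "csv", "compression": "gzip"},
        "require_native_processor": True,  # abort instead of falling back to SDK parsing
    },
}
```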

aaronsteers commented 1 year ago

@edgarrmondragon, @kgpayne - Not an urgent priority, but I'm curious if you've got thoughts/suggestions on the above. Personally, I'm leaning towards a standardization based on the csv.DictReader() for the default implementation and also for the default config dialect.

It is worth noting that tap-spreadsheets-anywhere's implementation does use csv.DictReader() (here).

Another option here would be to come up with one single implementation that we think basically all sources and targets will support, and then just mandate (at least for the first implementation) that all processors will exactly match our opinionated configset, or otherwise processing will be handled by the SDK. I don't love this option because we've already had requests from users that they want to ingest very large data files using something like tap-csv to target-snowflake, and they have no control over the dialect that their data is provided in. In order to meet their use case, we need to be able to accept a range of config options that would be a superset of how their own data providers have chosen to publish those datasets.

tayloramurphy commented 1 year ago

@aaronsteers the only thing I have to add is that polars has become more popular and seems to have good performance. Could be interesting to look at for some of these workloads https://www.pola.rs/

WillDaSilva commented 1 year ago

I'll add that polars only has a single Python dependency: typing_extensions. In that regard, it's only beat by sticking to the standard library. This is an advantage of a lot of these Rust-Python libraries that have been gaining popularity lately - they don't pull in a mess of Python dependencies.

aaronsteers commented 1 year ago

I like the idea of using Polars, and you both make good arguments. I especially appreciate the prioritization of streaming the CSV data rather than loading the full set into memory. Just looking at the docs, it also has great lazy processing capabilities.

Taking then the polars read_csv() and write_csv() config options...

Details

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html

[screenshot of the polars read_csv() config options]

We could propose (as a starting point) support for these config options:

All of the above config options could basically be sent directly as kwargs to polars, and they are options with analogues in virtually all major platforms.
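
As a rough sketch (exact parameter names vary a bit between polars versions, so treat these kwargs as illustrative), the pass-through could look like this:

```python
import polars as pl

# Illustrative dialect options, passed straight through as kwargs.
csv_options = {
    "separator": ",",
    "quote_char": '"',
    "has_header": True,
    "null_values": [""],
    "encoding": "utf8",
}

df = pl.read_csv("batch-0001.csv", **csv_options)  # hypothetical batch file name
df.write_csv("batch-0001-out.csv", separator=csv_options["separator"])
```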

Exceptions and proposed tweaks

Dealing with newline issues

As noted, some targets don't deal well with newline characters. We could optionally add a newline_replacement option, which would accept a string value to substitute for the newline character. Sources that receive this input should perform a full replacement in their string-like columns, so that the newline character is replaced with the specified value.

Targets receiving this input can either ignore the treatment and just import the replacement string as data, or they can implement a replace operation post-load. We could very likely leave this option out of the first iteration - and even if we include it, we would recommend it only be used for targets that have difficulty with newlines.
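
A minimal sketch of the treatment described above, assuming a newline_replacement value of the literal two-character string `\n`; the helper names are illustrative.

```python
NEWLINE_REPLACEMENT = "\\n"  # the two characters backslash + n, not a real newline


def scrub_newlines(record: dict, replacement: str = NEWLINE_REPLACEMENT) -> dict:
    """Tap side: replace real newlines in string fields before writing the CSV row."""
    return {
        key: value.replace("\n", replacement) if isinstance(value, str) else value
        for key, value in record.items()
    }


def restore_newlines(record: dict, replacement: str = NEWLINE_REPLACEMENT) -> dict:
    """Optional target-side post-load step to restore the original newlines."""
    return {
        key: value.replace(replacement, "\n") if isinstance(value, str) else value
        for key, value in record.items()
    }
```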

visch commented 1 year ago

The few things I see almost everyone mess up with CSV generation are:

  1. The separator chosen can also come up in your data, you have to have some way to escape that character (Quote enclosing works)
  2. If using quote_char you need a way to escape quotes that are in your data set as it's possible the data itself has ",
  3. Data can contain new line character(s), this should be handled somehow, either using quote_char's or providing an alternate mapping for line feeds

If you solve those couple of things (along with enforcing utf-8), that solves almost every class of issue I've hit over lots and lots of CSVs (I think we were ingesting them from ~70 unique sources, in ~700 implementations).

Having said that, Python's built-in csv module constantly surprises me with how good it is! I hit a lot of these issues using a much less used platform (Groovy).

aaronsteers commented 1 year ago

@visch - Good points, and yes, I can confirm I've seen each of these in the wild also.

In comparing against my last comment, the one thing I think we're missing that I've seen in other implementations is the escape_char or escape mechanism for the quote_char appearing in the string itself. (Your number 2 above.)

I'll check into this ☝️ and see how Polars handles it. Update: since the Polars docs do not mention an escape character, but they do mention automatic escaping of the quote character, it appears that they default to escaping via doubling the quote character. We probably should document that somewhere. So, the string His name is "Peter". would be escaped as "His name is ""Peter"".".
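
For reference, the standard library's csv module produces the same doubled-quote escaping by default (`doublequote=True`), which matches the example above:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quotechar='"', doublequote=True)
writer.writerow(["1", 'His name is "Peter".'])
print(buffer.getvalue())  # 1,"His name is ""Peter""."
```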

Your number 3 point is noted above in my "Dealing with newline issues" section. Probably we can start with an assumption that targets will need to handle newlines within the data stream. However, we should also plan for a follow-on story that introduces a tap-level option to replace \n with '\n' if the developer of the target knows that the target chokes when (natively) parsing newlines. (Clever users could solve this with stream maps, but we don't want that to be required for basic interop.)

It's worth noting that platforms that can't natively support newlines in CSV files could just not implement a native CSV import - instead nudging users to prefer JSONL or Parquet, while letting the SDK handle CSV parsing on their behalf.

aaronsteers commented 1 year ago

I think we're very close to a final proposal here.

Let me see if I can summarize:

Spec proposal summary

1. SDK-based taps will ship with support for "encoding": "csv" within batch_config for taps.

If users specify "encoding": "csv" then SDK-based taps will automatically emit CSV files as their BATCH output. (No dev work required from tap maintainer.)

A config option for encoding_config: {...} will be accepted, which contains the above-mentioned CSV-specific config options. (TBD whether it should be optional or required.)

2. SDK-based targets will ship with ability to accept CSV batch files.

This doesn't require any target-level config, because the BATCH record type spec already includes the necessary info regarding encoding and config options.

The SDK will send the encoding_config within the batch record messages, so that targets get the same config spec as was provided to the tap.
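
For illustration (as a Python dict, since the exact shape is still TBD), a CSV BATCH message carrying the proposed encoding_config might look something like this, modeled on the existing JSONL message. The "encoding_config" key is the addition proposed here and is not part of the current spec.

```python
# Hypothetical BATCH message with CSV encoding and dialect options attached.
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "csv", "compression": "gzip"},
    "encoding_config": {"separator": ",", "quote_char": '"', "has_header": True},
    "manifest": ["file:///tmp/users--batch-0001.csv.gz"],
}
```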

3. .csv.gz support will be configurable by using {encoding: csv, compression: gzip} just as we already have support for .jsonl.gz files.

Not explicitly mentioned above, but we should assume that we want compatibility with gzip, just as with .jsonl.gz.

4. Taps and targets can optionally add handling for native CSV processing.

If a native handler is provided by the developer, then the SDK will delegate this work to the tap or target implementation method(s), rather than using the built-in SDK handlers.
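
A hypothetical sketch of what a target-side native handler might look like; the class and method names here are illustrative and not the SDK's actual API surface.

```python
class MyWarehouseSink:
    """Illustrative sink that advertises native CSV handling."""

    def process_batch_files(self, encoding: dict, files: list) -> None:
        # Hypothetical hook: forward CSV files to the warehouse's own bulk loader.
        if encoding.get("format") != "csv":
            raise NotImplementedError("Only native CSV loading is sketched here.")
        for path in files:
            self._copy_into_warehouse(path, encoding)

    def _copy_into_warehouse(self, path: str, encoding: dict) -> None:
        # Stand-in for a COPY/LOAD statement issued against the warehouse.
        print(f"COPY INTO my_table FROM {path!r} -- options: {encoding}")
```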

5. List of configurable options

6. Unsupported spec options

In first iteration, these would all be unsupported:

  1. Custom escape characters.
  2. Custom number of lines to skip parsing (besides header row).
  3. Non-handling of newlines in record data.
  4. Encodings besides utf-8.

Any targets which cannot natively handle these constraints should not advertise their own native CSV processing methods. Instead, they should let the polars/SDK implementation handle the conversion from CSV file into record messages. (Future iterations could add options here.)

7. Taps will fail if they receive invalid options for CSV config.

If any options are provided that do not match the agreed-upon config options, the tap will fail. The SDK can handle this automatically, so again, tap developers don't need to write this logic.
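
A small sketch of the fail-fast validation in point 7; the allowed key set below is illustrative only.

```python
ALLOWED_CSV_OPTIONS = {"separator", "quote_char", "has_header", "null_values"}


def validate_csv_config(encoding_config: dict) -> None:
    """Fail the sync early if unknown CSV options are provided."""
    unknown = set(encoding_config) - ALLOWED_CSV_OPTIONS
    if unknown:
        raise ValueError(f"Unsupported CSV config options: {sorted(unknown)}")


validate_csv_config({"separator": ";", "quote_char": '"'})  # passes
# validate_csv_config({"escape_char": "\\"})                # would raise ValueError
```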

TBD

We have yet to decide if the SDK should have any default CSV config for taps - or if the absence of a CSV config spec should just generate the "best practice" config that we describe above as the "defaults". (Regardless of what the user experience is, all messages from tap to target will be explicit in regards to those config options.)

Also undecided is exactly where the extra config should be provided. Above recommends encoding_config as a sibling to the existing encoding option, but there may be other options to consider as well.

aaronsteers commented 1 year ago

Note to readers: I've added the Accepting Pull Requests label. Let us know here or in slack if you would like to contribute!

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

AdamantiumKiwi commented 1 year ago

Just wanted to comment that there is a CSV-related spec, RFC 4180, which defines a CSV format for MIME attachments.

The good news is that I think the spec above meets most if not all of it - specifically, it defines quote escaping using the doubled double-quote.

Also a question - does the unsupported item 3 "Non-handling of newlines in record data." mean that newlines in record data will not be handled, or that they will be (the implied double-negative is confusing)? This is an important requirement for just about any CSV handling - for example, the presence of addresses just about guarantees multi-line values.

tayloramurphy commented 1 year ago

Assuming stalebot messed this one up @edgarrmondragon ?

edgarrmondragon commented 1 year ago

Yeah, this is still relevant

stale[bot] commented 3 months ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.