aaronsteers opened this issue 1 year ago
@edgarrmondragon, @kgpayne - Not an urgent priority, but I'm curious if you've got thoughts/suggestions on the above. Personally, I'm leaning towards a standardization based on `csv.DictReader()` for the default implementation and also for the default config dialect.
It is worth noting that tap-spreadsheets-anywhere's implementation does use `csv.DictReader()` (here).
Another option here would be to come up with one single implementation that we think basically all sources and targets will support, and then just mandate (at least for the first implementation) that all processors will exactly match our opinionated configset, or otherwise processing will be handled by the SDK. I don't love this option because we've already had requests from users that they want to ingest very large data files using something like tap-csv to target-snowflake, and they have no control over the dialect that their data is provided in. In order to meet their use case, we need to be able to accept a range of config options that would be a superset of how their own data providers have chosen to publish those datasets.
@aaronsteers the only thing I have to add is that polars has become more popular and seems to have good performance. Could be interesting to look at for some of these workloads: https://www.pola.rs/
I'll add that polars only has a single Python dependency: `typing_extensions`. In that regard, it's only beat by sticking to the standard library. This is an advantage of a lot of these Rust-Python libraries that have been gaining popularity lately - they don't pull in a mess of Python dependencies.
I like the idea of using Polars, and you both make good arguments. I especially appreciate the prioritization of streaming the CSV data rather than loading the full set into memory. Just looking at the docs, it also has great lazy processing capabilities.
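For illustration (not from the thread - just a minimal sketch with a hypothetical file name), lazy scanning in polars looks roughly like this:

```python
import polars as pl

# Lazily scan a (hypothetical) CSV batch file; nothing is loaded into memory yet.
lazy_frame = pl.scan_csv("users-batch-0001.csv", has_header=True, separator=",")

# Work only materializes when we collect, and polars can stream it.
preview = lazy_frame.limit(5).collect()
print(preview)
```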
Taking, then, the polars `read_csv()` and `write_csv()` config options...
We could propose (as a starting point) support for these config options:
- `encoding` - The file encoding to use. `utf-8` in the first iteration. We could leave this config value out of our implementation (with utf-8 as a spec requirement), or else treat it as an enum with only one accepted value.
- `has_header` - Indicate whether the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the format `column_x`, with `x` being an enumeration over every column in the dataset, starting at 1.
- `separator` - Single-byte character to use as the delimiter in the file.
- `quote_char` - Single-byte character used for CSV quoting, default `"`. Set to `None` to turn off special handling and escaping of quotes.
- `null_value` - A string representing null values (defaulting to the empty string).
- `eol_character` - Single-byte end-of-line character.

All of the above config options could basically be sent directly as kwargs to polars, and they have analogues in virtually all major platforms.
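For illustration, a minimal sketch of how such a config block could be passed straight through to `polars.read_csv()` (the `csv_config` dict and helper function are hypothetical, and the exact polars keyword names vary slightly between versions):

```python
import polars as pl

# Hypothetical SDK-side CSV dialect config, using the option names proposed above.
csv_config = {
    "encoding": "utf-8",
    "has_header": True,
    "separator": ",",
    "quote_char": '"',
    "null_value": "",
    "eol_character": "\n",
}


def read_batch_file(path: str) -> pl.DataFrame:
    """Read one CSV batch file by forwarding the dialect options to polars."""
    return pl.read_csv(
        path,
        has_header=csv_config["has_header"],
        separator=csv_config["separator"],
        quote_char=csv_config["quote_char"],
        # An empty string means "use polars' default null handling".
        null_values=csv_config["null_value"] or None,
        eol_char=csv_config["eol_character"],
        encoding="utf8",  # the proposal pins utf-8; polars spells it "utf8"
    )
```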
As noted, some targets don't deal well with newline characters. We could optionally add a `newline_replacement` option, which would accept a string value to represent the newline character. Sources which receive this as input should start out with a full replacement in their string-like columns, so that the newline character is replaced with the specified input.
Targets receiving this input can either ignore the treatment and simply import the replacement string as data, or they can implement a replace operation post-load. We could very likely leave this option out in the first iteration - and even if we include it, we would recommend it only be used for targets that have difficulty with newlines.
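A minimal sketch of what the tap-side treatment could look like (the function name and constant are illustrative, not an existing SDK API):

```python
# Hypothetical tap-side treatment for a `newline_replacement` option: replace
# literal newlines in string-typed values before the CSV batch file is written.
NEWLINE_REPLACEMENT = "\\n"  # the two characters backslash + "n"


def apply_newline_replacement(record: dict) -> dict:
    """Return a copy of the record with newlines replaced in string values."""
    return {
        key: value.replace("\n", NEWLINE_REPLACEMENT) if isinstance(value, str) else value
        for key, value in record.items()
    }
```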
The few things I see almost everyone mess up with CSV generation are:
1. Not using a `quote_char`.
2. Not escaping quotes that appear in the data - you need a way to escape quotes that are in your data set, as it's possible the data itself contains `",`.
3. Newlines inside `quote_char`'s, or providing an alternate mapping for line feeds.

If you solve those couple things (along with enforcing utf-8) that solves almost every class of issue I've hit over lots and lots of CSVs (I think we were ingesting them from ~70 unique sources, in ~700 implementations).
Having said that, the csv module that Python ships surprises me at how good it is all the time! I hit a lot of these issues using a much less-used platform (Groovy).
@visch - Good points, and yes, I can confirm I've seen each of these in the wild also.
In comparing against my last comment, the one thing I think we're missing that I've seen in other implementations is the `escape_char` option, or an escape mechanism for the `quote_char` appearing in the string itself. (Your number 2 above.)
I'll check into this ☝️ and see how Polars handles it. Update: since the Polars docs do not mention an escape character, but they do mention automatic escaping of the quote character, it appears that they default to escaping via doubling the quote character. We probably should document that somewhere. So, the string `His name is "Peter".` would be escaped as `"His name is ""Peter""."`.
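For reference, Python's built-in `csv` module exhibits the same doubling behavior by default (`doublequote=True`), so the two implementations should agree; a quick sketch:

```python
import csv
import io

# With the default doublequote=True, an embedded quote char is escaped by doubling it.
buffer = io.StringIO()
writer = csv.writer(buffer, quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['His name is "Peter".'])
print(buffer.getvalue())  # -> "His name is ""Peter""." (plus the line terminator)
```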
Your number 3 point is noted above in my "Dealing with newline issues" section. Probably we can start with an assumption that targets will need to handle newlines within the data stream. However, we should also plan for a follow-on story that introduces a tap-level option to replace `\n` with `'\n'` if the developer of the target knows that the target chokes when (natively) parsing newlines. (Clever users could solve this with stream maps, but we don't want that to be required for basic interop.)
It's worth noting that platforms that can't natively support newlines in CSV files could just not implement a native CSV import - instead nudging users to prefer JSONL or Parquet, while letting the SDK handle CSV parsing on their behalf.
I think we're very close to a final proposal here.
Let me see if I can summarize:
"encoding": "csv"
within batch_config
for taps.If users specify "encoding": "csv"
then SDK-based taps will automatically emit CSV files as their BATCH output. (No dev work required from tap maintainer.)
An config option for encoding_config: {...}
will be accepted which contains the above-mentioned CSV-specific config options. (TBD whether it should be optional or required.)
This doesn't require any target-level config, because the BATCH
record type spec already includes the necessary info regarding encoding and config options.
The SDK will send the encoding_config
within the batch record messages, so that targets get the same config spec as was provided to the tap.
.csv.gz
support will be configurable by using {encoding: csv, compression: gzip}
just as we already have support for .jsonl.gz
files.Not explicitly mentioned above, but we should assume that we want compatibility with gzip, just as with .jsonl.gz
.
- If a native handler is provided by the developer, then the SDK will delegate this work to the tap or target implementation method(s), rather than using the built-in SDK handlers.
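A rough sketch of how the pieces could fit together - the shapes below are illustrative only, not a settled spec (the real BATCH message schema may differ):

```python
# Illustrative tap config: the CSV dialect travels alongside the encoding choice.
batch_config = {
    "encoding": {"format": "csv", "compression": "gzip"},
    "encoding_config": {
        "has_header": True,
        "separator": ",",
        "quote_char": '"',
        "null_value": "",
        "eol_character": "\n",
    },
}

# Illustrative BATCH message as seen by the target: the same dialect options are
# embedded in the message, so no target-level config is required.
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": batch_config["encoding"],
    "encoding_config": batch_config["encoding_config"],
    "manifest": ["file:///tmp/users-batch-0001.csv.gz"],
}
```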
Proposed defaults:
- `has_header` - default to `True`
- `separator` - default to `,`
- `quote_char` - default to `"` (double quote)
- `null_value` - default to the empty string
- `eol_character` - default to `\n` (newline)

In the first iteration, these would all be unsupported:
1. Any encoding other than `utf-8`.
2. Escape mechanisms other than doubling the quote character.
3. Non-handling of newlines in record data.

Any targets which cannot natively handle these constraints should not advertise their own native CSV processing methods. Instead, those should let the polars/SDK implementation handle the conversion from CSV file into record messages. (Future iterations could add options here.)
If any options are provided that do not match the agreed-upon config options, the tap will fail. The SDK can handle this automatically, so again, tap developers don't need to write this logic.
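A sketch of the kind of validation the SDK could perform on the tap developer's behalf (names are illustrative):

```python
# Illustrative validation the SDK could run before emitting or consuming batches.
SUPPORTED_CSV_OPTIONS = {
    "has_header",
    "separator",
    "quote_char",
    "null_value",
    "eol_character",
}


def validate_csv_encoding_config(encoding_config: dict) -> None:
    """Fail fast if the user supplied dialect options outside the agreed-upon set."""
    unsupported = set(encoding_config) - SUPPORTED_CSV_OPTIONS
    if unsupported:
        raise ValueError(
            f"Unsupported CSV dialect options: {sorted(unsupported)}. "
            f"Supported options: {sorted(SUPPORTED_CSV_OPTIONS)}."
        )
```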
We have yet to decide if the SDK should have any default CSV config for taps - or if the absence of a CSV config spec should just generate the "best practice" config that we describe above as the "defaults". (Regardless of what the user experience is, all messages from tap to target will be explicit in regards to those config options.)
Also undecided is exactly where the extra config should be provided. The above recommends `encoding_config` as a sibling to the existing `encoding` option, but there may be other options to consider as well.
Note to readers: I've added the Accepting Pull Requests label. Let us know here or in Slack if you would like to contribute!
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.
Just wanted to comment that there is a CSV-related RFC, RFC 4180, which defines a CSV format for MIME attachments.
The good news is that I think the spec above meets most if not all of it - specifically, it defines quote escaping using the doubled double-quote.
Also a question - does the unsupported item 3 "Non-handling of newlines in record data." mean that newlines in record data will not be handled, or that they will be (the implied double-negative is confusing)? This is an important requirement for just about any CSV handling - for example, the presence of addresses just about guarantees multi-line values.
Assuming stalebot messed this one up @edgarrmondragon ?
Yeah, this is still relevant
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.
As noted elsewhere, this is relatively trivial to implement engineering-wise but non-trivial config-wise/spec-wise, and specifically because there isn't and never has been an exact "CSV spec".
"CSV" is really not a single spec
Rather, CSV should be considered a family of related formats, which may or may not be comma-delimited. The Wikipedia article (unintentionally) does a pretty good job of highlighting the challenges here:
From https://en.wikipedia.org/wiki/Comma-separated_values (emphasis mine):
Commonly unsupported options
Examples of classic challenges when using CSV files as an interchange format:
- Delimiters other than the comma, such as `';'`.
- Null values represented as `''` (empty string) or another sentinel.
- Newline characters within record data. Sometimes it is preferable to replace the newline character with `'\n'`, and then the dbt layer or another post-processing step may replace the string `'\n'` with an actual newline character.

Not re-inventing the wheel
As has been helpful in other contexts, I'll propose we start with an already existing spec for config.
According to ChatGPT, these are the most popular ways to process CSV files with Python:

> 1. csv.reader(): This is a built-in Python library that provides a fast and efficient way to read CSV files. It returns an iterator that can be used to iterate over the rows in the CSV file. This is the most popular method of reading CSV files in Python.
> 2. pandas.read_csv(): This is a popular method of reading CSV files in Python. It uses the pandas library, which provides high-performance, easy-to-use data structures and data analysis tools. It returns a DataFrame object that can be used to manipulate the data.
> 3. numpy.loadtxt(): This is another popular method of reading CSV files in Python. It uses the numpy library, which provides a powerful array computing and linear algebra toolset. It returns an array object that can be used to manipulate the data.
> 4. csv.DictReader(): This is similar to csv.reader(), but it returns a dictionary object for each row, with the keys being the column headers and the values being the row data. This can be useful if you need to access specific columns of the data.

Essentially, this would point us towards one of three libraries: `csv`, `pandas`, or `numpy`. The `csv` library is the only one shipped with the standard library. The `pandas` and `numpy` libraries would each require an additional import (and each with their own dependencies).

If we prioritize serial and incremental reading/writing, I think `csv.reader()` or `csv.DictReader()` makes good sense, with `csv.reader()` likely having better performance, and `csv.DictReader()` having the advantage of pre-processing the dataset into `dict` objects, which we need to do anyway.

If we prioritize flexibility of configuration, the `pandas` config options may be more robust and expressive overall. The `pandas` library also has pre-built dialects for `Excel` and other common formats. What I don't know (personally) is whether `pandas`'s prioritization of in-memory analytics would introduce any penalty versus other serial `one-record-in-one-record-out` methods which may have a lower RAM footprint. Another disadvantage of using `pandas` is that we would then need to ship potentially _all_ SDK taps and targets with the `pandas` library in order to have true universal interop.

UPDATE: Proposal using polars as the config base here: https://github.com/meltano/sdk/issues/1584#issuecomment-1499263096
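For comparison, incremental reading with `csv.DictReader()` is only a few lines and keeps memory usage flat regardless of file size:

```python
import csv


def iter_csv_records(path: str):
    """Yield one dict per row, streaming the file instead of loading it whole."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row  # e.g. {"id": "1", "name": "Alice"}
```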
Distinction of config dialect vs actual implementation details.
We can optionally decouple the configuration dialect from how the files actually get processed - although pragmatically speaking, it is easier and more stable if we keep these aligned, at least for the default SDK implementations.
It's also worth calling out that, whatever configuration dialect we choose, native processors will need a translation step. So, if we chose the pandas config dialect, for instance, which accepts an "Excel" dialect, any source or target receiving this instruction would have to translate this config detail into some set of config parameters that the connected system can understand.
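To make that translation step concrete, here is a hedged sketch of mapping the proposed dialect options onto Snowflake `FILE_FORMAT` parameters (the Snowflake option names are real; the mapping function itself is illustrative):

```python
def to_snowflake_file_format(encoding_config: dict) -> str:
    """Translate the proposed CSV dialect options into a Snowflake FILE_FORMAT clause."""
    skip_header = 1 if encoding_config.get("has_header", True) else 0
    return (
        "FILE_FORMAT = (TYPE = CSV"
        f", SKIP_HEADER = {skip_header}"
        f", FIELD_DELIMITER = '{encoding_config.get('separator', ',')}'"
        ", FIELD_OPTIONALLY_ENCLOSED_BY = '\"'"
        f", NULL_IF = ('{encoding_config.get('null_value', '')}')"
        ", ENCODING = 'UTF8')"
    )


# e.g. to_snowflake_file_format({"has_header": True, "separator": ","})
# -> FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1, FIELD_DELIMITER = ',', ...)
```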
Raising an exception for unsupported dialect options
We likely would need to introduce a new exception class which would be used specifically for the purpose of raising and catching cases where config options are unexpected or unhandled. The raising of this exception might not fail the entire sync operation, but rather this could trigger base-level SDK-driven processors...
Graceful failover to SDK processing in case of unsupported dialects
Of course, if a connected system like Snowflake or Redshift cannot natively understand the dialect options that the user provides, we may have to send these processing directives to the SDK-backed native processors. In this way, we can guarantee that any CSV file sent within our range of expected config options will be successfully processed, even if it cannot be understood natively by the source or target system.
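A sketch of how that exception-plus-failover flow could look (all names are hypothetical placeholders, not existing SDK API):

```python
class UnsupportedDialectError(Exception):
    """Hypothetical exception for dialect options a native processor cannot honor."""


def native_import_csv(file_url: str, encoding_config: dict) -> None:
    # Placeholder for a connector's native bulk import (e.g. a COPY statement).
    if encoding_config.get("separator", ",") != ",":
        raise UnsupportedDialectError("native loader only handles comma-delimited files")
    print(f"natively importing {file_url}")


def sdk_fallback_import_csv(file_url: str, encoding_config: dict) -> None:
    # Placeholder for the SDK-backed, row-by-row fallback processor.
    print(f"falling back to SDK processing for {file_url}")


def process_csv_batch(file_url: str, encoding_config: dict) -> None:
    try:
        native_import_csv(file_url, encoding_config)
    except UnsupportedDialectError:
        # Slower, but guarantees the batch is still processed successfully.
        sdk_fallback_import_csv(file_url, encoding_config)
```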
While this graceful failover is nice from a usability perspective, it may produce unexpected effects performance-wise. If the user expects a CSV to be natively processed by `target-snowflake`, for instance, but the specific dialect is not natively supported by Snowflake (or the dev's Snowflake implementation), then records will be loaded much slower than the end-user might otherwise expect. Which leads us to the next point, regarding an end-user option to fail when native processing is expected but not actually available...
Config option to restrict to native batch-processing only
To mitigate the above, it might be worthwhile to introduce something like a `"batch_config": {"require_native_processor": true}` config that gives users the ability to fail a pipeline if the config dialect cannot be natively processed by the upstream or downstream system. While this is especially needed for CSV, which has an almost infinite cross-product of dialect options, it also could apply to `JSONL` and `parquet` parsing - wherein the tap or target could abort rather than go through the SDK-backed conversion of records to files, which will always be slower than a native bulk export or bulk import operation.
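A sketch of the decision logic such a flag implies (names are hypothetical):

```python
class NativeProcessorUnavailableError(Exception):
    """Hypothetical error: native processing was required but is not supported."""


def resolve_processor(batch_config: dict, native_supported: bool) -> str:
    """Decide between native and SDK-backed processing (illustrative logic only)."""
    if native_supported:
        return "native"
    if batch_config.get("require_native_processor", False):
        # The user opted out of the slower fallback: fail the pipeline instead.
        raise NativeProcessorUnavailableError(
            "The configured CSV dialect cannot be processed natively by this target."
        )
    return "sdk-fallback"
```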