anthonyp opened this issue 7 years ago
I see two separate solutions here.
If a tap emits schemas for its records, this target should use the schema to generate the header for the file, and all records should go to that file. If the records are truly non-rectangular (meaning the schema is a superset of the various kinds of data returned from the API), then missing columns for a given record should be marked with _SINGER_MISSING_COLUMN in the resulting CSV to disambiguate between real NULL values in a record and records that were of a different shape than other records in the same stream.
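The sentinel idea in that first solution could look something like the following sketch. The `write_with_schema` helper and its arguments are illustrative, not part of any existing target; only the `_SINGER_MISSING_COLUMN` name comes from the proposal above.

```python
import csv
import io

MISSING = "_SINGER_MISSING_COLUMN"  # sentinel name from the proposal above

def write_with_schema(records, schema_properties, fh):
    """Write all records under one header derived from the schema.

    Columns absent from a record get the sentinel, so a real NULL
    (None, written as an empty field) stays distinguishable from a
    key that was simply missing from that record.
    """
    headers = list(schema_properties)
    writer = csv.DictWriter(fh, fieldnames=headers)
    writer.writeheader()
    for record in records:
        writer.writerow({h: record.get(h, MISSING) for h in headers})

buf = io.StringIO()
write_with_schema(
    [{"id": 1, "name": None}, {"id": 2}],
    ["id", "name"],
    buf,
)
```

Here the first record has a real NULL for `name` (empty cell), while the second record never had a `name` key at all and gets the sentinel.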
If a tap does not emit schemas, then this target should cut a new file with new headers matching the given record each time it encounters records of a different shape than what it had seen already for that stream. As @micaelbergeron suggested, it would probably be good for the target to track header configurations as it proceeds, so that it can append to an existing file if the API is returning records of differing shapes in an interleaved fashion.
Micaël Bergeron [3:06 PM] I would probably keep a schema -> file reference somewhere so you can append back if it comes back to the old schema
md5(schema) -> file, bingo
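That md5(schema) -> file bookkeeping could be sketched roughly as follows. All names here (`schema_key`, `writer_for`, the file-naming scheme) are made up for illustration; the only idea taken from the thread is hashing the header shape and appending when a known shape reappears.

```python
import csv
import hashlib
import json

def schema_key(headers):
    """Stable key for a header shape (order-sensitive, like a CSV header)."""
    return hashlib.md5(json.dumps(list(headers)).encode()).hexdigest()

open_writers = {}  # schema_key -> csv.writer, so interleaved shapes can append

def writer_for(stream, headers):
    """Return a writer for this header shape, reopening the matching
    file in append mode when an already-seen shape comes back later."""
    key = schema_key(headers)
    if key not in open_writers:
        fh = open(f"{stream}-{key[:8]}.csv", "a", newline="")
        writer = csv.writer(fh)
        if fh.tell() == 0:  # brand-new file: emit the header once
            writer.writerow(headers)
        open_writers[key] = writer
    return open_writers[key]
```

Because records stream through one at a time and each shape maps to its own file, nothing is buffered beyond the open file handles.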
Note: No solution that requires buffering the entire resultset in memory should be considered acceptable. We still need to stream. :)
PRs are very welcome for either or both of those solutions.
Running into the same issue. The main problem in my case is nested fields that can have content, producing new keys:
"logic_path": { "1": "7624040", "2": "7624106" }
which flattens to:
logic_path__1, logic_path__2
The first record does not have those fields.
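The flattening that produces those double-underscore keys might look like this minimal sketch; the `flatten` name is hypothetical, and the separator simply mirrors the example above.

```python
def flatten(record, parent="", sep="__"):
    """Flatten nested dicts into single-level keys joined by `sep`."""
    out = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested dicts
        else:
            out[name] = value
    return out

flatten({"id": 9, "logic_path": {"1": "7624040", "2": "7624106"}})
# -> {"id": 9, "logic_path__1": "7624040", "logic_path__2": "7624106"}
```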
I made a workaround, calling it option 3: add missing fields to the end of the header.
This solves my problem of shifted fields while reading the CSV. I was diving into the hashing solution, but the SCHEMA is sent only once,
so there is no need for schema housekeeping and hashing. Downside: the target needs to check that the current header is a superset of the record's fields, and when new fields are found, the whole file (the header, actually) is rewritten.
I'll make a PR for my solution so you can check it out.
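A rough sketch of that option-3 behavior, streaming one record at a time and copying the file only when the header grows. All names here (`append_record`, `_rewrite_header`) are illustrative and not taken from the actual PR.

```python
import csv
import os

def append_record(path, record, headers):
    """Write one flattened record, growing the header as needed.

    New keys go to the end of the header so existing rows stay
    aligned; when the header grows, the first line is replaced by
    copying the data rows into a fresh file.
    """
    new_keys = [k for k in record if k not in headers]
    if new_keys:
        headers.extend(new_keys)
        _rewrite_header(path, headers)
    with open(path, "a", newline="") as fh:
        csv.DictWriter(fh, fieldnames=headers).writerow(
            {h: record.get(h, "") for h in headers}
        )

def _rewrite_header(path, headers):
    tmp = path + ".tmp"
    with open(tmp, "w", newline="") as out:
        csv.writer(out).writerow(headers)
        if os.path.exists(path):
            with open(path) as src:
                next(src, None)  # skip the old header line
                out.writelines(src)  # copy data rows unchanged
    os.replace(tmp, path)
```

Rows written before a column appeared simply have fewer fields than the header; `csv.DictReader` fills those in as missing, so nothing shifts.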
This target uses the flattened keys from the first record as the headers for the entire spreadsheet. However, in some cases a tap will produce records with varying keys (for example, this happens with many streams in the HubSpot tap). When this occurs, the data rows in the CSV no longer line up with the headers.