singer-io / target-csv

Write Singer data to CSV files
GNU Affero General Public License v3.0

CSV Target assumes first record has same headers as the rest #3

Open anthonyp opened 7 years ago

anthonyp commented 7 years ago

This target uses the flattened keys from the first record as the headers for the entire spreadsheet. However, in some cases, a tap will produce records with varying keys (for example, this happens with many streams in the HubSpot tap). When this occurs, the data rows in the CSV will mismatch the headers.
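A minimal reproduction of the mismatch (using made-up records, not actual HubSpot output): `csv.DictWriter` is given headers from the first record only, so a later record carrying an extra key either raises or, with `extrasaction='ignore'`, silently drops data:

```python
import csv
import io

records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},  # extra key
]

buf = io.StringIO()
# Headers taken from the first record only -- the behavior described above.
writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
writer.writeheader()
try:
    for rec in records:
        writer.writerow(rec)
except ValueError as err:
    # The second record's "email" key is not in the header.
    print("header mismatch:", err)
```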

timvisher commented 5 years ago

I see two separate solutions here.

  1. If a tap emits schemas for its records, this target should use the schema to generate the header for the file, and all records should go to that file. If the records are truly non-rectangular (meaning the schema is a superset of the various shapes of data returned from the API), then missing columns for a given record should be marked with _SINGER_MISSING_COLUMN in the resulting CSV, to disambiguate real NULL values in a record from records that simply had a different shape than others in the same stream.

  2. If a tap does not emit schemas, then this target should cut a new file with new headers matching the given record each time it encounters records of a different shape than what it had seen already for that stream. As @micaelbergeron suggested, it would probably be good for the target to track header configurations as it proceeds so that it can append to an existing file if the API is returning records of differing shapes in an interleaved fashion.

    Micaël Bergeron [3:06 PM] I would probably keep a schema -> file reference somewhere so you can append back if it comes back to the old schema

    md5(schema) -> file, bingo

Note: No solution that requires buffering the entire resultset in memory should be considered acceptable. We still need to stream. :)

PRs are very welcome for either or both of those solutions.

abij commented 4 years ago

Running into the same issue. The main problem in my case is nested fields that may or may not have content, producing new keys: "logic_path": { "1": "7624040", "2": "7624106" }

resulting in: logic_path__1, logic_path__2

The first record does not have those fields.

I made a workaround, calling it option 3: add missing fields to the end of the header.

This solves my problem of shifted fields while reading the CSV. I was diving into the hashing solution, but the SCHEMA is sent only once, so there is no need for schema housekeeping and hashing. Downside: we need to check that the current header is a superset of each record's fields, and when new fields are found the header of the file has to be rewritten.
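Option 3 could be sketched like this (a simplified illustration, not the actual PR: `append_with_growing_header` is a hypothetical name, and for simplicity this version rewrites the whole file when the header grows, which re-reads existing rows rather than patching the header in place):

```python
import csv
import os

def append_with_growing_header(path, record):
    """Append `record` to a CSV; new fields extend the header at the end."""
    header, existing_rows = [], []
    if os.path.exists(path):
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh)
            header = list(reader.fieldnames or [])
            existing_rows = list(reader)
    new_fields = [k for k in record if k not in header]
    header += new_fields  # missing fields go to the end of the header
    if new_fields or not existing_rows:
        # Header changed (or first write): rewrite with the wider header.
        # restval="" fills columns a row does not have.
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=header, restval="")
            writer.writeheader()
            for row in existing_rows:
                writer.writerow(row)
            writer.writerow(record)
    else:
        with open(path, "a", newline="") as fh:
            csv.DictWriter(fh, fieldnames=header, restval="").writerow(record)
```

Earlier rows get empty strings in the new columns, so every data row stays aligned with the header when the CSV is read back.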

I'll make a PR for my solution so you can check it out.