jwills / target-duckdb

A Singer.io target for DuckDB

Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n with github data #32

Open robfe opened 8 months ago

robfe commented 8 months ago

Describe the bug Certain data causes the error duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n

To Reproduce Steps to reproduce the behavior: I was running through the Meltano "getting started" tutorial but I made two changes: I targeted DuckDB instead of Postgres, and I selected all GitHub data.

When I run 'meltano run tap-github target-duckdb', I get the error mentioned above.

Expected behavior All of the data from GitHub should be loaded into DuckDB (many of the tables do arrive before this error occurs).

Screenshots The entire contents of the CSV in question are:

,[],CONTRIBUTOR,"{""label"": ""robfe:master"", ""ref"": ""master"", ""sha"": ""5432aac028f95d6be47b6cc63e5080175d80b9d8"", ""user"": {""login"": ""robfe"", ""id"": 281453, ""node_id"": ""MDQ6VXNlcjI4MTQ1Mw=="", ""avatar_url"": ""https://avatars.githubusercontent.com/u/281453?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/robfe"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 21383686, ""node_id"": ""MDEwOlJlcG9zaXRvcnkyMTM4MzY4Ng=="", ""name"": ""SpecLight"", ""full_name"": ""robfe/SpecLight"", ""html_url"": ""https://github.com/robfe/SpecLight""}}","dependency
",2016-06-14T17:30:58Z,,https://api.github.com/repos/robfe/SpecLight/issues/10/comments,https://api.github.com/repos/robfe/SpecLight/pulls/10/commits,2016-06-13T23:43:26Z,https://github.com/robfe/SpecLight/pull/10.diff,False,"{""label"": ""FinestV:Fix-Broken-System-Web-WebPages-Dependency"", ""ref"": ""Fix-Broken-System-Web-WebPages-Dependency"", ""sha"": ""3cdc44ca834873174b955bc413812dc0e0782703"", ""user"": {""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 61009600, ""node_id"": ""MDEwOlJlcG9zaXRvcnk2MTAwOTYwMA=="", ""name"": ""SpecLight"", ""full_name"": ""FinestV/SpecLight"", ""html_url"": ""https://github.com/FinestV/SpecLight""}}",https://github.com/robfe/SpecLight/pull/10,73660537,[],False,fb263052304eeb4b1810d9e4d8cd4dc36beb0046,2016-06-14T17:30:58Z,,MDExOlB1bGxSZXF1ZXN0NzM2NjA1Mzc=,10,robfe,https://github.com/robfe/SpecLight/pull/10.patch,,,SpecLight,21383686,[],https://api.github.com/repos/robfe/SpecLight/pulls/comments{/number},https://api.github.com/repos/robfe/SpecLight/pulls/10/comments,closed,https://api.github.com/repos/robfe/SpecLight/statuses/3cdc44ca834873174b955bc413812dc0e0782703,Add dll required to fix the broken System.Web.Webpages reference,2018-02-18T09:36:38Z,https://api.github.com/repos/robfe/SpecLight/pulls/10,"{""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}"

The CSV corresponds to the GitHub data for https://github.com/robfe/SpecLight/pull/10

Your environment macOS Catalina, oh-my-zsh, with Python 3.11.7


jwills commented 8 months ago

Hrm, is the repo in question private? Can you share enough of the meltano config for me to try to re-run it locally? (I'm assuming it's trying to pull/load all of the data from that SpecLight project?)

robfe commented 8 months ago

Hi @jwills!

I've uploaded the meltano project to https://github.com/robfe/meltano-tutorial-repro, hopefully it behaves the same for you as for me :D

Yes, it's configured to pull all of the data from that SpecLight project (this was just a learning exercise for me).

But I expect it would repro if I only synced the PR descriptions - I think the newline characters that GitHub returns in PR descriptions are triggering this (and that PR has a trailing newline on its description).

robfe commented 8 months ago

FWIW it looks like DuckDB can correctly load that CSV on its own (see the whitespace char in column04's cell):

[screenshot: DuckDB loading the CSV, with the whitespace character visible in column04's cell]

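For anyone who wants to check the same thing from Python, here's a rough equivalent (the file name is hypothetical; point it at the temp CSV the target wrote):

    import duckdb

    con = duckdb.connect()
    # read_csv_auto sniffs the dialect, including quoted cells that span lines
    rows = con.execute(
        "SELECT * FROM read_csv_auto('pull_requests.csv', header=false)"
    ).fetchall()
    # the cell with the trailing newline ('column04' in the screenshot above)
    print(repr(rows[0][4]))
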
robfe commented 8 months ago

FYI I've found a workaround: use stream maps on the source to strip out any newline chars:

      stream_maps:
        commits:
          # drop the nested commit object entirely
          commit: __NULL__
          # replace both newline styles with an HTML line break
          commit_message: "commit.message.replace('\\r', '<br/>').replace('\\n', '<br/>')"

jwills commented 8 months ago

ah nice -- I was just about to run this locally; I'll see if there is anything I can do to make the process a bit smoother here.

robfe commented 7 months ago

Hi @jwills have you had any luck?

I can confirm that it doesn't matter what the tap / data source is - if there are newlines present in any of the cells, then this error occurs

However, it's not occurring 100% of the time. I had a 267-row Postgres table (with text columns that contained lots of newlines) that loaded fine on the first run with Meltano, but once I ran an incremental load, a single row crashed the loader.

(This is before attempting the above workaround on that particular project, which involves a LOT of tables and columns :| )

jwills commented 7 months ago

ah no I promptly forgot about this issue, sorry!

The workaround stopped working, or you encountered a new use case where it doesn't, or?

robfe commented 7 months ago

The workaround works, but you have to apply it on a column-by-column basis. Also, you have to decide what to use in place of the newline character.

My production use case for this loader is to replicate our transactional database into DuckDB. Unlike the GitHub example above, where I was just trying things out, the transactional DB has a few hundred text columns. These could hold simple strings, JSON, Markdown, HTML, comments, descriptions, etc…

it would be ideal if I didn't have to figure out, on a case-by-case basis, a good replacement for the newline character in each column

it’s still possible of course! But if there’s any way I can help solve this at the root instead of working around the issue each (numerous) time it’s encountered, I’d be really keen to help…

jwills commented 7 months ago

yeah that totally makes sense, thanks; I'm trying to think about how I would correctly escape all of the newlines in the general case though, that seems hard.

Looking at how we're doing this here: https://github.com/jwills/target-duckdb/blob/main/target_duckdb/db_sync.py#L378

...have you tried setting a quotechar for the ingest? I'm wondering how much of the problem that would solve. I may also need to adjust the COPY statement below that to explicitly configure the newline/delimiter/quote characters.
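For reference, the shape of change I'm imagining is roughly this (a sketch, not the actual db_sync.py code; table and file names are hypothetical, and DELIM/QUOTE/ESCAPE are documented DuckDB CSV options):

    import duckdb

    con = duckdb.connect("my.duckdb")  # hypothetical database file
    # spell out the CSV dialect instead of relying on DuckDB's sniffer
    con.execute(
        "COPY my_table FROM 'records.csv' "
        "WITH (FORMAT CSV, DELIM ',', QUOTE '\"', ESCAPE '\"')"
    )
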

jwills commented 7 months ago

Something like this, maybe https://github.com/jwills/target-duckdb/pull/34

robfe commented 6 months ago

Something like this, maybe #34

I am not 100% sure if this will fix it, and I'm not sure how to test it until you release (not so great with Python, sorry).

But something that I do think will work: when writing out the CSV file, any time you write a string cell, make sure the string's newlines are all \r\n, not just \n.

Currently I achieve the same effect in stream mappings with the expression:

    str((self or "").replace("\r\n", "\n").replace("\n", "\r\n"))
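As a rough Python sketch of that idea (not the target's actual code; the record below is made up):

    import csv

    def normalize_newlines(value):
        # Collapse \r\n to \n first so existing pairs aren't doubled,
        # then expand every bare \n to \r\n before the cell hits the CSV.
        if isinstance(value, str):
            return value.replace("\r\n", "\n").replace("\n", "\r\n")
        return value

    record = {"id": 10, "body": "line one\nline two\n"}  # hypothetical record
    with open("records.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([normalize_newlines(v) for v in record.values()])
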

srpwnd commented 6 months ago

I'm encountering the same issue. I tried running the #34 version but nothing changes and I still receive

     cur.execute("COPY {} FROM '{}' WITH (delim '{}', quote '{}')".format(temp_table, temp_file_csv, self.delimiter, self.quotechar)) cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb
duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb

I'm currently trying to get this running with tap-hubspot.

Could setting the new_line parameter for CSVs possibly help?
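Concretely, something along these lines is what I mean (a sketch with hypothetical names; DuckDB documents new_line values of '\r', '\n', and '\r\n'):

    import duckdb

    con = duckdb.connect("my.duckdb")  # hypothetical database file
    # tell DuckDB which line terminator to expect instead of letting it guess
    con.execute(
        "COPY my_table FROM 'records.csv' "
        "WITH (FORMAT CSV, DELIM ',', QUOTE '\"', new_line '\\n')"
    )
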

srpwnd commented 6 months ago

So I've tested my assumption and got it working in #35.

Could you also try it out on your data, @robfe?