seanharr11 / etlalchemy

Extract, Transform, Load: Any SQL Database in 4 lines of Code.
MIT License
556 stars 83 forks source link

Lost rows #12

Open 27183 opened 7 years ago

27183 commented 7 years ago

This looks like a great tool and I am willing to try to help a bit to make it work. I am losing a few thousand rows with an extraction of a 5,800-row MSSQL table into SQLIte3 table. The source destination is fixed, I can change the destination.

I dug into the code a bit and saved the intermediate file dump of the table read. The rows are all there.

There is some column count confusion when using that file as the source.

Ryan

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/43515615-lost-rows?utm_campaign=plugin&utm_content=tracker%2F41641218&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F41641218&utm_medium=issues&utm_source=github).
seanharr11 commented 7 years ago

Hey Ryan - this is an issue with SQLite3 regarding Row Separators.

See Issue: https://github.com/seanharr11/etlalchemy/issues/11 for more details, but basically SQLite3 by default uses new-lines as the row-separators, which is fine, except that SQLite3 DOESN'T let you:

This is a major problem, because as you've seen, when you try to transfer a database that contains long string columns, theres a good chance that 1 of these columns will have a "\n" or "^M" character in them. This causes that row to be truncated halfway through that column, and corrupts both that row, and the next row. Worse, it does so silently because it doesn't know any better.

There is a proposed workaround outlined in the issue referenced above, which basically:

  1. Changes the row-separator to a more uncommon character (i.e. "{" )
  2. Replaces instances of this uncommon character with a similar character (i.e. "{" => "'(")

If you have a better solution, feel free to open a Pull Request, or feel free to implement the above! I'd love some help!

We could also try and open up a Pull Request with sqlite3 to allow quote-enclosed strings, this is the best solution.

27183 commented 7 years ago

Hi Sean,

Thank you for the detailed response, this is a great explanation.

I am going through Pandas DataFrame as an intermediary and having some success. Any interest in having me try to fold this into your code base or does that introduce unwanted dependencies?

I really like all of the table discovery, transformation hooks and status reporting that you do.

Ryan

On Mar 28, 2017, at 5:49 AM, Sean Harrington <notifications@github.com mailto:notifications@github.com> wrote:

Hey Ryan - this is an issue with SQLite3 regarding Row Separators.

See Issue: #11 https://github.com/seanharr11/etlalchemy/issues/11 for more details, but basically SQLite3 by default uses new-lines as the row-separators, which is fine, except that SQLite3 DOESN'T let you:

Escape new-line characters Enclose string columns in quotes This is a major problem, because as you've seen, when you try to transfer a database that contains long string columns, theres a good chance that 1 of these columns will have a "\n" or "^M" character in them. This causes that row to be truncated halfway through that column, and corrupts both that row, and the next row. Worse, it does so silently because it doesn't know any better.

There is a proposed workaround outlined in the issue referenced above https://github.com/seanharr11/etlalchemy/issues/11, which basically:

Changes the row-separator to a more uncommon character (i.e. "{" ) Replaces instances of this uncommon character with a similar character (i.e. "{" => "'(") If you have a better solution, feel free to open a Pull Request, or feel free to implement the above! I'd love some help!

We could also try and open up a Pull Request with sqlite3 to allow quote-enclosed strings, this is the best solution.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/seanharr11/etlalchemy/issues/12#issuecomment-289759009, or mute the thread https://github.com/notifications/unsubscribe-auth/AD9uUBdpaU0650s0XWVB0tVUg9A-eA9Cks5rqQHsgaJpZM4Mq9x8.

seanharr11 commented 7 years ago

What part of the code would it be updating?

Do you mean you are loading the .sql file into Pandas, and then inserting into SQLite3?

Or are you by-passing the write-to-disk step, storing the rows intermediately in Pandas in memory, and then inserting into SQLite3?

seanharr11 commented 7 years ago

Ryan - any update here? Would love to incorporate your changes if they make the tool more efficient and stable.