Binary data from SQL Server to PostgreSQL

endersonmaia commented 8 years ago

Hi, I'm migrating the data from SQL Server to PostgreSQL. I used the [dump]:(https://github.com/toy/dump), that worked almost 100%. I got an error with the bytea type in PostgresSQL.

I created this simple script to make the transfer where dump failed.

#!/usr/bin/env ruby
require 'sequel'

DB_src = Sequel.connect('tinytds://HOST/redmine')
DB_dst = Sequel.connect('postgres://HOST/redmine')

src_wcv = DB_src[:wiki_content_versions]
dst_wcv = DB_dst[:wiki_content_versions]

src_wcv.each { |r| dst_wcv << r }

I get this error

PG::CharacterNotInRepertoire: ERROR:  invalid byte sequence for encoding "UTF8": 0xda 0xcb (Sequel::DatabaseError)

I think since the filed is of type bytea, it shouldn't complain about UTF8 byte sequence.

Fetching the data via irb, the output is this :

irb(main):016:0> wcv.first[:data]
=> "x\xDA\xCB0\xD4S\b\xCF\xCC\xCE\x04\x00\n)\x02|"

I told the same here toy/dump#21.

What am I missing ?

jeremyevans commented 8 years ago

The tinytds adapter doesn't support returning blob data as Sequel::SQL::Blob, which means that they won't be literalized as a blob. You need to manually convert the blob values from String to Sequel::SQL::Blob instances before inserting them. I'm sorry there isn't an easier way, if tiny_tds adds support for custom type converters, I'll add support for it in Sequel.

endersonmaia commented 8 years ago

I don't think the problem is with TinyTDS, since I can print the binary output, and since the data inside this field is a gziped data, I can even get it's content.

I think it's more like a PostgreSQL adapter problem, that's is complaining for UNICODE, when it should just put the bytes inside the bytea field.

As you can see, I can get the data form Sequel with TinyTDS fine :

irb(main):001:0> require 'sequel'
=> true
irb(main):002:0> require 'zlib'
=> true
irb(main):003:0> DB_src = Sequel.connect('tinytds://HOST/redmine')
=> #<Sequel::TinyTDS::Database: "tinytds://HOST/redmine">
irb(main):004:0> src_wcv = DB_src[:wiki_content_versions]
=> #<Sequel::TinyTDS::Dataset: "SELECT * FROM [WIKI_CONTENT_VERSIONS]">
irb(main):008:0> src_wcv.first[:data]
=> "x\xDA\xCB0\xD4S\b\xCF\xCC\xCE\x04\x00\n)\x02|"
irb(main):009:0> Zlib::Inflate.inflate(src_wcv.first[:data])
=> "h1. Wiki"

jeremyevans commented 8 years ago

What is src_wcv.first[:data].class? If it is String, there's your problem. I needs to be Sequel::SQL::Blob in order for PostgreSQL to escape it as a bytea instead of a String. I didn't say this was a problem with tiny_tds per se, just that tiny_tds does not expose an interface that allows customization of the type converters, so Sequel doesn't have a way to automatically convert such columns to use Sequel::SQL::Blob.

endersonmaia commented 8 years ago

Ok, now I get it. Solved calling Sequel::SQL::Blob.new() on each row for the :data field, before calling the <<.

Thanks!

metaskills commented 8 years ago

Thanks gentleman for ping me. TinyTDS will likely never allow custom casting at the C level to another type. If a binary column is returned, we give you a binary encoded string. Our goal is to return a ruby primitive for each matching SQL Server type. We want other gems like the SQL Server adapter to take the work from there. Hope this helps.

jeremyevans commented 8 years ago

That's understandable, but a shame, as it forces callers to post-process the output at a significant speed hit if they want custom types.

metaskills commented 8 years ago

It depends. Most raw connections when I last looked returned strings. TinyTDS like MySQL casting to a matching Ruby primitive in C made it really fast. I mean really fast. So is it fair to say "significant" speed hit? Maybe for large binary data? But maybe I misunderstand the problem. If you have some time:

Why is wrapping a binary encoded string in a Sequel::SQL::Blob desirable?
Why does that object not act like a regular Ruby IO/String?

This could be my ignorance on the project. Just hoping to get my head around what this request was and learn a bit.

jeremyevans commented 8 years ago

Post processing all output from TinyTDS is a significant speed hit. It is true that doing the casting in C is faster than doing the casting in ruby, though. The best of both worlds would be to accept ruby procs for casting certain data types (e.g. https://github.com/jeremyevans/sequel_pg), and use those if given.

Note that Sequel::SQL::Blob is a subclass of String, it's a separate subclass because most databases require blobs to be escaped differently than regular strings. Doing this based on the encoding is a bad idea, as just because a string has a binary encoding does not mean it should be escaped as a blob.

Obviously, using Sequel::SQL::Blob as a default in TinyTDS would not make sense. Hence the desire for an interface where users of TinyTDS could setup custom type casting per connection.

metaskills commented 8 years ago

Thanks @jeremyevans that helps build my context up. Is Sequel heavily opinionated on the raw connection proc casting and/or does it provide type conversion when low level connections like TinyTDS do not provide that feature? Just wondering if there was a pull request opportunity to made the TinyTDS adapter better for Sequel. TIA!

jeremyevans commented 8 years ago

In general, when a driver does low level casting but does not support casting of specific types (such as blobs), Sequel chooses the faster approach of not overriding the low level casting. This is because most users would prefer things be faster, and it is not fair to burden them if they are not using those specific types.

In cases where the driver provides strings, and Sequel does the typecasting itself, then Sequel handle's the specific types if it is possible, since there is in general not a performance hit for doing so.

An example API for this would be something like:

c = TinyTds::Client.new
c.type_convertors['blob'] = proc{|v| ...}

I'm not sure if that particular API would work for TinyTDS, it probably depends on FreeTDS's internal APIs. For example, PostgreSQL and MySQL use type numbers instead of type names.

djay87 commented 6 years ago

@endersonmaia

Could you please provide the final script which actually worked for you ?

endersonmaia commented 6 years ago

Hey @djay87

#!/usr/bin/env ruby
require 'sequel'

DB_src = Sequel.connect('tinytds://user:password@host/database')
src_wcv = DB_src[:wiki_content_versions]

DB_dst = Sequel.connect('postgres://user:password@host/database')
dst_wcv = DB_dst[:wiki_content_versions]

src_wcv.each do |src|
  src[:data] = Sequel::SQL::Blob.new(src[:data])
  dst_wcv << src
end

jeremyevans / sequel

Binary data from SQL Server to PostgreSQL #1132