stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License

GridFS Support #90

Open apocolipse opened 9 years ago

apocolipse commented 9 years ago

I'm curious whether you've looked into GridFS support. GridFS is split across two consistently named collections (fs.files, fs.chunks), and there is a standalone adapter for fetching GridFS files (by filename or id), so I think it merits its own functionality rather than just mapping both collections to Postgres and trying to do the assembly on that side. I did some preliminary testing to see if it could work, using a '$gridfs' special source to trigger GridFS handling and then using the original document to grab the GridFS file by id.

I'm currently running into some encoding issues, however. Some imports succeed (large plaintext files, some PDFs), but others fail partway through with:

# in transform_to_copy()
'join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

My modification of fetch_special_source():

def fetch_special_source(db, ns, obj, source, original)
  case source
  when "$timestamp"
    Sequel.function(:now)
  when "$gridfs"
    dbname, collection = ns.split(".", 2)
    if collection == 'fs.files'
      grid = Grid.new(db)
      file = grid.get(original["_id"])
      Sequel::SQL::Blob.new(file.read)
    end
  when /^\$exists (.+)/
    # We need to look in the cloned original object, not in the version that
    # has had some fields deleted.
    fetch_exists(original, $1)
  else
    raise SchemaError.new("Unknown source: #{source}")
  end
end

(I also tried various combinations of hex transforms and UTF-8 encoding, but it still eventually gave me that ASCII error. For reference, the column type it's inserting into is BYTEA.)

Also, I had to add a db adapter argument to every method on the path from import_collections() in streamer.rb down to fetch_special_source() in schema.rb in order to create the GridFS object instance in fetch_special_source(). This seems bad. Any recommendation for where to put it?

nelhage commented 9 years ago

Hey – I haven't looked at implementing GridFS support, since I don't use it anywhere.

I agree that adding support might be useful, and I'd consider a PR. It'd probably be easier to review a strawman PR than try to speculate about the code via a description.

Hex-encoding the binary data is probably the way forward to fix the encoding issue, but I'd try to replicate it in a test and then add a bunch of debug prints or thereabouts to understand what's going on.
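
As a starting point for that kind of test, here is a minimal standalone sketch (not MoSQL code) that reproduces the same class of error and shows how hex-encoding sidesteps it. The helper names join_row and bytea_hex are hypothetical, and the tab-joined row only stands in for whatever transform_to_copy() actually builds. One plausible reason only some files fail: Ruby's join raises only when a row mixes a non-ASCII UTF-8 string with a binary string containing high bytes, so pure-ASCII file contents go through fine.

```ruby
# Hypothetical stand-in for the step that joins row fields into one line.
def join_row(fields)
  fields.join("\t")
end

# Hypothetical helper: render raw bytes in PostgreSQL's bytea hex format,
# i.e. a literal backslash, "x", then two hex digits per byte.
# The result contains only ASCII characters, so it joins with UTF-8 safely.
def bytea_hex(bytes)
  "\\x" + bytes.unpack1("H*")
end

utf8_field   = "caf\u00e9"            # UTF-8 string with a non-ASCII character
binary_field = "\xFF\xD8\xFF\xE0".b   # ASCII-8BIT string with high bytes

# Joining the raw binary string with non-ASCII UTF-8 text fails.
begin
  join_row([utf8_field, binary_field])
rescue Encoding::CompatibilityError => e
  puts "reproduced: #{e.message}"
end

# Hex-encoding first produces an ASCII-only field, so the join succeeds.
row = join_row([utf8_field, bytea_hex(binary_field)])
puts row
puts row.encoding
```

If the joined line is fed to COPY in text format, note that the backslash is an escape character there, so the leading backslash of the hex form would need to be doubled before sending.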