stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License

Document too large: This BSON document is limited to 4194304 bytes. (BSON::InvalidDocument) #101

Open wdarosh opened 9 years ago

wdarosh commented 9 years ago

I have been working with MongoDB 2.4.12 and am attempting to migrate to PostgreSQL 9.4.x. Most of the collections translate, but I am unable to get past this error.

I have tried swapping out the driver; however, I have had no luck getting MoSQL to detect and use the new driver.

/var/lib/gems/1.9.1/gems/bson-1.12.3/lib/bson/bson_c.rb:20:in `serialize': Document too large: This BSON document is limited to 4194304 bytes. (BSON::InvalidDocument)
        from /var/lib/gems/1.9.1/gems/bson-1.12.3/lib/bson/bson_c.rb:20:in `serialize'
        from /var/lib/gems/1.9.1/gems/bson-1.12.3/lib/bson.rb:19:in `serialize'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/schema.rb:212:in `transform'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:147:in `block (3 levels) in import_collection'
        from /var/lib/gems/1.9.1/gems/mongo-1.12.3/lib/mongo/cursor.rb:343:in `each'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:146:in `block (2 levels) in import_collection'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:70:in `block in with_retries'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:68:in `times'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:68:in `with_retries'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:145:in `block in import_collection'
        from /var/lib/gems/1.9.1/gems/mongo-1.12.3/lib/mongo/collection.rb:291:in `find'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:144:in `import_collection'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:122:in `block (2 levels) in initial_import'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:120:in `each'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:120:in `block in initial_import'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:108:in `each'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:108:in `initial_import'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/streamer.rb:28:in `import'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/cli.rb:167:in `run'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/lib/mosql/cli.rb:16:in `run'
        from /var/lib/gems/1.9.1/gems/mosql-0.4.3/bin/mosql:5:in `<top (required)>'
        from /usr/local/bin/mosql:23:in `load'
        from /usr/local/bin/mosql:23:in `<main>'

dmitrypisanko commented 8 years ago

I have the same problem. Any solution?

bbdurall commented 8 years ago

I've figured out a solution, but my Ruby knowledge is very minimal, so I'll need some assistance in getting this patch into the proper form to add to the repo.

First, install the deep clone gem, from the Unix shell: gem install ruby_deep_clone

Then, comment out line 212 of schema.rb: obj = BSON.deserialize(BSON.serialize(obj))

and underneath insert the following lines:

    require "deep_clone"
    obj = DeepClone.clone(original)

I don't think this is the proper way to introduce an external dependency to the project, but as a quick hack it worked for me. It's quite slow on large objects (it took over 5 minutes to process ~2000 rows containing large PDFs), but it eventually inserts them into the Postgres database.
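
For anyone who would rather not pull in an extra gem, a deep copy can also be sketched with Ruby's built-in Marshal module. This is only an illustration, not something verified against MoSQL: `deep_copy` is a hypothetical helper name, the sample document is made up, and the approach assumes every value in the document can be marshalled (plain hashes, arrays, strings, and numbers can; the BSON-specific types are not checked here).

```ruby
# Hypothetical helper: deep-copy a nested structure using only the
# standard library. Assumes every value inside can be marshalled.
def deep_copy(obj)
  Marshal.load(Marshal.dump(obj))
end

# Made-up sample document standing in for a MongoDB record.
original = { "_id" => 1, "file" => { "name" => "report.pdf", "tags" => ["a", "b"] } }
copy = deep_copy(original)

# Mutating nested objects in the copy leaves the original untouched,
# which is the property a deep clone is meant to provide.
copy["file"]["tags"] << "c"
copy["file"]["name"].upcase!
```

Because the copy shares no nested objects with the original, later in-place changes to the copy cannot leak back into the source document.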

ebroder commented 8 years ago

Hmm, the issue here is likely that BSON.serialize uses the original default maximum BSON size (4MB). The maximum has since been raised, but increasing it relies on negotiating the new limit with the connection.

Replacing BSON.serialize with something like BSON::BSON_CODER.serialize(obj, false, false, 16*1024*1024) will likely also fix your issue, without requiring a new dependency.
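
To make the numbers concrete: the 4194304 in the error message is 4 MiB, and the 16*1024*1024 in the suggested call is 16 MiB. A small plain-Ruby sketch of that arithmetic follows. `OLD_LIMIT`, `NEW_LIMIT`, and `fits?` are names made up for illustration and are not part of the bson gem, and the 6 MiB document size is an invented example.

```ruby
OLD_LIMIT = 4 * 1024 * 1024    # 4194304, the limit quoted in the error message
NEW_LIMIT = 16 * 1024 * 1024   # 16777216, the value passed in the suggested call

# Hypothetical check: would a document of this serialized size pass the limit?
def fits?(byte_size, limit)
  byte_size <= limit
end

pdf_doc_size = 6 * 1024 * 1024  # illustrative 6 MiB document, e.g. one with an embedded PDF

puts fits?(pdf_doc_size, OLD_LIMIT)  # false
puts fits?(pdf_doc_size, NEW_LIMIT)  # true
```

A document above 16 MiB would still be rejected under the raised limit, so this only helps for documents between the two sizes.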

bbdurall commented 8 years ago

I've verified that changing line 212 of schema.rb to: obj = BSON.deserialize(BSON::BSON_CODER.serialize(obj, false, false, 16*1024*1024)) fixes the issue. I tried to push a new branch with the fix, but I don't seem to have permission to do so. What's the best way to get this fix into the master branch?