stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License

String not valid UTF-8 (BSON::InvalidStringEncoding) #92

Open dcu opened 9 years ago

dcu commented 9 years ago

I get the following exception when importing a collection. The data should be valid, since it is already present in the database.

Any ideas?

    /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson/bson_c.rb:20:in `serialize': String not valid UTF-8 (BSON::InvalidStringEncoding)
    from /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson/bson_c.rb:20:in `serialize'
    from /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson.rb:19:in `serialize'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/schema.rb:212:in `transform'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:148:in `block (3 levels) in import_collection'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:335:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:147:in `block (2 levels) in import_collection'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:71:in `block in with_retries'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:69:in `times'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:69:in `with_retries'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:146:in `block in import_collection'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/collection.rb:291:in `find'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:145:in `import_collection'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:123:in `block (2 levels) in initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:121:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:121:in `block in initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:109:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:109:in `initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:28:in `import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:167:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:16:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/bin/mosql:5:in `<top (required)>'
    from /usr/local/bin/mosql:23:in `load'
    from /usr/local/bin/mosql:23:in `<main>'

Please note this is failing even with the --unsafe flag.
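
One way to narrow this down, as a rough sketch: walk a document and list every string field whose bytes are not valid UTF-8. This assumes the failure comes from a string value (for example, raw binary data stored as a string) rather than from the driver itself. `invalid_string_paths` and the sample `doc` are hypothetical names for illustration, not part of MoSQL or the BSON gem.

```ruby
# Recursively collect the dotted paths of every String whose bytes do not
# form valid UTF-8 when reinterpreted as UTF-8.
# NOTE: invalid_string_paths is a hypothetical helper, not a MoSQL/BSON API.
def invalid_string_paths(value, path = nil)
  case value
  when Hash
    value.flat_map { |k, v| invalid_string_paths(v, [path, k].compact.join(".")) }
  when Array
    value.each_with_index.flat_map { |v, i| invalid_string_paths(v, "#{path}[#{i}]") }
  when String
    # dup so the caller's string keeps its original encoding tag
    utf8 = value.dup.force_encoding(Encoding::UTF_8)
    utf8.valid_encoding? ? [] : [path.to_s]
  else
    []
  end
end

# Made-up sample document standing in for one fetched from MongoDB.
doc = {
  "_id"  => 1,
  "name" => "ok",
  "blob" => "\xFF\xFE raw bytes".b,
  "tags" => ["fine", "caf\xE9".b],
  "meta" => { "note" => "\xC3\x28".b }
}

p invalid_string_paths(doc)
```

Running a check like this over the documents of the failing collection would show which fields to inspect before the import.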

dcu commented 9 years ago

Any update on this one?

Winslett commented 9 years ago

I had the same issue. I monkey-patched it to drop the invalid key/value pairs from the object. I replaced the mosql binary with the following script, which I call monkey-patched-mosql, and run the ETL process from it. The script overrides the MoSQL::Schema#transform method. It could be cleaned up by using super.

The ETL errors in my data were caused by binary values and larger-than-expected BSON documents.

    #!/usr/bin/env ruby

    require 'mosql/cli'

    module MoSQL
      class Schema
        # Same as the stock transform, plus a rescue clause that prunes
        # top-level keys that cannot be serialized and then retries.
        def transform(ns, obj, schema=nil, depth = 0)
          schema ||= find_ns!(ns)

          original = obj

          # Do a deep clone, because we're potentially going to be
          # mutating embedded objects.
          obj = BSON.deserialize(BSON.serialize(obj))

          row = []
          schema[:columns].each do |col|

            source = col[:source]
            type = col[:type]

            if source.start_with?("$")
              v = fetch_special_source(obj, source, original)
            else
              v = fetch_and_delete_dotted(obj, source)
              case v
              when Hash
                # Block parameters renamed so they do not shadow the outer v.
                v = JSON.dump(Hash[v.map { |key, val| [key, transform_primitive(val)] }])
              when Array
                v = v.map { |it| transform_primitive(it) }
                if col[:array_type]
                  v = Sequel.pg_array(v, col[:array_type])
                else
                  v = JSON.dump(v)
                end
              else
                v = transform_primitive(v, type)
              end
            end
            row << v
          end

          if schema[:meta][:extra_props]
            extra = sanitize(obj)
            row << JSON.dump(extra)
          end

          log.debug { "Transformed: #{row.inspect}" }

          row
        rescue BSON::InvalidStringEncoding, BSON::InvalidDocument
          # Keep only the top-level keys that survive a serialization
          # round trip on their own.
          obj = obj.select do |k,v|
            begin
              BSON.deserialize(BSON.serialize({k.to_s => v}))
              true
            rescue BSON::InvalidStringEncoding, BSON::InvalidDocument
              puts "Pruning #{k} from the hash."
              false
            end
          end

          # Give up after a few attempts instead of recursing forever.
          raise "tried and failed to prune with #{[ns, obj, schema]}" if depth > 2
          transform(ns, obj, schema, depth + 1)
        end
      end
    end

    MoSQL::CLI.run(ARGV)

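
The comment above notes that the patch could be cleaned up by using super. A minimal sketch of that idea follows, using `Module#prepend` so the override delegates to the original method instead of copying its body. `HypotheticalSchema`, `InvalidStringError`, and `PruneInvalidFields` are illustrative stand-ins, not MoSQL or BSON names, and the UTF-8 check is only an assumed approximation of the serializer's validation. A real patch would presumably prepend onto `MoSQL::Schema` and rescue the BSON exceptions instead.

```ruby
# Stand-in error and class: these only mimic the failure mode and are
# NOT part of MoSQL or the BSON gem.
class InvalidStringError < StandardError; end

class HypotheticalSchema
  # Raises if any top-level string value is not valid UTF-8,
  # otherwise returns the values as a row.
  def transform(ns, obj)
    obj.each do |key, value|
      next unless value.is_a?(String)
      unless value.dup.force_encoding(Encoding::UTF_8).valid_encoding?
        raise InvalidStringError, "#{ns}: #{key} is not valid UTF-8"
      end
    end
    obj.values
  end
end

module PruneInvalidFields
  def transform(ns, obj)
    super # delegate to the original implementation first
  rescue InvalidStringError
    # Drop the top-level string values that fail the UTF-8 check.
    pruned = obj.reject do |_key, value|
      value.is_a?(String) &&
        !value.dup.force_encoding(Encoding::UTF_8).valid_encoding?
    end
    # Nothing was pruned, so retrying cannot help: re-raise the error.
    raise if pruned.size == obj.size
    super(ns, pruned) # retry once with the pruned document
  end
end

HypotheticalSchema.prepend(PruneInvalidFields)

row = HypotheticalSchema.new.transform(
  "db.users", { "name" => "ok", "blob" => "\xFF".b }
)
p row
```

Because the module is prepended, the original method body stays untouched and the pruning logic lives in one small wrapper.
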
jtmarmon commented 9 years ago

+1. Does anyone know what would cause this? I checked the timestamp that it appears to be failing on, and I don't see any issues.

jtmarmon commented 9 years ago

Looks like there was a PR open to resolve this, #83, which broke tests.