matthewfranglen / postgres-elasticsearch-fdw

Postgres to Elastic Search Foreign Data Wrapper
MIT License

could it support chinese? #21

Open houzi1099857978 opened 3 years ago

houzi1099857978 commented 3 years ago

When I try to insert Chinese text, I get the following error:

postgres@linux-dtq3:/home/fdw/postgres-elasticsearch-fdw-master> 2021-03-05 05:37:13.358 EST [1705] ERROR:  INDEX for /test-index/doc/1 and document {'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01} failed: ({'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01}, UnicodeDecodeError('ascii', '"\xe6\xb5\x8b\xe8\xaf\x95"', 1, 2, 'ordinal not in range(128)'))
2021-03-05 05:37:13.358 EST [1705] STATEMENT:  insert into test_es(id,num,flag,istrue,createdate,price,price2,datetime,address,message,testbytea,testjson,time) 
    values(1,666,'111'::"bit",true,'2021-03-05',11.01,66.666,'2021-03-05 12:22:54','北京市丰台区','测试字段测试字段测试字段测试字段','测试'::bytea,'{}','11:27')
matthewfranglen commented 3 years ago

I've created this branch: https://github.com/matthewfranglen/postgres-elasticsearch-fdw/tree/chinese which has a test for inserting Chinese. It passes.

There seems to be a problem with your system configuration, as it is attempting to decode the string as ascii:

UnicodeDecodeError('ascii', '"\xe6\xb5\x8b\xe8\xaf\x95"', 1, 2, 'ordinal not in range(128)')

At the moment I don't have a solid idea of why your system is defaulting to decoding the string as ascii. I suggest looking at how the docker images differ from your system.
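The failure can be reproduced outside the FDW. A minimal sketch, assuming the bytea value arrives as the raw UTF-8 bytes shown in the error:

```python
# The bytea value arrives as raw UTF-8 bytes; decoding those bytes with the
# ascii codec reproduces the reported UnicodeDecodeError.
raw = b'\xe6\xb5\x8b\xe8\xaf\x95'  # UTF-8 encoding of '测试'

try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    print('decode failed:', err.reason)  # ordinal not in range(128)

print(raw.decode('utf-8'))  # decoding as UTF-8 succeeds: 测试
```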

matthewfranglen commented 3 years ago

I think you should look into your locale settings. See locale.getdefaultlocale, which may reveal the cause of your issue.

Inside the postgres docker container the $LANG is en_US.utf8.
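A sketch of how to check this on the affected machine, using the stdlib call mentioned above:

```python
import locale
import os

# Inspect the environment and the locale Python derives from it. If LANG /
# LC_ALL are unset or non-UTF-8, string handling can fall back to ascii.
print(os.environ.get('LANG'))     # e.g. en_US.utf8 in the postgres container
print(locale.getdefaultlocale())  # e.g. ('en_US', 'UTF-8')
```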

houzi1099857978 commented 3 years ago

I found that the problem is not caused by using Chinese; it is caused by using the bytea type. When I insert Chinese text into a bytea column, I convert the data to bytea with the following SQL: INSERT INTO test_es(id,testbytea) VALUES (1,'中文测试'::bytea); and that causes the problem.
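This fits the bytea explanation: text columns arrive as text, but a bytea column arrives as raw bytes, and raw bytes cannot go straight into a JSON document. A Python 3 sketch of the same failure mode (the Python 2 FDW fails earlier, at the implicit ascii decode):

```python
import json

# A row mixing an ordinary text column with a bytea column that arrived as bytes.
row = {'message': '测试字段', 'testbytea': '中文测试'.encode('utf-8')}

try:
    json.dumps(row)
except TypeError as err:
    # bytes values are not JSON serialisable; they need a decode or base64 step
    print('serialisation failed:', err)
```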

matthewfranglen commented 3 years ago

What's the table definition?

houzi1099857978 commented 3 years ago

What's the table definition?

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_es FOREIGN DATA WRAPPER multicorn
OPTIONS (
  wrapper 'pg_es_fdw.ElasticsearchFDW'
);

DROP FOREIGN TABLE IF EXISTS "public"."test_es";
CREATE FOREIGN TABLE test_es (
  "id" int8 NOT NULL,
  "num" int4,
  "flag" bit(3),
  "istrue" bool,
  "createdate" date,
  "price" numeric(5,2),
  "price2" float8,
  "datetime" timestamp(6),
  "address" varchar(255) COLLATE "pg_catalog"."default",
  "message" text COLLATE "pg_catalog"."default",
  "testbytea" bytea,
  "testjson" json,
  "time" time(6)
)
SERVER multicorn_es
OPTIONS(
   host '10.10.0.160',
   port '9200',
   index 'test_es',
   type 'doc',
   rowid_column 'id',
   query_column 'query',
   query_dsl 'false',
--   score_column 'message',
--   default_sort 'last_updated:desc',
--   sort_column 'id',
   refresh 'false',
   complete_returning 'false',
   timeout '20',
   username 'elastic',
   password 'changeme'
)
;
INSERT INTO "public"."test_es" VALUES (1, 666, '111', 't', '2021-03-05', 11.01, 66.666, '2021-03-05 12:22:54', 'asdfa', 'adsfaaaa', '中文测试'::bytea, '{}', '11:27:00');
matthewfranglen commented 3 years ago

The elasticsearch schema would be great if you have that as well.

houzi1099857978 commented 3 years ago

The Elasticsearch version is 6.8.6; the index was created automatically when I inserted data:

{
  "state": "open",
  "settings": {
    "index": {
      "creation_date": "1614943999235",
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "uuid": "30jZAEgwQLunvvH8A1FzGA",
      "version": {
        "created": "6080699"
      },
      "provided_name": "test_es"
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "address": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "flag": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "num": {
          "type": "long"
        },
        "createdate": {
          "type": "date"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "istrue": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "datetime": {
          "type": "date"
        },
        "testbytea": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "price666": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "testjson": {
          "type": "object"
        },
        "price": {
          "type": "float"
        },
        "time": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "price2": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "aliases": [],
  "primary_terms": {
    "0": 1,
    "1": 1,
    "2": 1,
    "3": 1,
    "4": 1
  },
  "in_sync_allocations": {
    "0": [
      "7UsFkimCRji0kviOgKk7gg",
      "srAnowRfQBSmlzSTGge6bA"
    ],
    "1": [
      "OhjG7VQ2Tm2l5VKU3c7IZw",
      "w_y52WfjTuGU-p7unXTQ6w"
    ],
    "2": [
      "rpfDNJheRjSdLpg-ZYkxMg",
      "YT1fVCEZT1mrdNyFrqnSbg"
    ],
    "3": [
      "8OCOll5qQjyyJ11pgyczzw",
      "Osh_2xoQRgm_zbhNArTTvQ"
    ],
    "4": [
      "hsy1xN8aRDqBNhex3kjqNA",
      "yzp3uT-OSlWPRFdvIhCbKA"
    ]
  }
}
matthewfranglen commented 3 years ago

That's great, thank you.

matthewfranglen commented 3 years ago

I wonder if the problem is this field definition:

"testbytea": {
  "type": "text",
  "fields": {
    "keyword": {
      "ignore_above": 256,
      "type": "keyword"
    }
  }
}

It looks like it should have a type of binary. The current FDW would not work with that either, as it needs to encode the data as base64 first. I think that even with that change the schema will need to be created manually rather than relying on automatic creation, because the automatically generated Elasticsearch definition would still be text.

I'll see what I can do about fixing this on the postgres side.
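A minimal sketch of the base64 step described above (encode_bytea is an illustrative name, not part of the FDW's API), assuming the Elasticsearch field is given type binary:

```python
import base64
import json

def encode_bytea(value):
    # Elasticsearch binary fields expect base64-encoded text, so raw bytea
    # bytes must be encoded before the document is serialised to JSON.
    return base64.b64encode(value).decode('ascii')

doc = {'id': 1, 'testbytea': encode_bytea('测试'.encode('utf-8'))}
print(json.dumps(doc))  # {"id": 1, "testbytea": "5rWL6K+V"}
```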

houzi1099857978 commented 3 years ago

I wonder if the problem is this field definition:

"testbytea": {
  "type": "text",
  "fields": {
    "keyword": {
      "ignore_above": 256,
      "type": "keyword"
    }
  }
}

It looks like it should have a type of binary. The current fdw would not work with it as it needs to encode the data as base64. I think that even with that change the schema will need to be created manually rather than using the automatic creation because the elasticsearch definition would still be text.

I'll see what I can do about fixing this on the postgres side.

I changed the ES index type to binary, but the problem still exists; maybe it is caused by the converter in the FDW.

matthewfranglen commented 3 years ago

The problem is in the PG FDW. Elasticsearch expects the data to be base64 encoded.

I need to add a data conversion layer for the more complex types, which I've been putting off. Doing that should also allow me to address the GeoField ticket. Sorry about the delay.
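One possible shape for such a conversion layer (a sketch only; the names and the type-dispatch mechanism are illustrative, not the FDW's actual design):

```python
import base64

# Map awkward Python value types to Elasticsearch-friendly representations.
CONVERTERS = {
    bytes: lambda value: base64.b64encode(value).decode('ascii'),  # bytea -> binary
}

def convert_row(row):
    # Apply a converter where one is registered; pass other values through.
    return {
        key: CONVERTERS.get(type(value), lambda v: v)(value)
        for key, value in row.items()
    }

print(convert_row({'id': 1, 'testbytea': b'\xe6\xb5\x8b\xe8\xaf\x95'}))
```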

houzi1099857978 commented 3 years ago

Thanks for the help. I found a way to solve the problem. When I start Python and print the default system encoding, it is ascii:

linux_dtq4:~ # python 
Python 2.7.13 (default, Jan 11 2017, 10:56:06) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.getdefaultencoding()
ascii

So I changed the system default encoding by creating a sitecustomize.py in Python's site-packages directory:

cd /usr/lib/python2.7/site-packages/
vi sitecustomize.py

# encoding=utf8 
import sys
reload(sys)
sys.setdefaultencoding('utf8')

After rebooting the system, it works!
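For reference, this workaround is specific to Python 2: sys.setdefaultencoding only reappears after reload(sys), and neither trick exists on Python 3, where the default encoding is already UTF-8:

```python
import sys

# On Python 3 the default encoding is always UTF-8, so neither reload(sys)
# nor sys.setdefaultencoding is available (or needed).
print(sys.getdefaultencoding())  # utf-8
print(hasattr(sys, 'setdefaultencoding'))  # False
```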