matthewfranglen / postgres-elasticsearch-fdw

Postgres to Elastic Search Foreign Data Wrapper
MIT License

Cast array or object property as json Postgres type #6

Closed: ahocquard closed this issue 4 years ago

ahocquard commented 4 years ago

Hello,

Is it possible to easily cast an array as json with the extension? Same for json objects. As far as I know, and from reading the Python code of the extension, it casts them as Python strings.

For example, the ES document:

        {
            "browser_name": "Chrome",
            "user_os_name": "Mac OS X",
            "email_domains": ["example.com", "example2.com"],
            "geoip": {
                "ip": "203.168.53.187",
                "latitude": -37.734,
                "location": {
                    "lat": -37.734,
                    "lon": 144.7378
                }
            }
        }

The foreign table definition (declaring geoip or email_domains as JSON does not work):

CREATE FOREIGN TABLE es_access_log
(
    id TEXT,
    query TEXT,
    geoip TEXT,
    browser_name TEXT,
    user_os_name TEXT,
    email_domains TEXT
)
SERVER multicorn_es
OPTIONS
(
    host 'elasticsearch',
    port '9200',
    index 'logstash-nginx',
    rowid_column 'id',
    query_column 'query',
    score_column 'score',
    timeout '20',
    scroll_size '1000',
    duration '30m'
)
;
select email_domains from es_access_log;

[u'example.com', u'example2.com']

select geoip from es_access_log;

"ip"=>"203.168.53.187", "latitude"=>-37.734, "location"=>"lat"=>-37.734, "lon"=>144.7378

What I would like is to declare these columns as json or jsonb so that they are cast automatically by the wrapper. Did I miss something? Is this supported?

matthewfranglen commented 4 years ago

Hi,

Thanks for opening this issue. This sounds like a good idea.

The code performs no explicit coercion of the data from the Elasticsearch query; however, as you have noticed, it gets turned into varying string representations (in your example, array to array and dictionary to hstore). When I looked this up in Multicorn I found an issue that confirms the behaviour you have experienced.

I think the best thing I can do quickly is to ensure that the string representation of the json is always returned, which can then be cast to json or jsonb. For your specific use case it may require a view over the foreign table to apply the cast (once I've implemented this change I can test whether that is needed).
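As a rough sketch of what that view could look like (assuming the change lands so these columns come back as well-formed json strings; the view name is made up and it reuses your table and column names):

-- Sketch only: a view over the foreign table that applies the cast.
CREATE VIEW es_access_log_json AS
SELECT
    id,
    query,
    geoip::jsonb AS geoip,
    email_domains::jsonb AS email_domains,
    browser_name,
    user_os_name
FROM es_access_log;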

What do you think?

matthewfranglen commented 4 years ago

I also wonder what behaviour would be appropriate when writing to such a column.

ahocquard commented 4 years ago

> I think the best thing I can do quickly is to ensure that the string representation of the json is always returned, which can then be cast to json or jsonb.

That would be very handy indeed.

In my case, if a jsonb type does not work natively, I can still do:

CREATE TABLE IF NOT EXISTS access_log AS SELECT *, email_domains::jsonb AS json_email_domains FROM es_access_log;

And if declaring my column as jsonb instead of text works automatically, great!
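The cast column can then be queried with the usual jsonb operators, for example (a quick sketch against the table created above):

-- Sketch: with a jsonb column, array membership can be tested directly.
SELECT id, browser_name
FROM access_log
WHERE json_email_domains ? 'example.com';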

> I also wonder what behaviour would be appropriate when writing to such a column.

A json representation would be best, no? Or did I misunderstand what you mean?

matthewfranglen commented 4 years ago

With regard to writes: it's likely that the data provided via the Multicorn library is always a string. That's something I need to test. The Elasticsearch instance would refuse anything that was not a fully formed json object consistent with the schema. Since my original use case was flat documents whose columns were simple values, this was not a problem. The concern is ensuring that a write to a structured column is converted appropriately.

With regard to the casting of the output column I would remind you that there is no caching going on here. You would be paying the overhead of conversion to jsonb every time. I would recommend testing the performance of the two types carefully. A materialized view might help if you need to operate over the blob.
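For example, something along these lines (a sketch only; the view name is made up, and it assumes the string columns hold well-formed json):

-- Sketch: materialise the cast so the conversion cost is paid once,
-- rather than on every read of the foreign table.
CREATE MATERIALIZED VIEW access_log_mat AS
SELECT id, browser_name, user_os_name,
       geoip::jsonb AS geoip,
       email_domains::jsonb AS email_domains
FROM es_access_log;

-- Refresh when the Elasticsearch data changes.
REFRESH MATERIALIZED VIEW access_log_mat;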

I should be able to work on this a bit tonight.

ahocquard commented 4 years ago

> With regard to the casting of the output column I would remind you that there is no caching going on here. You would be paying the overhead of conversion to jsonb every time. I would recommend testing the performance of the two types carefully. A materialized view might help if you need to operate over the blob.

Yes, the cast happens at runtime, which is a barrier to optimization in many cases. But in the previous example I added a new jsonb column to convert the text column on the fly when importing data from ES. Then, in Postgres, I can use the jsonb column instead of the text column. I only pay the cost of the cast at import time, which is very probably negligible.

Anyway, performance is not a big concern for my use case.

> I should be able to work on this a bit tonight.

Thanks for the quick response! If I can help in any way, let me know.

matthewfranglen commented 4 years ago

Oh, this is a really great idea. I've created a jsonb branch for you to test out properly. My initial evaluation shows that reading works with the json/jsonb type applied to the foreign table directly.
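In other words, something like this should work for reads (a sketch based on your earlier definition, assuming the jsonb branch):

-- Sketch against the jsonb branch: structured fields declared as jsonb directly.
CREATE FOREIGN TABLE es_access_log_jsonb
(
    id TEXT,
    query TEXT,
    geoip JSONB,
    browser_name TEXT,
    user_os_name TEXT,
    email_domains JSONB
)
SERVER multicorn_es
OPTIONS
(
    host 'elasticsearch',
    port '9200',
    index 'logstash-nginx',
    rowid_column 'id',
    query_column 'query'
);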

I've listed instructions to reproduce my test run on the last commit. It would be good if you could check the branch version and report any issues you find, especially since you want to use this in anger.

ahocquard commented 4 years ago

Hello,

I ran my test suite successfully against your branch! I had high hopes, as timestamp and inet types were already cast automatically. Also, it's now possible to generate valid json this way (the subquery is mandatory):

select row_to_json(access_log) from (select * from es_access_log) access_log;
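And with geoip declared as jsonb, nested values can be addressed directly, for example (a quick sketch):

-- Sketch: address a nested field through jsonb operators.
SELECT geoip -> 'location' ->> 'lat' AS latitude
FROM es_access_log;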

Thanks!

matthewfranglen commented 4 years ago

That's great :smile:

I need to add proper tests around this, implement the write behaviour, and document it all. After that I'll release a new version. I'm hoping to get that done over the Easter weekend.

ahocquard commented 4 years ago

Cool! Thanks a lot for this improvement :)

matthewfranglen commented 4 years ago

I've completed the test/documentation updates and released the new version.

If you have any further suggestions please let me know, otherwise I feel this issue can be closed.

ahocquard commented 4 years ago

Thanks for the new version (and for the support of Postgres 12). It works as expected!

matthewfranglen commented 4 years ago

Great stuff :smiley_cat: