matthewfranglen / postgres-elasticsearch-fdw

Postgres to Elastic Search Foreign Data Wrapper
MIT License
108 stars 32 forks source link

Could it support the `RETURNING` ? #13

Closed aileo closed 3 years ago

aileo commented 3 years ago

First, thank you for this awesome project ! Easy to set up and works very well.

I use RETURNING * almost every time I do an INSERT in postgres to retrieve the real inserted data (e.g: with default values). When using it on a foreign table using this foreign data wrapper, I get an empty row. Is it the expected behavior? If not, even if I am very new to FDW and python, I could try to fix it.

matthewfranglen commented 3 years ago

Hi,

I'm glad that you are able to use the fdw so easily :smile:

Looking at your suggestion and reviewing the elasticsearch index api (which is used to add documents to elasticsearch - here) it does not appear possible to return the full indexed document in the response when indexing documents. Instead you would have to collect the id (and optionally the version) from the response and then read the documents.

I would welcome any PR to add this support, and I am happy to help you write it. I think that such a PR must return the data if the RETURNING keyword is present and must not make additional requests if the RETURNING keyword is not present. It would be nice if you filtered the source by the requested fields, documented this behaviour in the README, added tests, and if it worked on all supported versions of elasticsearch (5-7).

If you look at this code you can see how to define a test. The test really just establishes that the query can execute without error - it does not check the returned value. While this is weak I have found this to be sufficient so far.

I hope this doesn't sound like too much work - remember I am happy to help.

aileo commented 3 years ago

I hope this doesn't sound like too much work

Sounds like a lot of work but fair regarding quality checks :)

I think that such a PR must return the data if the RETURNING keyword is present and must not make additional requests if the RETURNING keyword is not present.

I am not sure as I just started reading the doc but this part make me think it would not be possible that way. I did not find a way to know if the query have a RETURNING statement and which columns are involved.

Genuine question: could data indexed by Elasticsearch be different from the one from the INSERT query? If not, is it still relevant to query the document or could it just return the incoming values?

aileo commented 3 years ago

If not, is it still relevant to query the document or could it just return the incoming values?

Nevermind, it would not work for updates

matthewfranglen commented 3 years ago

So I'm looking into this a little and there does not seem to be support for default fields in elasticsearch. There is however the possibility to run scripts when updating (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts and https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html). The id field can also be auto generated. This means that the document values can differ from the data that was inserted, although I think that scripts cannot be invoked through the foreign data wrapper.

More broadly though elasticsearch is not guaranteed to make the changes visible to subsequent queries. The default behaviour is to write the document to the primary shard and then report success. At that point the primary shard is responsible for replicating the data to other shards. Should the primary shard fail at this point then that replication could fail and subsequent queries would not see the updated values. You can read a bit about this here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html

Finally we can use RETURNING to return values which are not explicitly defined in the insert. So for this to be a perfect solution it would require querying the elasticsearch cluster for the data, and that data cannot be guaranteed to be what was inserted (partly because of staleness issues, above, but also because it may've been updated subsequently). Even the imperfect solution of returning what was passed to the fdw suffers from these issues and may mislead the user.

As such I feel that this cannot be well implemented. :cry: I'm very interested in your thoughts on this matter.

For your specific needs perhaps you could use a CTE that queries after inserting?

matthewfranglen commented 3 years ago

I could see restricting RETURNING to the id of the inserted document(s) only being useful.

aileo commented 3 years ago

As such I feel that this cannot be well implemented. I'm very interested in your thoughts on this matter.

You exposed the situation very well, trying to get a consistent return without a performance drop looks like a dead end.

The best I can think of is, as you proposed, to return only the ID:

A wild question appears: Should an optional "version" column be added then?

Another approach would be "I dont care about performance", maybe by adding a parameter like accurate_return to the foreign table, it would:

As it would be per foreign table, it would be possible to reach the same index both ways depending on the use case.

For your specific needs perhaps you could use a CTE that queries after inserting?

For the context, I am using ES for ... search so I store one part of the same item in PG and the other part (searchable text) in ES. I try to do both operations in the same query (otherwise I would not need a FDW) I tried your proposal first and it did not work as it seems you can't query what you just inserted in the same query. My current solution use CTE and is not 100% reliable:

matthewfranglen commented 3 years ago

I think that making the id available is a good idea. I do not think that the version field has great value - it is not exposed via the fdw at the moment and no one has ever requested it.

For the parameter I think there is an argument to be made for three possible values:

To reduce confusion it would be good to name this option something similar to the elasticsearch parameter (refresh). Perhaps the three values could be no-refresh, refresh, and return.

For implementing the return it should be sufficient to invoke the existing read code, passing the list of ids to read.

aileo commented 3 years ago

I wonder if it would be better to have separated parameters:

For implementing the return it should be sufficient to invoke the existing read code, passing the list of ids to read.

I am not that far yet, I would like to, succesfully, run the tests first :smile:

matthewfranglen commented 3 years ago

https://github.com/matthewfranglen/postgres-elasticsearch-fdw/blob/master/Makefile#L38

poetry install
poetry run tests/run.py --pg 9.4 9.5 9.6 10 11 12 --es 5 6 6-auth 7

This is how you run the tests, and is what happens when you run make test. It tests the product of the postgres and elasticsearch versions so it does take quite a while to complete. I would recommend running with a single pair (e.g. pg-12 and es-7) while developing and then run the full tests when you think you have something good.

This uses poetry for dependency management: https://python-poetry.org/

matthewfranglen commented 3 years ago

With the merge of the associated PR I believe that this ticket is resolved. I'm going to run all the tests (again) and then release a new version.

matthewfranglen commented 3 years ago

Ah the two PRs you opened were merged to master. I use git flow layout for this project so I have reset all that to develop. If you choose to work on this project further you may have to take some care when pulling.

matthewfranglen commented 3 years ago

https://pypi.org/project/pg-es-fdw/ version 0.10.0 has been released. I've documented the new functionality under Refresh and RETURNING.

aileo commented 3 years ago

work like a charm

aileo commented 3 years ago

Note: returning on DELETE is not implemented. I do not need it but it is supported by Postgres.

matthewfranglen commented 3 years ago

Oh that's a really good point. I'm reopening this to remind me to address that!

The implementation of that would be slightly different as you can delete with id which does not return the documents, and you can't read the documents after deleting them.

aileo commented 3 years ago

I think it could look like this:

    def delete(self, document_id):
        """ Delete documents from Elastic Search """

        if self.complete_returning:
            document = self._read_by_id(document_id)

        try:
            response = self.client.delete(id=document_id, refresh=self.refresh, **self.arguments)
            if self.complete_returning:
                return document
            return {self.rowid_column: document_id}
        except Exception as exception:
            log2pg(
                "DELETE for {path}/{document_id} failed: {exception}".format(
                    path=self.path, document_id=document_id, exception=exception
                ),
                logging.ERROR,
            )
            return (0, 0)
matthewfranglen commented 3 years ago

Right think I have resolved this now. Released 0.10.1 to add RETURNING to DELETE in a way very similar to what you suggested. Once again I think that this ticket is resolved.

aileo commented 3 years ago

I would say the DELETE method still miss the refresh query parameter to match the behavior of the other two methods (document may be still searchable until index refresh).

Regarding the main subject of RETURNING, it can be considered closed.

matthewfranglen commented 3 years ago

I'm glad that you are paying attention! I've added support for that in 0.10.2. I would really appreciate another review, if you have time.

aileo commented 3 years ago

Looks good to me. Thanks !