Closed aileo closed 3 years ago
Hi,
I'm glad that you are able to use the fdw so easily :smile:
Looking at your suggestion and reviewing the elasticsearch index api (which is used to add documents to elasticsearch - here) it does not appear possible to return the full indexed document in the response when indexing documents. Instead you would have to collect the id (and optionally the version) from the response and then read the documents.
I would welcome any PR to add this support, and I am happy to help you write it. I think that such a PR must return the data if the RETURNING
keyword is present and must not make additional requests if the RETURNING
keyword is not present. It would be nice if you filtered the source by the requested fields, documented this behaviour in the README, added tests, and if it worked on all supported versions of elasticsearch (5-7).
If you look at this code you can see how to define a test. The test really just establishes that the query can execute without error - it does not check the returned value. While this is weak I have found this to be sufficient so far.
I hope this doesn't sound like too much work - remember I am happy to help.
I hope this doesn't sound like too much work
Sounds like a lot of work but fair regarding quality checks :)
I think that such a PR must return the data if the
RETURNING
keyword is present and must not make additional requests if theRETURNING
keyword is not present.
I am not sure as I just started reading the doc but this part make me think it would not be possible that way. I did not find a way to know if the query have a RETURNING
statement and which columns are involved.
Genuine question: could data indexed by Elasticsearch be different from the one from the INSERT
query? If not, is it still relevant to query the document or could it just return the incoming values?
If not, is it still relevant to query the document or could it just return the incoming values?
Nevermind, it would not work for updates
So I'm looking into this a little and there does not seem to be support for default fields in elasticsearch. There is however the possibility to run scripts when updating (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts and https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html). The id field can also be auto generated. This means that the document values can differ from the data that was inserted, although I think that scripts cannot be invoked through the foreign data wrapper.
More broadly though elasticsearch is not guaranteed to make the changes visible to subsequent queries. The default behaviour is to write the document to the primary shard and then report success. At that point the primary shard is responsible for replicating the data to other shards. Should the primary shard fail at this point then that replication could fail and subsequent queries would not see the updated values. You can read a bit about this here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
Finally we can use RETURNING
to return values which are not explicitly defined in the insert. So for this to be a perfect solution it would require querying the elasticsearch cluster for the data, and that data cannot be guaranteed to be what was inserted (partly because of staleness issues, above, but also because it may've been updated subsequently). Even the imperfect solution of returning what was passed to the fdw suffers from these issues and may mislead the user.
As such I feel that this cannot be well implemented. :cry: I'm very interested in your thoughts on this matter.
For your specific needs perhaps you could use a CTE that queries after inserting?
I could see restricting RETURNING
to the id of the inserted document(s) only being useful.
As such I feel that this cannot be well implemented. I'm very interested in your thoughts on this matter.
You exposed the situation very well, trying to get a consistent return without a performance drop looks like a dead end.
The best I can think of is, as you proposed, to return only the ID:
A wild question appears: Should an optional "version" column be added then?
Another approach would be "I dont care about performance", maybe by adding a parameter like accurate_return
to the foreign table, it would:
refresh=true
or refresh=wait_for
to the insert/update queryAs it would be per foreign table, it would be possible to reach the same index both ways depending on the use case.
For your specific needs perhaps you could use a CTE that queries after inserting?
For the context, I am using ES for ... search so I store one part of the same item in PG and the other part (searchable text) in ES. I try to do both operations in the same query (otherwise I would not need a FDW) I tried your proposal first and it did not work as it seems you can't query what you just inserted in the same query. My current solution use CTE and is not 100% reliable:
I think that making the id available is a good idea. I do not think that the version field has great value - it is not exposed via the fdw at the moment and no one has ever requested it.
For the parameter I think there is an argument to be made for three possible values:
To reduce confusion it would be good to name this option something similar to the elasticsearch parameter (refresh
). Perhaps the three values could be no-refresh
, refresh
, and return
.
For implementing the return it should be sufficient to invoke the existing read code, passing the list of ids to read.
I wonder if it would be better to have separated parameters:
refresh
(true
/ wait_for
/ false
): default to false
, just like ES does so there is no possible confusion.return
(id
/ document
): default to id
, define data to return, could be document
only when refresh
is true
or wait_for
. It could throw an error or fallback to id
when refresh
is not in compatible mode.For implementing the return it should be sufficient to invoke the existing read code, passing the list of ids to read.
I am not that far yet, I would like to, succesfully, run the tests first :smile:
https://github.com/matthewfranglen/postgres-elasticsearch-fdw/blob/master/Makefile#L38
poetry install
poetry run tests/run.py --pg 9.4 9.5 9.6 10 11 12 --es 5 6 6-auth 7
This is how you run the tests, and is what happens when you run make test
. It tests the product of the postgres and elasticsearch versions so it does take quite a while to complete. I would recommend running with a single pair (e.g. pg-12 and es-7) while developing and then run the full tests when you think you have something good.
This uses poetry
for dependency management: https://python-poetry.org/
With the merge of the associated PR I believe that this ticket is resolved. I'm going to run all the tests (again) and then release a new version.
Ah the two PRs you opened were merged to master. I use git flow layout for this project so I have reset all that to develop. If you choose to work on this project further you may have to take some care when pulling.
https://pypi.org/project/pg-es-fdw/ version 0.10.0 has been released. I've documented the new functionality under Refresh and RETURNING
.
work like a charm
Note: returning on DELETE is not implemented. I do not need it but it is supported by Postgres.
Oh that's a really good point. I'm reopening this to remind me to address that!
The implementation of that would be slightly different as you can delete with id which does not return the documents, and you can't read the documents after deleting them.
I think it could look like this:
def delete(self, document_id):
""" Delete documents from Elastic Search """
if self.complete_returning:
document = self._read_by_id(document_id)
try:
response = self.client.delete(id=document_id, refresh=self.refresh, **self.arguments)
if self.complete_returning:
return document
return {self.rowid_column: document_id}
except Exception as exception:
log2pg(
"DELETE for {path}/{document_id} failed: {exception}".format(
path=self.path, document_id=document_id, exception=exception
),
logging.ERROR,
)
return (0, 0)
self.complete_returning
(to match insert/update behavior)Right think I have resolved this now. Released 0.10.1 to add RETURNING to DELETE in a way very similar to what you suggested. Once again I think that this ticket is resolved.
I would say the DELETE method still miss the refresh
query parameter to match the behavior of the other two methods (document may be still searchable until index refresh).
Regarding the main subject of RETURNING
, it can be considered closed.
I'm glad that you are paying attention! I've added support for that in 0.10.2. I would really appreciate another review, if you have time.
Looks good to me. Thanks !
First, thank you for this awesome project ! Easy to set up and works very well.
I use
RETURNING *
almost every time I do anINSERT
in postgres to retrieve the real inserted data (e.g: with default values). When using it on a foreign table using this foreign data wrapper, I get an empty row. Is it the expected behavior? If not, even if I am very new to FDW and python, I could try to fix it.