palewire / django-postgres-copy

Quickly import and export delimited data with Django support for PostgreSQL's COPY command
https://palewi.re/docs/django-postgres-copy/
MIT License
178 stars 48 forks source link

optionally return PKs? #168

Open hottwaj opened 1 year ago

hottwaj commented 1 year ago

Would it be possible to provide an option to return PKs of inserted objects?

Given that INSERT is used to put data into the destination table after COPY to the temporary table, it looks like it should be possible via a "RETURNING id" (or whatever the PK field name is)?

Thanks! :)

geoffrey-eisenbarth commented 1 year ago

Here's a bit of code that will do that. I had considered adding this into my PR #156 but that PR seems to have stalled.

If there's an interest in adding this functionality, I don't mind submitting a separate PR, @palewire

First, need to subclass CopyMapping:

from postgres_copy import CopyMapping

class ResultsCopyMapping(CopyMapping):
  def insert_suffix(self) -> str:
    """Add `RETURNING` clause to get newly created/updated ids."""
    suffix = super().insert_suffix()
    suffix = suffix.split(';')[0] + ' RETURNING id;'
    return suffix

  def post_insert(self, cursor) -> None:
    """Extend to retrieve results from `RETURNING` clause."""
    self.obj_ids = [r[0] for r in cursor.fetchall()]

Unfortunately, I don't believe it's possible to implement this "ResultsCopyMapping" the "normal way" (e.g., by putting objects = CopyManager() on your model), so it has to be called explicitly by subclassing CopyQuerySet:

from postgres_copy import CopyQuerySet

class ResultsCopyQuerySet(CopyQuerySet):
  def from_csv(self, csv_path_or_obj, mapping=None, **kwargs):
    mapping = ResultsCopyMapping(self.model, csv_path_or_obj, mapping=None, **kwargs)
    count = mapping.save(silent=True)
    objs = self.model.objects.filter(id__in=mapping.obj_ids)
    return objs, count

Then, finally, on your model you do

class MyModel(models.Model):
  [...]
  objects = ResultsCopyQuerySet.as_manager()

This code is slightly different than the implementation I have that's working for me, so let me know if you have any issues with it!

hottwaj commented 1 year ago

thanks! That's really helpful :)

If an API is added for this, my vote would be for that to be able to return either i) raw IDs (i.e. avoid the SELECT in your code above) or ii) full ORM instances (as in your example)

Also I think option ii) could be made more efficient by selecting the last N rows in the table, where N is the number of rows inserted, rather than doing the selection by id__in as in your code? would need the SELECT to be within same transaction and ordering to be by PK for that to work...

geoffrey-eisenbarth commented 1 year ago

Definitely! In my case, just grabbing the last N rows in the table wouldn't work because I'm using code that allows "upsert," so it might not always be the case that the rows imported via CSV match the most recent rows in the db.

If you just want to get the ids, just change from_csv to

  def from_csv(self, csv_path_or_obj, mapping=None, **kwargs):
    mapping = ResultsCopyMapping(self.model, csv_path_or_obj, mapping=None, **kwargs)
    count = mapping.save(silent=True)
    return mapping.obj_ids
hottwaj commented 1 year ago

aha thanks I did not consider how the "last N" would interact with "upsert" :)

hope some progress on that PR can made too

palewire commented 1 year ago

What would the PR be? Adding a post_insert hook?

geoffrey-eisenbarth commented 1 year ago

What would the PR be? Adding a post_insert hook?

I'm not 100% on the best implementation API-wise. The necessary hooks are all already there (insert_suffix and post_insert), but I suppose we could add a kwarg to from_csv, something like returning , that would allow the user to specify count (default), ids, or objs (although I'm not sure if it's frowned upon to add kwargs that accept a limited number of expected string values).

palewire commented 1 year ago

Gotcha. If this is simply a common extra bit to slap into the hook, I don't think it needs to be integrated into the repo. Sharing the snippet seems like enough. I'd be happy to include the example in the docs as a cookbook style snippet.

geoffrey-eisenbarth commented 1 year ago

Yeah, I agree that's probably the best approach. I don't mind starting a PR on it.

What do you think about appending it to the "Extending with hooks" section?

hottwaj commented 1 year ago

I understand if you would prefer not to maintain this in the repo (e.g. due to time constraints, not being core functionality etc).

my opinion (which you are welcome to ignore :wink: ) is that as this functionality is provided by django out of the box it would make this repo stronger by doing the same instead of providing a harder-to-find code snippet containing SQL manipulation...

as I said just my opinion - obviously just trying to change your mind - and understand if you decide against. Thanks!

palewire commented 1 year ago

Can you point me to where Django implements it? If we can include something that 100% matches a public API in Django I would reconsider.

hottwaj commented 1 year ago

django's QuerySet.bulk_create does something similar to the snippet @geoffrey-eisenbarth first provided: it returns the inserted objects with their PK set to the value in the database. See https://docs.djangoproject.com/en/4.2/ref/models/querysets/#bulk-create

Actually the django docs are not very clear that the PK is set on the returned objects, but it is :)

Relevant django ticket for this is here: https://code.djangoproject.com/ticket/19527

Thanks!