palewire / django-postgres-copy

Quickly import and export delimited data with Django support for PostgreSQL's COPY command
https://palewi.re/docs/django-postgres-copy/
MIT License

interest in support for in-memory data? #25

Open jamesturk opened 8 years ago

jamesturk commented 8 years ago

I wanted to check whether a patch to support something like:

CopyMapping(model=Entry, data=huge_list_of_unsaved_entries)

would be a welcome addition.

I have some code I was considering making into a library that uses pg COPY to replace Django's bulk_create in the form of a function w/ the signature:

def postgres_bulk_copy(objects, ModelCls=None,
                       table_name=None,
                       fields=None, not_null=None, null=None,
                       ):

I was considering abstracting this into a generic library to use Postgres' COPY TO/FROM when I found this library & figured I'd check if you'd be open to expanding the scope for this use case.

palewire commented 8 years ago

Maybe! Just to make sure I understand, the idea is to insert data into an existing table managed by a Django model?

jamesturk commented 8 years ago

Yep, the way I have it working is that you instantiate a bunch of objects of the model type. It'd look something like:

entries = []
for data in data_from_some_source():
    # by instantiating Entry here we get a few things like proper defaults
    entries.append(Entry(**data))

# slow way
Entry.objects.bulk_create(entries)

# fast way
CopyMapping(model=Entry, data=entries).save()

& internally it'd essentially create a temporary CSV file w/ all of the data, then load it via the same COPY mechanism.

palewire commented 8 years ago

Let's do it!


virusdefender commented 7 years ago

Is there any progress?

denhartog commented 6 years ago

I thought about this when I first came across this project a few weeks ago. If I remember correctly, the in-memory object would need read() and readline() methods, based on psycopg2's copy_expert() method (see: https://github.com/psycopg/psycopg2/blob/6da3e7ee69971cd6cb692765a4d66a5ce405f104/psycopg/cursor_type.c#L1366).
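
For illustration, a minimal sketch of what that means in practice: copy_expert() accepts any file-like object, so an io.StringIO holding CSV text works for in-memory data. The connection string and the "example" table below are assumptions, not anything from this project.

from io import StringIO

import psycopg2

# DSN and table are illustrative assumptions
conn = psycopg2.connect("dbname=example")
buf = StringIO("1,alice\n2,bob\n")  # in-memory CSV; StringIO provides read()/readline()
with conn.cursor() as cur:
    cur.copy_expert("COPY example (id, name) FROM STDIN WITH CSV", buf)
conn.commit()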

jamesturk commented 6 years ago

I never got a chance to come back & contribute what I had, but here it is as a POC, in case someone else wants to pick this up.

import csv
from io import StringIO
from django.db import connection

def postgres_bulk_copy(objects, ModelCls=None,
                       table_name=None,
                       fields=None, not_null=None, null=None,
                       include_id=False,
                       ):
    _fields = []
    _not_null = []
    _null = []

    # if a model was passed in, use it to get defaults
    if ModelCls:
        for f in ModelCls._meta.get_fields():
            if (not f.auto_created and
                    not (f.is_relation and f.many_to_many) and
                    f.column):
                _fields.append(f.column)
                if f.null:
                    _null.append(f.column)
                else:
                    _not_null.append(f.column)

        # prefer passed in overrides if they exist
        table_name = table_name or ModelCls._meta.db_table
        fields = fields or _fields
        not_null = not_null or _not_null
        null = null or _null
        if include_id:
            fields.append('id')

        # note: getattr() by column name assumes attribute names match
        # column names, which holds unless a field sets db_column
        objects = [
            {k: getattr(o, k) for k in fields} for o in objects
        ]

    tmp = StringIO()
    w = csv.DictWriter(tmp, fieldnames=fields)
    w.writeheader()
    w.writerows(objects)

    # flush and seek to start
    tmp.flush()
    tmp.seek(0)

    cursor = connection.cursor()

    # FORCE_NULL requires the option-list COPY syntax (PostgreSQL 9.4+);
    # the legacy "WITH CSV HEADER" form only accepts FORCE NOT NULL
    options = ["FORMAT csv", "HEADER"]
    if null:
        options.append("FORCE_NULL ({})".format(", ".join(null)))
    if not_null:
        options.append("FORCE_NOT_NULL ({})".format(", ".join(not_null)))
    sql = "COPY {} ({}) FROM STDIN WITH ({})".format(
        table_name, ", ".join(fields), ", ".join(options)
    )

    cursor.copy_expert(sql, tmp)
    # need this commit here so lookups immediately after will work
    cursor.connection.commit()
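
For reference, a call to the POC might look like this; the Entry model and its headline field are illustrative assumptions, not part of the library.

# hypothetical model and data, for illustration only
entries = [Entry(headline="first"), Entry(headline="second")]
postgres_bulk_copy(entries, ModelCls=Entry)
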
palewire commented 6 years ago

Thanks for sharing your work, @jamesturk. If anybody wants to take a run at a pull request it would be greatly appreciated.

palewire commented 6 years ago

I think it's only partial progress towards the goal here, but I just merged #68, which adds support for submitting file objects in addition to file paths.
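
With that merged, something close to the original request should be possible by serializing unsaved objects into an in-memory buffer instead of a temporary file. A sketch, assuming a Person model and the mapping style shown in the project docs:

import csv
from io import StringIO

from postgres_copy import CopyMapping

# build an in-memory CSV instead of writing a temporary file
buf = StringIO()
writer = csv.writer(buf)
writer.writerow(["NAME", "NUMBER"])  # header row; column names are assumptions
writer.writerow(["ben", "1"])
buf.seek(0)

# Person and the field mapping are illustrative assumptions
CopyMapping(Person, buf, dict(name="NAME", number="NUMBER")).save()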