googleapis / python-spanner-django

Cloud Spanner database backend for Django
BSD 3-Clause "New" or "Revised" License
90 stars 28 forks source link

feat: interleaved tables support #618

Open vi3k6i5 opened 3 years ago

vi3k6i5 commented 3 years ago

(@c24t edit)

Spanner recommends using interleaved tables for parent-child relationships. Interleaving is a special case foreign key relationship that guarantees child and parent rows are co-located in Spanner. Django has no concept of interleaved tables.

Some investigation using the example from the cloud spanner docs:

CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  AlbumTitle   STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

CREATE TABLE Songs (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  TrackId      INT64 NOT NULL,
  SongName     STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId, TrackId),
  INTERLEAVE IN PARENT Albums ON DELETE CASCADE;

inspectdb gives us reasonable models, but not immediately usable:

$ python manage.py inspectdb
# This is an auto-generated Django model module.
# You'll have to do the following manually to clean this up:
#   * Rearrange models' order
#   * Make sure each model has one field with primary_key=True
#   * Make sure each ForeignKey has `on_delete` set to the desired behavior.
#   * Remove `managed = False` lines if you wish to allow Django to create, modify, and delete the table
# Feel free to rename the models, but don't rename db_table values or field names.
from django.db import models

class Albums(models.Model):
    singerid = models.IntegerField(db_column='SingerId', primary_key=True)  # Field name made lowercase.
    albumid = models.IntegerField(db_column='AlbumId')  # Field name made lowercase.
    albumtitle = models.TextField(db_column='AlbumTitle', blank=True, null=True)  # Field name made lowercase.

    class Meta:
        managed = False
        db_table = 'Albums'
        unique_together = (('singerid', 'albumid'), ('singerid', 'albumid'),)

class Singers(models.Model):
    singerid = models.IntegerField(db_column='SingerId', primary_key=True)  # Field name made lowercase.
    firstname = models.CharField(db_column='FirstName', max_length=1024, blank=True, null=True)  # Field name made lowercase.
    lastname = models.CharField(db_column='LastName', max_length=1024, blank=True, null=True)  # Field name made lowercase.
    singerinfo = models.BinaryField(db_column='SingerInfo', blank=True, null=True)  # Field name made lowercase.

    class Meta:
        managed = False
        db_table = 'Singers'

class Songs(models.Model):
    singerid = models.IntegerField(db_column='SingerId', primary_key=True)  # Field name made lowercase.
    albumid = models.IntegerField(db_column='AlbumId')  # Field name made lowercase.
    trackid = models.IntegerField(db_column='TrackId')  # Field name made lowercase.
    songname = models.TextField(db_column='SongName', blank=True, null=True)  # Field name made lowercase.

    class Meta:
        managed = False
        db_table = 'Songs'
        unique_together = (('singerid', 'albumid', 'trackid'), ('singerid', 'albumid', 'trackid'),)

Some cleanup: keep the unique_together constraints, lose Songs.singerid, add FKs from Albums to Singers and Songs to Albums:

from django.db import models

class Singers(models.Model):
    singerid = models.IntegerField(db_column='SingerId', primary_key=True)
    firstname = models.CharField(db_column='FirstName', max_length=1024, blank=True, null=True)
    lastname = models.CharField(db_column='LastName', max_length=1024, blank=True, null=True)
    singerinfo = models.BinaryField(db_column='SingerInfo', blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'Singers'

class Albums(models.Model):
    albumid = models.IntegerField(db_column='AlbumId', primary_key=True)
    singer = models.ForeignKey(Singers, db_column='SingerId', on_delete=models.CASCADE)
    albumtitle = models.TextField(db_column='AlbumTitle', blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'Albums'
        unique_together = (('singer', 'albumid'), ('singer', 'albumid'),)

class Songs(models.Model):
    trackid = models.IntegerField(db_column='TrackId', primary_key=True)
    album = models.ForeignKey(Albums, db_column='AlbumId', on_delete=models.CASCADE)
    songname = models.TextField(db_column='SongName', blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'Songs'
        unique_together = (('album', 'trackid'), ('album', 'trackid'),)

Using these models, writing a row to the interleaved child table fails:

from myapp import models

s1 = models.Singers(singerid=1, firstname="Pink", lastname="Floyd")
a11 = models.Albums(albumid=11, singer=s1, albumtitle="Wish You Were Here")
t111 = models.Songs(album=a11, trackid=111, songname="Have a Cigar")

s1.save()  # Works: <Singers: Singers object (1)>
a11.save()  # Fails: ProgrammingError: 400 Key column SingerId cannot be updated.

The error we get back from Spanner is INVALID_ARGUMENT, which gets ​surfaced as a ProgrammingError in the spanner python client. The offending gRPC call to google.spanner.v1.Spanner/ExecuteSql has args:

sql: "UPDATE Albums SET SingerId = @a0, AlbumTitle = @a1 WHERE Albums.AlbumId = @a2"
params: {
 ​fields: [
   ​{key: "a0" value {string_value: "1"}},
   ​{key: "a1" value {string_value: "Wish You Were Here"}},
   ​{key: "a2" value {string_value: "11"}},
]}
param_types: [
 ​{key: "a0" value {code: INT64}},
 ​{key: "a1" value {code: STRING}},
 ​{key: "a2" value {code: INT64}},
]

This makes sense: it shouldn't be possible to change SingerId after creating the album since this would break the parent-child relationship. But why are we doing an update in the first place?

When we save the model, django tries doing an update first, and then an insert if there was no existing row to update.

If we skip the call to _do_update in django, the following insert actually works as expected. The request is the same as above, but with a different sql arg:

sql: "INSERT INTO Albums (AlbumId, SingerId, AlbumTitle) VALUES (@a0, @a1, @a2)"`

This suggests we might be able to make interleaved tables behave by adding some special handling in SQLUpdateCompiler and SQLInsertCompiler, and adding something like an InterleavedForeignKey field so we can treat parent-child relationships differently than regular FKs in the compiler.

There may be a simpler approach, and there are almost definitely other complications that I haven't considered here, but I think it's good news that we can treat interleaved parents essentially as FKs and still use the basic django model machinery.

@michi88 let me know what you think, if you've got another use case in mind I'd be interested to hear it.

michi88 commented 3 years ago

Hi all!

Is there a plan on how we plan to support this?

c24t commented 3 years ago

Hey @michi88, thanks for taking a look. I updated the description with more info, the next step is to decide how we want to expose interleaved tables in this package's API and write some test cases. I suggested adding an InterleavedForeignKey field above but there may be better alternatives.

c24t commented 3 years ago

From a conversation with @michi88:

Another approach is to add a Meta.interleave_in_parent option, e.g. in the Songs model:

interleave_in_parent = (Albums, ('singerid', 'albumid', 'trackid'))

We could use this with or without a new InterleavedForeignKey field.

We need to decide whether to (a) have one parent ID field (as in the example above) or (b) separate fields for each ancestor. (a) makes for smaller models but requires some introspection for operations like inserts that require ancestors' IDs, (b) is more explicit but may mean doing some extra integrity checks.

ansh0l commented 2 years ago

Spanner Limitations added via PR lists some workaround for folks who are blocked on this. Keeping this issue open for tracking reasons.

ansh0l commented 2 years ago

Bumping the priority down to p2 since workaround exists. @asthamohta : Please take it forward and feel free to revert in case that's not the right direction.