piccolo-orm / piccolo

A fast, user friendly ORM and query builder which supports asyncio.
https://piccolo-orm.com/
MIT License
1.39k stars 88 forks source link

Piccolo doesn't handle duplicate entries automatically for in .add_m2m #707

Open Drapersniper opened 1 year ago

Drapersniper commented 1 year ago

Discussed in https://github.com/piccolo-orm/piccolo/discussions/683

Originally posted by **Drapersniper** November 21, 2022 ```py class Track(Table): track_id = BigInt(primary_key=True) class Playlist(Table): id = BigInt(primary_key=True) tracks = M2M(LazyTableReference("TrackToPlaylist", module_path=__module__)) class Album(Table): id = BigInt(primary_key=True) tracks = M2M(LazyTableReference("TrackToAlbum", module_path=__module__)) class TrackToPlaylist(Table): tracks = ForeignKey(Track) playlists= ForeignKey(Playlist) class TrackToAlbum(Table): tracks = ForeignKey(Track) albums= ForeignKey(Album) # Step 1 - I expect this to work fine if this is the first time the tracks with id 0-9 are being added to the db playlist = await Playlist.objects().get_or_create(Playlist.id == 1) tracks = [Track(track_id=i) for i in range(10)] await playlist.add_m2m(*tracks, m2m=Playlist.tracks) # Step 2 - After step 1 do the following playlist2 = await Playlist.objects().get_or_create(Playlist.id == 2) tracks2 = [Track(track_id=i) for i in range(5)] await playlist2 .add_m2m(*tracks2, m2m=Playlist.tracks) # Step 3 playlist3 = await Playlist.objects().get_or_create(Playlist.id == 3) tracks3 = [Track(track_id=i) for i in range(5)] tracks3 += [Track(track_id=i) for i in range(5)] await playlist3 .add_m2m(*tracks3, m2m=Playlist.tracks) # Step 4 - Run after step 1 album = await Album.objects().get_or_create(Album.id == 1) tracks4 = [Track(track_id=i) for i in range(5)] await album .add_m2m(*tracks4, m2m=Album.tracks) ``` What is expected to happen after steps 2, 3, and 4?

Currently it raises an exception

sinisaos commented 1 year ago

@Drapersniper @dantownsend I've been experimenting with preventing duplicate entries in m2m and I'm interested in your thoughts on this. I changed the _run() method to this and everything works fine and unittest pass.

async def _run(self):
    rows = self.rows
    unsaved = [i for i in rows if not i._exists_in_db]

    if unsaved:
        await self.rows[0].__class__.insert(*unsaved).run()

    joining_table = self.m2m._meta.resolved_joining_table

    joining_table_rows = []

    for row in rows:
        joining_table_row = joining_table(**self.extra_column_values)
        # select all rows can be problematic as the amount of data
        # in the joining_table increases
        unique_joining_table_rows = set(
            (
                i[self.m2m._meta.primary_foreign_key._meta.name],
                i[self.m2m._meta.secondary_foreign_key._meta.name],
            )
            for i in await joining_table.select(
                self.m2m._meta.primary_foreign_key,
                self.m2m._meta.secondary_foreign_key,
            )
        )
        setattr(
            joining_table_row,
            self.m2m._meta.primary_foreign_key._meta.name,
            getattr(
                self.target_row,
                self.target_row._meta.primary_key._meta.name,
            ),
        )
        setattr(
            joining_table_row,
            self.m2m._meta.secondary_foreign_key._meta.name,
            getattr(
                row,
                row._meta.primary_key._meta.name,
            ),
        )
        # the new joining_table row we want to add
        insert_row = (
            getattr(
                joining_table_row,
                self.m2m._meta.primary_foreign_key._meta.name,
            ),
            getattr(
                joining_table_row,
                self.m2m._meta.secondary_foreign_key._meta.name,
            ),
        )
        # if new insert_row is not already in joining_table we can add it,
        # otherwise skip adding to prevent duplicate rows in db
        if insert_row not in unique_joining_table_rows:
            joining_table_rows.append(joining_table_row)
        else:
            return
    return await joining_table.insert(*joining_table_rows).run()

I also think the docs should change the way we add rows to use the get_or_create method to prevent UNIQUE constraint error. Something like this

>>> band = await Band.objects().get(Band.name == "Pythonistas")
>>> genre = await Genre.objects().get_ot_create(Genre.name == "Punk rock")
>>> await band.add_m2m(
...    genre,
...     m2m=Band.genres
... )
[{'id': 1}]

Thanks.

dantownsend commented 1 year ago

@sinisaos Thanks for looking into it. I haven't had a chance yet to fully wrap my head around it, but hope to look into it soon.

dantownsend commented 1 year ago

Can we solve this by adding composite unique constraints to Piccolo. So on the joining table, we make the foreign keys unique together.

sinisaos commented 1 year ago

@dantownsend That might work and I tested it by adding something like this

unique_a = f"CREATE UNIQUE INDEX unique_a ON track_to_playlist(tracks, playlists)"
await TrackToPlaylist.raw(unique_a)

unique_b = "CREATE UNIQUE INDEX unique_b ON track_to_album(tracks, albums)"
await TrackToAlbum.raw(unique_b)

but if you run the script a second time, the error UNIQUE constraint failed appears.

Also if we want to add already existing tracks to the first playlist again like this

playlist = await Playlist.objects().get_or_create(Playlist.id == 1) # already exists
tracks6 = []
for i in range(5):
    t = await Track.objects().get_or_create(Track.track_id == i)
    tracks6.append(t) # already exists as tracks 0,1,2,3,4
await playlist.add_m2m(*tracks6, m2m=Playlist.tracks)

the same error appears.

I don't know if it is correct behavior, but this is not the case in Django (or the changes I made). We can run the script as many times as we want and if duplicates apears they are ignored and no error occurs. The biggest problem with my changes is that we scan the entire joining table looking for duplicates which could be a problem if the table gets large over time and it will affect performance. I hope I managed to explain what I mean.

Drapersniper commented 1 year ago

unsaved = [i for i in rows if not i._exists_in_db]

This would still cause it to error on insert for manually created rows without calling get_or_create right ?

Because the _exist_in_db would always be false for rows created manually.

If creating rows manually is not supported then that's fine - for my use case playlists and albums will have up to 10,000 tracks - manually calling get_or_create for each of these would be awfully slow and inefficient I usually have the whole track data before I add it to the tracks table, so I manually create them I.e Track(id=...) then I add them to the playlists using add_m2m.

I was expecting an on conflict update for my use case so if the track metadata changes if the same track gets added to a new playlist/album it updates the existing tracks table

sinisaos commented 1 year ago

@Drapersniper First, thank you for your reply.

unsaved = [i for i in rows if not i._exists_in_db]

This would still cause it to error on insert for manually created rows without calling get_or_create right ?

Because the _exist_in_db would always be false for rows created manually.

Yes, that is correct. Same as Django or Tortoise orm where we first need to save the record to the db so we can add to m2m.

I was expecting an on conflict update for my use case so if the track metadata changes if the same track gets added to a new playlist/album it updates the existing tracks table

I'm sorry if I don't understand the problem well, but if we use conflict update (for Sqlite something like INSERT OR IGNORE INTO track(track_id) VALUES ('new_value');) on the Track table, the number of rows in the join table (uses autoincrement serial primary key) will increase anyway, and already after step 2, the results will be wrong unless we somehow prevent the entry of duplicate records also in the join table. I'm not very familiar with Sqlalchemy and sorry if I'm wrong, but I think for your use case it would be the best solution because you can manually add records and nothing will be saved to the database until you call session.commit().

dantownsend commented 1 year ago

OK, we're closer to fixing this now we have the on_conflict clause.

Sorry for the delay on this.

We probably also need composite unique constraints, for both foreign keys in the M2M table.