Migrate gloss data model to support phrasal glosses

arrocke commented 8 months ago

Some languages require the ability to set a gloss on a group of words, what we will call a phrase. So we need to adapt our model to support multiple words on the same gloss.

The new db structure will be something like the diagram below.

erDiagram
    Language ||--o{ Gloss : has
    Word }o--|{ GlossPhraseWord : connects
    Gloss ||--|{ GlossPhraseWord : connects

Unfortunately, this means that all glosses will require an extra join. The Gloss table will need its own integer primary key, so that GlossPhrase can be a join table with a primary key that consists of a foreign key to the Word table and a foreign key to the Gloss table. Now we can associate multiple words with a single gloss.

We will also need to move footnotes and translator notes to point to the gloss table rather than the word. This will allow a phrasal gloss to share these notes regardless of which word is selected. The GlossHistory table will also need to point to the gloss table as a foreign key. We'll need to start recording the word ids that comprise the gloss. Since this is just historical data, we could use an array field to record the list of word for the gloss, that way we can reconstruct the history of which gloss phrase each word is part of.

The goal of this ticket is to introduce this structure into the database while preserving the current logic. We will add the API calls and UI elements for phrasal glosses in a different ticket.

Pertempto commented 8 months ago

Just to make sure I'm following...

The gloss phrases will consists of different combinations of glosses, depending on the language? In one language this sequence of three original language words will have three target language glosses, but in another language that sequence of three words might be one phrase with one one gloss?

That sounds tricky to implement in the UI, but I'm guessing you're working on the design for that?

Pertempto commented 8 months ago

This is also to clarify my understanding, I'm not critiquing your diagram, just trying to make sure I understand it clearly.

It's been a couple years since I worked with database diagrams... so I might be confused on a few of these. I'm using this chart for reference.

One gloss is associated with many languages
- Isn't this backwards? I thought a gloss represented a word in one particular language. So a language would have many glosses, but a single gloss is in only one language.
Many words are associated with many gloss phrases
- How is this working? If there's anything I learned from my database class - it's that many-to-many relationships are rare, hard to implement, and not usually what you want.
- Is this just illustrating that a particular word can be part of different gloss phrases, and that it really depends on the language to determine which gloss phrase a particular word is going to be associated with?

arrocke commented 8 months ago

My relationships in the diagram were backwards. This is fixed now. Sorry for the confusion. The goal is that a gloss in one language can be connected to multiple words, and those same words can be connected to multiple glosses in a different language.

A many to many relationship is required because a phrase consists of one or more words (many), and each word is going to be associated with glosses in one or more languages (also many).

One thing that might clear things up is if we add a Phrase table, and renamed the GlossPhraseWord table to just PhraseWord. That would communicate that we are building phrases from words in specific languages, and a phrase can have a gloss, footnotes, etc. When you query for glosses or notes for the language, you would first query for the phrases in that languages. That would look like this:

erDiagram
    Language ||--o{ Phrase : has
    Word }o--|{ PhraseWord : connects
    PhraseWord }|--|| Phrase : connects
    Phrase ||--o| Gloss : has
    Phrase ||--o| Footnote : has
    Phrase ||--o| TranslatorNote : has

Now that I'm looking at this diagram, I think it is much clearer what is going on. @Pertempto Let me know what you think.

Pertempto commented 8 months ago

Now that I'm looking at this diagram, I think it is much clearer what is going on

I agree 💯. I think having this Phrase table to join it all together is the way to go.

One thing I just thought of... how will this all join to the Verse table? Through the Word table as before? So to get all the glosses for a verse, we will need to do joins through Word, PhraseWord, and Phrase? That sounds reasonable as far as this chart goes, but will that cause performance issues?

arrocke commented 8 months ago

You don't actually need to join the Verse table, you can filter on words that have the verse id you are looking for. We rarely need to join the verse table because we reference everything by the verse id in this way.

I don't think performance will be an issue. It also shouldn't increase too much when we scale because there is a fixed limit for the number of rows connected to each language. If we index the gloss table on language id, then we will never exceed about 600,000 rows per table in these joins. Also, we can start with a normalized database and then optimize later if necessary, but there isn't a good reason to incur the development complexity of denormalizing if we don't need to

arrocke commented 8 months ago

We should take this in a few steps to follow Prisma's recommended migration plan.

Update the schema and migrate the database to track phrases from one or more words, and connect glosses and notes to phrases.
Update all of the write logic to handle single word phrases while retaining current functionality.
Migrate all existing data over
Confirm that we are writing the new structures correctly
Update all read logic to handle single word phrases
Update all write logic to stop writing word/language ids on glosses and notes
Remove word and language ids from glosses and notes, and make phraseIds required.
Start work on multi word phrases

globalbibletools / gbt

Migrate gloss data model to support phrasal glosses #376