standardebooks / web

The source code for the Standard Ebooks website.
https://standardebooks.org
Creative Commons Zero v1.0 Universal
234 stars 61 forks source link

Initial Ebook DB schema #346

Closed colagrosso closed 6 months ago

colagrosso commented 6 months ago

This is a pretty straightforward translation of the member variables of the PHP Ebook and related classes. I even tried to keep them in the same order. As we discussed, we'll need fields to represent when the ebook was released and its metadata was modified, and those are separate from when the DB record is created and updated. In the DB schema, I named them:

and I have a branch where I renamed the fields in Ebook.php to match.

Here are some sample records from each of the tables. Let me know if you'd like to see one from a specific book or just more in general:

https://gist.github.com/colagrosso/f8896c6af84886a70b2c0ec15c41108f

I'm open to changing the size or type of any field. There are some fields that won't fit into a varchar(255), so I went with varchar(511):

MariaDB [se]> SELECT EbookId, RepoFilesystemPath, Title, LENGTH(AdvancedEpubUrl) as len from Ebooks WHERE LENGTH(AdvancedEpubUrl) > 255 ORDER BY len;
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------+
| EbookId | RepoFilesystemPath                                                                                                                                                                                                                               | Title         | len  |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------+
|     283 | /standardebooks.org/ebooks/ivan-bunin_short-fiction_s-s-koteliansky_d-h-lawrence_leonard-woolf_bernard-guilbert-guerney_the-russian-review_natalie-a-duddington.git                                                                              | Short Fiction |  297 |
|     545 | /standardebooks.org/ebooks/vladimir-korolenko_short-fiction_aline-delano_sergius-stepniak_william-westall_thomas-seltzer_the-russian-review_marian-fell_clarence-manning.git                                                                     | Short Fiction |  315 |
|     843 | /standardebooks.org/ebooks/leonid-andreyev_short-fiction_herman-bernstein_alexandra-linden_l-a-magnus_k-walter_w-h-lowe_the-russian-review_archibald-j-wolfe_john-cournos_r-s-townsend_maurice-magnus.git                                        | Short Fiction |  373 |
|     439 | /standardebooks.org/ebooks/aleksandr-kuprin_short-fiction_s-koteliansky_j-m-murry_stephen-graham_rosa-savory-graham_leo-pasvolsky_douglas-ashby_the-living-age_b-guilbert-guerney_alexander-gagarine_malcolm-w-davis.git                         | Short Fiction |  403 |
|     102 | /standardebooks.org/ebooks/ovid_metamorphoses_john-dryden_joseph-addison_laurence-eusden_arthur-maynwaring_samuel-croxall_nahum-tate_william-stonestreet_thomas-vernon_john-gay_alexander-pope_stephen-harvey_william-congreve_et-al.git         | Metamorphoses |  435 |
|     667 | /standardebooks.org/ebooks/guy-de-maupassant_short-fiction_ernest-boyd_storm-jameson_jeffery-e-jeffery_lafcadio-hearn_m-walter-dunne_henry-c-olinger_albert-m-cohn-mcmaster_dora-knowlton-ranous_bigelow-brown-co-inc_francis-steegmuller.git    | Short Fiction |  445 |
|     693 | /standardebooks.org/ebooks/leo-tolstoy_short-fiction_louise-maude_aylmer-maude_nathan-haskell-dole_constance-garnett_j-d-duff_leo-weiner_r-s-townsend_hagberg-wright_benjamin-tucker_everymans-library_vladimir-chertkov_isabella-fyvie-mayo.git | Short Fiction |  451 |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------+
MariaDB [se]> SELECT EbookId, TocEntry, LENGTH(TocEntry) as len from TocEntries WHERE LENGTH(TocEntry) > 255 ORDER BY len;
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
| EbookId | TocEntry                                                                                                                                                                                                                                                                    | len  |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|     667 | : “This Story Will Show You That, if You Want to Save a Fellow Creature from Blows and Believe That It Is Better to Rescue a Cat Than a Man, You Will Excite the Anger of Your Neighbours. All Roads Lead to Rome—But Metempsychosis Leads to the Lunatic Asylum.”—         |  267 |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
1 row in set (0.005 sec)

These fields will fit in varchar(255), but were a little close for comfort, reaching ~240 for Short Fiction by Leo Tolstoy and others, so I bumped them up too:

MariaDB [se]> SELECT EbookId, RepoFilesystemPath, Title, LENGTH(Identifier) as len from Ebooks WHERE LENGTH(Identifier) > 200 ORDER BY len;
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----+
| EbookId | RepoFilesystemPath                                                                                                                                                                                                                               | Title         | len |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----+
|     843 | /standardebooks.org/ebooks/leonid-andreyev_short-fiction_herman-bernstein_alexandra-linden_l-a-magnus_k-walter_w-h-lowe_the-russian-review_archibald-j-wolfe_john-cournos_r-s-townsend_maurice-magnus.git                                        | Short Fiction | 208 |
|     439 | /standardebooks.org/ebooks/aleksandr-kuprin_short-fiction_s-koteliansky_j-m-murry_stephen-graham_rosa-savory-graham_leo-pasvolsky_douglas-ashby_the-living-age_b-guilbert-guerney_alexander-gagarine_malcolm-w-davis.git                         | Short Fiction | 223 |
|     102 | /standardebooks.org/ebooks/ovid_metamorphoses_john-dryden_joseph-addison_laurence-eusden_arthur-maynwaring_samuel-croxall_nahum-tate_william-stonestreet_thomas-vernon_john-gay_alexander-pope_stephen-harvey_william-congreve_et-al.git         | Metamorphoses | 239 |
|     667 | /standardebooks.org/ebooks/guy-de-maupassant_short-fiction_ernest-boyd_storm-jameson_jeffery-e-jeffery_lafcadio-hearn_m-walter-dunne_henry-c-olinger_albert-m-cohn-mcmaster_dora-knowlton-ranous_bigelow-brown-co-inc_francis-steegmuller.git    | Short Fiction | 244 |
|     693 | /standardebooks.org/ebooks/leo-tolstoy_short-fiction_louise-maude_aylmer-maude_nathan-haskell-dole_constance-garnett_j-d-duff_leo-weiner_r-s-townsend_hagberg-wright_benjamin-tucker_everymans-library_vladimir-chertkov_isabella-fyvie-mayo.git | Short Fiction | 247 |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----+

There's a GitCommit message that's close to the varchar(255) limit at 214. I considered bumping that field up, too, but didn't. The convention is that the first line should be 50 characters or less, so this one is way outside that. Let me know what you think.

I'm not quite ready to merge the PHP code or the script yet, but if it would help review the schema to see how it will be used in practice here they are in a separate branch:

Some notes on the code changes:

  1. When updating an Ebook, there is a lot of deleting and inserting related records, like we do with artwork tags, but with more tables.
  2. I added a UrlName member to the Collections class to match the DB schema here. When it's time, we'll lookup Collections by that UrlName, which matches how they are pulled out of the APCu.
  3. I had to change EbookSourceType in to a backed enum so that we have a type to write to the DB.

Even though the code and the script still need some work and won't be ready until a later PR, I'm confident that we can use the schema above with minimal annoying changes in the future.

colagrosso commented 6 months ago

Here is a list of the Ebook.php properties that we are not storing in the DB because they are derived from other properties:

Url
    from WwwFilesystemPath
HasDownloads
    from KindleCoverUrl...Azw3Url
UrlSafeIdentifier 
    from Identifier
HeroImageUrl
HeroImageAvifUrl
HeroImage2xUrl
HeroImage2xAvifUrl
CoverImageUrl
CoverImageAvifUrl
CoverImage2xUrl
CoverImage2xAvifUrl
    from UrlSafeIdentifier and the hash of the most recent Git commit
ReadingEaseDescription
    from ReadingEase
ReadingTime
    from WordCount
AuthorsHtml
    from Authors
AuthorsUrl
    from Identifier
ContributorsHtml
    from Contributors
TitleWithCreditsHtml
    from Title, Authors, and Contributors
TextUrl
    from Url
TextSinglePageUrl
    from Url
colagrosso commented 6 months ago

I think we're basically good to go, with the general caveat of preferring text column + fulltext index instead of varchar column when dealing with searchable fields.

Thank you for bringing up indexing and searching. We need more work here. I added the indexes you suggested, and I cooked up a query that would eventually replace what's in Library::FilterEbooks() and Ebook::Contains().

It didn't work the way I'd hoped. Because the fulltext indexes are in different tables, when I join and search across all the columns, the explain plan shows that the fulltext indexes aren't used:

MariaDB [se]> EXPLAIN SELECT DISTINCT e.* 
  from Ebooks e
    left join Contributors con using (EbookId)
    left join Collections col using (EbookId)
    left join EbookTags ebt using (EbookId)
    left join Tags t using (TagId)
    left join EbookLocSubjects els using (EbookId)
    left join LocSubjects ls using (LocSubjectId)
    left join TocEntries toc using (EbookId)
  where MATCH (e.Title, e.FullTitle, e.AlternateTitle, e.Description, e.LongDescription) AGAINST ('foo')
    OR (con.MarcRole = 'aut' AND MATCH(con.Name) AGAINST('foo'))
    OR MATCH(col.Name) AGAINST ('foo')
    OR MATCH(t.Name) AGAINST ('foo')
    OR MATCH(ls.Name) AGAINST ('foo')
    OR MATCH(toc.TocEntry) AGAINST ('foo');
+------+-------------+-------+--------+---------------+-----------+---------+---------------------+------+-----------------------+
| id   | select_type | table | type   | possible_keys | key       | key_len | ref                 | rows | Extra                 |
+------+-------------+-------+--------+---------------+-----------+---------+---------------------+------+-----------------------+
|    1 | SIMPLE      | e     | ALL    | NULL          | NULL      | NULL    | NULL                | 797  | Using temporary       |
|    1 | SIMPLE      | con   | ref    | index1        | index1    | 4       | se.e.EbookId        | 1    | Distinct              |
|    1 | SIMPLE      | col   | ref    | index1        | index1    | 4       | se.e.EbookId        | 1    | Distinct              |
|    1 | SIMPLE      | ebt   | ref    | idxUnique     | idxUnique | 4       | se.e.EbookId        | 1    | Using index; Distinct |
|    1 | SIMPLE      | t     | eq_ref | PRIMARY       | PRIMARY   | 4       | se.ebt.TagId        | 1    | Using where; Distinct |
|    1 | SIMPLE      | els   | ref    | idxUnique     | idxUnique | 4       | se.e.EbookId        | 1    | Using index; Distinct |
|    1 | SIMPLE      | ls    | eq_ref | PRIMARY       | PRIMARY   | 4       | se.els.LocSubjectId | 1    | Using where; Distinct |
|    1 | SIMPLE      | toc   | ref    | index1        | index1    | 4       | se.e.EbookId        | 1    | Using where; Distinct |
+------+-------------+-------+--------+---------------+-----------+---------+---------------------+------+-----------------------+
8 rows in set (0.001 sec)

Compare the above to a simpler query that does use the fulltext index from a single table:

MariaDB [se]> EXPLAIN SELECT e.*
  from Ebooks e
  where MATCH (e.Title, e.FullTitle, e.AlternateTitle, e.Description, e.LongDescription) AGAINST ('foo');
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
| id   | select_type | table | type     | possible_keys | key       | key_len | ref  | rows | Extra       |
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
|    1 | SIMPLE      | e     | fulltext | idxSearch     | idxSearch | 0       |      | 1    | Using where |
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
1 row in set (0.000 sec)

Even a single join causes the fulltext index not to be used:

MariaDB [se]> EXPLAIN SELECT e.*
  from Ebooks e
    left join Collections col using (EbookId)
  where MATCH(col.Name) AGAINST ('foo');
+------+-------------+-------+------+---------------+--------+---------+--------------+------+-------------+
| id   | select_type | table | type | possible_keys | key    | key_len | ref          | rows | Extra       |
+------+-------------+-------+------+---------------+--------+---------+--------------+------+-------------+
|    1 | SIMPLE      | e     | ALL  | NULL          | NULL   | NULL    | NULL         | 797  |             |
|    1 | SIMPLE      | col   | ref  | index1        | index1 | 4       | se.e.EbookId | 1    | Using where |
+------+-------------+-------+------+---------------+--------+---------+--------------+------+-------------+
2 rows in set (0.001 sec)

I think our options are:

  1. Add a new IndexableText column to Ebooks, like we considered for artworks. That is, dump another copy of title, alternate title, authors, tags, etc. into that column, and add a fulltext index just to that column. I'm leaning toward this approach. I'll start to prototype it so that you can see what it looks like and check for other gotchas.
  2. Stick with the joins, similar to above, but probably cleaned up a bit. Might as well remove the fulltext indexes because they aren't part of the query execution plan.
  3. Some other MySQL tricks I haven't thought of yet. Maybe 6 queries to each of the tables against the fulltext indexes?

We'll need a load test of the whole site when it's powered by the DB, but performance of the queries with all the joins and without the indexes seems fine. Similar to the artwork queries on my machine. Not sure yet how it will hold up to a future front page HN post.

I agree that we're close, but it's probably worth more time planning indexing and searching before committing the schema. I'll answer your other comments above inline.

colagrosso commented 6 months ago

I can revert this, but I pushed a change that's option 1 above for you to take a look at: Adding a new IndexableText column to Ebooks and removing the other fulltext indexes.

In a separate branch, here's the draft PHP code that populates IndexableText:

https://github.com/colagrosso/se-web/blob/bedd59d953b07771b3b1d1b459194b3a3ddc07fc/lib/Ebook.php#L81

I changed the Ebook class to extend Accessor, and the new GetIndexableText() reuses the logic from Contains(string $query), except I also added Description and LongDescription to the indexable text.

We'll need extra query logic for the sort orders, but the core search is straightforward because it's on one column and no joins:

MariaDB [se]> EXPLAIN SELECT e.* from Ebooks e where MATCH(e.IndexableText) AGAINST ('foo');
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
| id   | select_type | table | type     | possible_keys | key       | key_len | ref  | rows | Extra       |
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
|    1 | SIMPLE      | e     | fulltext | idxSearch     | idxSearch | 0       |      | 1    | Using where |
+------+-------------+-------+----------+---------------+-----------+---------+------+------+-------------+
acabal commented 6 months ago

Even a single join causes the fulltext index not to be used:

Correct, in MariaDB only one index can be used per table, so if we have a join then it's going to use the index that's convenient to join on.

I think the way to go is to create a summary column like you mention, IndexableText. That will also simplify the query greatly.

colagrosso commented 6 months ago

I think the way to go is to create a summary column like you mention, IndexableText. That will also simplify the query greatly.

Great. Done.

I think we're all set here to merge this PR.

You might as well wait and not add these tables to your production DB yet. In the next PR, I'll send the PHP code and the update script, and that might raise other improvements. I'd hate for you to waste time creating empty tables only to drop and recreate them again.

acabal commented 6 months ago

OK great. I'm merging this into a new db-rewrite branch, and we'll see it through there instead of merging into master while things are incomplete. I'm not going to do any live DB work until we're basically done. Future PRs can go against that branch.