jacobwb / hashover-next

This branch will be HashOver 2.0
GNU Affero General Public License v3.0
420 stars 87 forks source link

Can't create SQL table for page with a long URL #247

Open ghost opened 5 years ago

ghost commented 5 years ago

Problem is visible here: https://northcoastsynthesis.com/news/modular-synthesis-intro-part-9-other-filter-designs ; error message "HashOver: Failed to create "news-modular-synthesis-intro-part-9-other-filter-designs" table!" appears instead of the comment widget.

It appears that, in SQL mode, HashOver creates a table in the database for every comment thread, and that table's name is derived from the URL of the page, with no limit on its length. This is a problem because SQL databases do not allow tables with unlimited-length names. MariaDB, which I'm using, allows up to 64 characters. The quoted string in the error message is only 56, but from looking at the code I think HashOver actually adds "/metadata" (9 more characters) to that string to form the name of a table it wants to create, and that brings it to 65.

I don't like creating an entire table for every comment thread; when you have many similarly-structured pieces of data, kind of the point of a relational database is that you have one table for all of them. Database software is designed to work with a relatively small set of tables which can each grow to contain many rows; not a large and growing set of tables which each only contain one or a few rows. The thing that expands dynamically ought to be the number of rows per table, not the number of tables.

However, be that as it may, at the very least it seems the table names have to be kept to a limited size. I suggest one of three things, and would be willing to do some coding to help implement them:

  1. (best) Change the database schema to not create tables per thread. Advantage: this is how SQL databases are supposed to be used, and would likely prevent future scaling problems. Disadvantage: most work to implement.
  2. Limit the length of table names by detecting when one would be too long, and in such a case, construct a new "mangled" name out of a hash of the entire name, plus as many characters of the end of the original name, as will fit. Advantage: this means no change to the existing behaviour for short names, and preserves the information in a longer name to prevent collisions if the names should be identical on the characters that were retained. Disadvantage: names are no longer fully human-readable, and collisions though unlikely are still possible.
  3. (least good) Just use the last N characters of the URL (as many as the database can handle). Advantage: simplest to implement of the things likely to work at all, no change to existing behaviour for short names. Disadvantage: collisions much more likely, should two long URLs differ only in the cut-off characters.
jacobwb commented 5 years ago

I agree with most of what you say here, your diagnosis is mostly correct.

However, just for the sake of argument...

  1. The thing that expands dynamically is the rows. HashOver is most often used on blogs, where there may be a hundred or so pages (or tables) with each page containing a couple hundred or even a couple thousand comments (or rows). This means the majority of the database is rows.

  2. I know how databases are supposed to work. This approach was taken on purpose, to help non-technical users find comments from a specific page without having to execute any SQL statements. And it aided in testing as well, since to delete all test comments all I needed to do is delete the table, and only those comments are removed from the database, again without using a SQL statement.

    For this reason, the character limit is a much stronger argument against multiple tables than simply wanting everything in one table.

Those points are moot now, though, as I just pushed code that should fix this issue.

Of your listed solutions, solution 1 is what is essentially implemented now. And I would really appreciate you testing the new approach and offering any improvements or criticisms of the new code.

ghost commented 5 years ago

I should be so lucky as to have a couple hundred comments on a Web log entry. Most of mine never get comments at all. But whether it's average one comment per entry, ten comments, or one thousand, the point is the shape of the growth: number of entries keeps increasing steadily over time, whereas number of comments on one entry basically approaches a constant. Number of comments on one entry is not going to increase much if any, after whatever discussion occurs immediately after posting. A lot of Web log software automatically "closes" comments on old entries but it's not even necessary: apart from spam, old entries basically don't get comments. So even if a site had more comments per page than it had pages at one point in time (which I don't think has ever described my own site) it will eventually have more pages than it has comments per page, just because the number of pages grows without limit while the comments per page don't. The change in overall shape of the database only goes in one direction.

I hope I didn't give offense with my comment about how relational databases ought to be used. I understand there may have been important reasons for the design to have been done the way it was.

The new code looks good. It'll probably be a few days before I get the chance to test it, but I'll get back with my thoughts on it once I do.

ghost commented 5 years ago

Okay, I've had the chance to attempt using this now. Unfortunately, I haven't been able to get it to work. It's worse than before because now HashOver can't take comments in database mode at all, whereas before the multi-site feature, it was only misbehaving on pages with long URLs.

First, on database naming: when used with a backend other than SQLite, HashOver automatically prepends "hashover-" to the configured database name (lines 88-91 of database.php). So when my configured database name is "hashover", it tries to use a database named "hashover-hashover". I don't see a reason for it to do that and I commented out the lines in question. I could probably have worked around this point by actually naming my database "hashover-hashover" but that's only possible because I run my own SQL server; many people on hosting providers are stuck with provider-assigned SQL database names and do not have the opportunity to do this. It'd be better to just use the configured name.

With that change made, it's able to connect to the database, but fails on every page with a message saying it cannot create the "page-info" table. I believe this is because the table-creation code for the "page-info" table is shared with code that creates other tables and attempts to add a "PRIMARY KEY (domain, thread)" clause to every table creation statement. Since there are no columns named "domain" and "thread" in the page-info table, this fails. I had a hard time finding out what the schema for this (or any part of the database) is actually supposed to be, but my best guess is that the page-info table is supposed to contain two columns, both text, named "url" and "title" (from line 163 of hashover.php).

I tried manually creating a table named page-info with those columns, logging into the SQL server using hashover's user ID. Looking at the code again as I write this, I also note that there's a conditional on the "Allow page metadata to be updated from localhost" setting, which I had set to the default of false. But changing that to "true" seems to make no difference.

Once I had manually created the page-info table, HashOver no longer complains about the page-info table not existing, but it gives a "Failed to save metadata!" error on every page. My guess is that it's a similar issue with a table needing to exist that it can't create automatically for some reason, but without a schema I don't think I can go much further.

ghost commented 5 years ago

A further discovery is that the moderation admin page seems not to work at all when multisite is turned on. Since I can't really afford to have a non-functioning comment system on my main site, I've switched back to file-based storage, and if I do further testing on HashOver's SQL features I'll do that in a separate testing environment.

jacobwb commented 5 years ago

Thanks for testing the new code and for all the information about its behavior.

I believe the code is fixed now. To avoid repeating myself, please read the ea2658c commit message for more details about the cause of the issue. Yes, the code uses multiple abstractions, which may hide the syntax of the SQL statements and schema a bit, but is necessary to support multiple data formats.

The intended schema looks like this...

"comments" table

CREATE TABLE `comments` (
    `domain` TEXT,
    `thread` TEXT,
    `comment` TEXT,
    `body` TEXT,
    `status` TEXT,
    `date` TEXT,
    `name` TEXT,
    `password` TEXT,
    `login_id` TEXT,
    `email` TEXT,
    `encryption` TEXT,
    `email_hash` TEXT,
    `notifications` TEXT,
    `website` TEXT,
    `ipaddr` TEXT,
    `likes` INTEGER,
    `dislikes` INTEGER
)

"page-info" table

CREATE TABLE `page-info` (
    `domain` TEXT,
    `thread` TEXT,
    `url` TEXT,
    `title` TEXT
)
adnan360 commented 4 years ago

@jacobwb I was looking for a structure for creating mysql table and found this. I created a table and imported data from a wp install through this script.

After importing I couldn't edit any data in the table from phpmyadmin. I searched around and found that not having a primary key may have caused this. I added a primary auto increment column named comment_id at the beginning and edit and other options were available.

I also noticed that having a primary key is the standard. I am not sure if this structure is anywhere in the docs (as far as I remember it wasn't in the docs and I had to dig through issues to find the structure when I first did the setup). But would think having a primary key in the structure would be great. I am facing no issues with it having the column, so I intend to keep it. Let me know your thoughts.

Also, I think I should add one for the page-info table as well. *EDIT: Spelling