rr- / szurubooru

Image board engine, Danbooru-style.
GNU General Public License v3.0
726 stars 182 forks

Tag-post counting and storage counter causes slow page loading times #608

Open phil-flip opened 1 year ago

phil-flip commented 1 year ago

Someone posted this before, but I was not able to find the issue anymore: slow loading times with massive amounts of posts and tag suggestions. I now have ~600k posts in my szuru instance and finally decided to look into why it takes so long to serve pictures and individual posts. It turns out that the DB query for counting the posts of a tag takes ~7 seconds per post, and because it runs for every post on the posts tab, it takes ages to load one page. Because this statistic is useless to me, I made an edit in posts.py: instead of calling the counting function, I hard-coded "usages" to 0. The response times improved significantly. Please consider removing this statistic, as it is really bottlenecking the app. If it is that important, then caching those values or manually keeping track of them in the tags table would be reasonable solutions.
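The difference between the two approaches can be sketched with SQLite standing in for PostgreSQL (table and column names loosely mirror szurubooru's schema, but this is purely illustrative):

```python
import sqlite3

# Illustrative stand-in for the szurubooru schema (SQLite instead of
# PostgreSQL); only the columns needed for counting are modelled.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
    CREATE TABLE tag_usage (tag_id INTEGER PRIMARY KEY, usage INTEGER DEFAULT 0);
""")
db.executemany("INSERT INTO post_tag VALUES (?, ?)",
               [(1, 10), (2, 10), (3, 10), (3, 20)])

# The slow path: one aggregate scan per tag, repeated for every post on a page.
live_count = db.execute(
    "SELECT COUNT(*) FROM post_tag WHERE tag_id = ?", (10,)).fetchone()[0]

# The cached path: pre-compute once, then read a single row per tag.
db.execute("""
    INSERT INTO tag_usage
    SELECT tag_id, COUNT(post_id) FROM post_tag GROUP BY tag_id
""")
cached_count = db.execute(
    "SELECT usage FROM tag_usage WHERE tag_id = ?", (10,)).fetchone()[0]

print(live_count, cached_count)  # both 3
```

The cached read is a primary-key lookup instead of an aggregate scan, which is why maintaining the counts separately pays off at 600k posts.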

Another edit I made a while back, on an older system, was to disable the file counting, as it was causing the home page to be stuck for 30+ minutes. (This is an approximation, as I have better stuff to do than sit there and wait for it to finish, but I did check every 10 minutes or so. ^^') While the code has a caching function, it doesn't keep the value long enough, and it also loses it after a restart of the app. I personally don't need this anymore, because the new ZFS-based NAS does some magic there to provide the storage size faster, but it's still something that needs to be pointed out and addressed. This is also easy to edit out, though I did miss the statistic at the time and was happy to see it run fine when moving the installation.
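A cache that survives restarts could be as simple as persisting the value with a timestamp. The sketch below is hypothetical (the file path, TTL, and function names are made up, not szurubooru's actual code):

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical restart-surviving cache for an expensive statistic such as
# total storage size; path and TTL are illustrative assumptions.
CACHE_FILE = Path(tempfile.gettempdir()) / "szuru_disk_usage_cache.json"
TTL_SECONDS = 24 * 3600

def cached_disk_usage(compute):
    """Return the cached value if still fresh, otherwise recompute and persist."""
    if CACHE_FILE.exists():
        entry = json.loads(CACHE_FILE.read_text())
        if time.time() - entry["computed_at"] < TTL_SECONDS:
            return entry["value"]
    value = compute()  # the expensive directory walk goes here
    CACHE_FILE.write_text(
        json.dumps({"value": value, "computed_at": time.time()}))
    return value

CACHE_FILE.unlink(missing_ok=True)  # start fresh for this demo
first = cached_disk_usage(lambda: 123)   # cache miss: computes and persists
second = cached_disk_usage(lambda: 456)  # cache hit: serves the stored value
print(first, second)  # 123 123
```

Because the value lives on disk rather than in process memory, an app restart no longer triggers a full recount.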

phil-flip commented 1 year ago

Alright, I took some time to fix the database issue, and the best solution I came up with is a separate table that holds all the counts "pre-calculated". While my original approach was valid, I still had the problem that every so often some function needed the tag count, slowing the whole operation way down again. After a bit of research, five hours deep into this issue, I found this lovely article, which goes into detail on how PostgreSQL doesn't keep a pre-computed total row count for tables. Not quite what I needed, so I modified the script a little to count all the tags, write them into a newly created table, and create a DB function with triggers to keep it updated, so minimal code change was necessary in the backend (I don't Python).

CREATE TABLE tag_usage (
    tag_id int PRIMARY KEY,
    usage  int DEFAULT 0
);

-- Establish the initial counts
INSERT INTO tag_usage
SELECT post_tag.tag_id AS tag_id, count(post_tag.post_id) AS usage
FROM post_tag
GROUP BY post_tag.tag_id;

-- Function to keep the counts in sync when posts are tagged/untagged
CREATE OR REPLACE FUNCTION adjust_tag_usage()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE tag_usage SET usage = usage + 1 WHERE tag_id = NEW.tag_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE tag_usage SET usage = usage - 1 WHERE tag_id = OLD.tag_id;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER adjust_tag_usage BEFORE INSERT OR DELETE ON post_tag
    FOR EACH ROW EXECUTE PROCEDURE adjust_tag_usage();

-- Function to create and delete count rows when tags are created/deleted
CREATE OR REPLACE FUNCTION update_tag_usage()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO tag_usage (tag_id, usage) VALUES (NEW.id, 0);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM tag_usage WHERE tag_id = OLD.id;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_tag_usage BEFORE INSERT OR DELETE ON tag
    FOR EACH ROW EXECUTE PROCEDURE update_tag_usage();

My apologies for the jankiness of the DB setup, but it works. Alongside the DB changes, I also added the table to the DB model in the tag.py file:

class TagUsage(Base):
    __tablename__ = "tag_usage"

    tag_id = sa.Column(
        "tag_id",
        sa.Integer,
        nullable=False,
        primary_key=True,
    )
    usage = sa.Column(
        "usage",
        sa.Integer,
        default=0,
        nullable=True,
    )

    def __init__(self, tag_id: int, usage: int) -> None:
        self.tag_id = tag_id
        self.usage = usage

And of course I adjusted the count function further down in the same file:

    post_count = sa.orm.column_property(
        sa.sql.expression.select([TagUsage.usage])
        .where(TagUsage.tag_id == tag_id)
        .correlate_except(TagUsage)
    )

I would open up a PR, but I'm not experienced enough with Python and don't know how to create the migration, triggers, and DB functions with SQLAlchemy. I reverted my previous changes and now just mount the modified file in, and it works fine: szuru runs like a charm while also showing the tag counts.

A fair warning: I only just started using this modded version, so I don't know if it will cause issues with other functions I'm not aware of at the time of writing. So in case anyone has similar issues: handle this with care and a grain of salt. I will keep this issue open and updated until there is proper support.

phil-flip commented 12 months ago

So I just noticed an issue when merging tags: it doesn't combine the numbers, so I might need to revisit the implementation of the count thingy in the future.
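A plausible cause (an assumption, since the merge internals aren't shown here): if merging re-points post_tag rows with an UPDATE, the triggers above never fire, because they only cover INSERT and DELETE. The sketch below uses SQLite as a stand-in for PostgreSQL to show how also covering updates of tag_id would move the counts correctly:

```python
import sqlite3

# SQLite stand-in illustrating why an INSERT/DELETE-only trigger misses tag
# merges: re-pointing a post_tag row is an UPDATE, which fires neither.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
    CREATE TABLE tag_usage (tag_id INTEGER PRIMARY KEY, usage INTEGER DEFAULT 0);
    INSERT INTO post_tag VALUES (1, 10), (2, 10);
    INSERT INTO tag_usage VALUES (10, 2), (20, 0);

    -- Covering UPDATE OF tag_id moves the count from the old tag to the new one.
    CREATE TRIGGER tag_usage_on_retag AFTER UPDATE OF tag_id ON post_tag
    BEGIN
        UPDATE tag_usage SET usage = usage - 1 WHERE tag_id = OLD.tag_id;
        UPDATE tag_usage SET usage = usage + 1 WHERE tag_id = NEW.tag_id;
    END;
""")

# Simulate merging tag 10 into tag 20 by re-pointing its post_tag rows.
db.execute("UPDATE post_tag SET tag_id = 20 WHERE tag_id = 10")
counts = dict(db.execute("SELECT tag_id, usage FROM tag_usage"))
print(counts)  # {10: 0, 20: 2}
```

In PostgreSQL, the equivalent would be extending the trigger to `INSERT OR DELETE OR UPDATE` and handling `TG_OP = 'UPDATE'` by decrementing `OLD.tag_id` and incrementing `NEW.tag_id`.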

parallax4 commented 11 months ago

@neobooru What are your thoughts on this? Would be nice if something like this could be merged.

pelicans45 commented 5 months ago

@phil-flip @neobooru It would be good if we could get something like this implemented.

BloodyRain2k commented 4 months ago

Maybe I understand just too little about SQL/Postgres, but is there a special reason why a new table was created for the count, instead of just stuffing the "cache" into the tag's row in a new column?

phil-flip commented 4 months ago

I didn't want to mess with existing tables, in case a future migration breaks them. So I created a new one that the application isn't aware of and doesn't touch.

As I said earlier though, this is a massive hack, and the logic for it should be implemented in the API instead. I just don't Python. So while a few people in here wish to have it merged, I would rather see a maintainer implement it in the backend instead.

BloodyRain2k commented 4 months ago

it should be implemented into the API instead.

Tried that, because I don't understand triggers. And while my implementation is just as hacky, it's barely "better than nothing": it currently breaks tag sorting and, in turn, auto_complete. So I think your approach was good. I'll take a shot at doing that without the extra table next, but first I'll have to learn triggers...

Edit: I haven't even gotten to the triggers yet, because I wanted to get an "init" update query working first. The result is this:

UPDATE tag
SET post_count = sub.count
FROM (SELECT pt.tag_id AS id, COUNT(pt.tag_id) AS count FROM post_tag pt GROUP BY pt.tag_id) AS sub
WHERE tag.id = sub.id

Derp, wasn't that hard after all:

    select = (db.session.query(PostTag.tag_id, sa.func.count(PostTag.tag_id).label("count"))
              .group_by(PostTag.tag_id).subquery())
    logger.info("Select statement:\n%s", select)
    update = (sa.update(Tag)
              .values(post_count = select.c.count)
              .where(Tag.tag_id == select.c.tag_id))
    logger.info("Update statement:\n%s", update)
    updated = db.session.execute(update)
    logger.info("Updated %d tags", updated.rowcount)
    db.session.commit()

It also runs in less than 4 seconds, so I can leave it as a startup sync for now while I figure out the triggers.
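The startup-sync idea can be exercised end to end with SQLite standing in for PostgreSQL (schema simplified; a correlated subquery is used instead of `UPDATE ... FROM` so it runs on any version of either database):

```python
import sqlite3

# SQLite stand-in for the startup reconciliation: recompute every tag's
# post_count in one statement so any trigger drift is corrected on boot.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, post_count INTEGER DEFAULT 0);
    CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
    INSERT INTO tag (id, post_count) VALUES (10, 999), (20, 999);  -- stale counts
    INSERT INTO post_tag VALUES (1, 10), (2, 10), (2, 20);
""")

# Recompute each tag's count from the ground truth in post_tag.
db.execute("""
    UPDATE tag SET post_count =
        (SELECT COUNT(*) FROM post_tag WHERE post_tag.tag_id = tag.id)
""")
counts = dict(db.execute("SELECT id, post_count FROM tag"))
print(counts)  # {10: 2, 20: 1}
```

Running this once at startup bounds how stale the counts can get, even if the trigger logic later turns out to have gaps (such as the tag-merge case mentioned earlier in the thread).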