phil-flip opened this issue 1 year ago
Alright. I took some time to fix the database issue, and the best solution I came up with is a separate table that holds all the counts "pre-calculated". While my original approach was valid, I still had the problem that every so often a function needed the tag count, which slowed the whole operation way down again. So after a bit of research, five hours deep into this issue, I found this lovely article, which goes into detail on how PostgreSQL doesn't pre-compute the total row count of a table. Not quite what I needed, so I modified the script a little to count all the tags, write the counts into a newly created table, and create a DB function with triggers to keep it updated, so that minimal code change was necessary in the backend. (I don't Python.)
```sql
CREATE TABLE tag_usage (
    tag_id int PRIMARY KEY,
    usage int DEFAULT 0
);

-- Establish the initial counts. Counting via a LEFT JOIN from tag also seeds
-- a zero row for tags that have no posts yet, so the triggers below always
-- have a row to update.
INSERT INTO tag_usage (tag_id, usage)
SELECT tag.id AS tag_id, count(post_tag.post_id) AS usage
FROM tag
LEFT JOIN post_tag ON post_tag.tag_id = tag.id
GROUP BY tag.id;

-- Keep the counts in sync as posts are tagged and untagged.
CREATE OR REPLACE FUNCTION adjust_tag_usage()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE tag_usage SET usage = usage + 1 WHERE tag_id = NEW.tag_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE tag_usage SET usage = usage - 1 WHERE tag_id = OLD.tag_id;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER adjust_tag_usage BEFORE INSERT OR DELETE ON post_tag
FOR EACH ROW EXECUTE PROCEDURE adjust_tag_usage();
COMMIT;

-- Create and delete count rows as tags themselves come and go.
CREATE OR REPLACE FUNCTION update_tag_usage()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO tag_usage (tag_id, usage) VALUES (NEW.id, 0);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM tag_usage WHERE tag_id = OLD.id;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_tag_usage BEFORE INSERT OR DELETE ON tag
FOR EACH ROW EXECUTE PROCEDURE update_tag_usage();
COMMIT;
```
My apologies for the jankiness of the DB setup, but it works. (The function bodies use `$$` dollar quoting; if your SQL editor of choice insists on single-quoted bodies like mine did, you'll have to double up every quote inside them.)
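For anyone who wants to verify that the triggers are actually keeping up, a quick drift check can be run from Python. This is a minimal, untested sketch; it assumes a SQLAlchemy `session` object and the table names from the SQL above:

```python
import sqlalchemy as sa

# Compare each cached count against a live count over post_tag;
# any row returned here means the trigger-maintained value has drifted.
drift = session.execute(sa.text(
    """
    SELECT u.tag_id, u.usage, count(pt.post_id) AS live
    FROM tag_usage u
    LEFT JOIN post_tag pt ON pt.tag_id = u.tag_id
    GROUP BY u.tag_id, u.usage
    HAVING u.usage <> count(pt.post_id)
    """
)).fetchall()
print(f"{len(drift)} tag(s) out of sync")
```

An empty result means every cached count matches a live COUNT.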
Alongside the DB changes I also added the table to the DB model in the tag.py file:
```python
class TagUsage(Base):
    __tablename__ = "tag_usage"

    tag_id = sa.Column(
        "tag_id",
        sa.Integer,
        nullable=False,
        primary_key=True,
    )
    usage = sa.Column(
        "usage",
        sa.Integer,
        default=0,
        nullable=True,
    )

    def __init__(self, tag_id: int, usage: int) -> None:
        self.tag_id = tag_id
        self.usage = usage
```
And of course I adjusted the count function further down in the same file:
```python
# Read the pre-calculated count instead of counting post_tag rows per tag.
post_count = sa.orm.column_property(
    sa.sql.expression.select([TagUsage.usage])
    .where(TagUsage.tag_id == tag_id)
    .correlate_except(TagUsage)
)
```
I would open up a PR, but I'm not experienced enough with Python and don't know how to create a migration, triggers, and DB functions with SQLAlchemy. I reverted my previous changes and now just map the modified file in, and it works fine; szuru runs like a charm while also showing the tag counts.
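For whoever does know the backend: I imagine the migration part would look roughly like this. A hypothetical, untested sketch assuming szuru's Alembic migration setup; the revision ids are placeholders and the DDL is the same as in the SQL above:

```python
import sqlalchemy as sa
from alembic import op

revision = "add_tag_usage"  # placeholder revision id
down_revision = None        # would point at the latest szuru revision

def upgrade():
    op.create_table(
        "tag_usage",
        sa.Column("tag_id", sa.Integer, primary_key=True),
        sa.Column("usage", sa.Integer, server_default="0"),
    )
    # Seed the counts from existing post_tag rows.
    op.execute(
        "INSERT INTO tag_usage (tag_id, usage) "
        "SELECT tag.id, count(post_tag.post_id) FROM tag "
        "LEFT JOIN post_tag ON post_tag.tag_id = tag.id GROUP BY tag.id"
    )
    # Alembic has no first-class trigger API, so the function/trigger
    # pairs go in as raw DDL (same statements as above).
    op.execute("""
        CREATE OR REPLACE FUNCTION adjust_tag_usage() RETURNS TRIGGER AS $$
        BEGIN
            IF TG_OP = 'INSERT' THEN
                UPDATE tag_usage SET usage = usage + 1 WHERE tag_id = NEW.tag_id;
                RETURN NEW;
            ELSIF TG_OP = 'DELETE' THEN
                UPDATE tag_usage SET usage = usage - 1 WHERE tag_id = OLD.tag_id;
                RETURN OLD;
            END IF;
        END;
        $$ LANGUAGE plpgsql
    """)
    op.execute(
        "CREATE TRIGGER adjust_tag_usage BEFORE INSERT OR DELETE ON post_tag "
        "FOR EACH ROW EXECUTE PROCEDURE adjust_tag_usage()"
    )
    # update_tag_usage() on the tag table would be created the same way.

def downgrade():
    op.execute("DROP TRIGGER IF EXISTS adjust_tag_usage ON post_tag")
    op.execute("DROP FUNCTION IF EXISTS adjust_tag_usage()")
    op.drop_table("tag_usage")
```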
A fair warning: I only just started using this modded version, so I don't know if there will be issues with other functions I'm not aware of at the time of writing. So in case anyone has similar issues: handle this with care and a grain of salt. I will keep this open and updated until there is proper support.
So I just noticed an issue when merging tags: it doesn't combine the numbers… Presumably the merge re-points existing post_tag rows with an UPDATE, which neither the INSERT nor the DELETE trigger catches. So I might need to revisit the implementation of the count thingy in the future.
@neobooru What are your thoughts on this? Would be nice if something like this could be merged.
@phil-flip @neobooru It would be good if we could get something like this implemented.
Maybe I just understand too little about SQL/Postgres, but is there a special reason why a new table was created for the count instead of just stuffing the "cache" into the tag's row as a new column?
I didn't want to mess with existing tables in case a future migration breaks them, so I created a new one that the application isn't aware of and doesn't touch.
As I said earlier though, this is a massive hack, and the logic for it should be implemented into the API instead. I just don't Python. So while a few people in here would like to see it merged, I would rather see a maintainer implement it in the backend.
> it should be implemented into the API instead.
Tried that, because I don't understand triggers. And while my implementation is just as hacky, it's barely "better than nothing": it currently breaks tag sorting, and in turn auto_complete. So I think your approach was good. I'll take a shot at doing it without the extra table next, but first I'll have to learn triggers...
Edit: I haven't even gotten to the triggers yet, because I wanted to get an "init" update query working first. The result is this:
```sql
UPDATE tag
SET post_count = sub.count
FROM (
    SELECT pt.tag_id AS id, COUNT(pt.tag_id) AS count
    FROM post_tag pt
    GROUP BY pt.tag_id
) AS sub
WHERE tag.id = sub.id;
```
Derp, wasn't that hard after all:
```python
select = (
    db.session.query(PostTag.tag_id, sa.func.count(PostTag.tag_id).label("count"))
    .group_by(PostTag.tag_id)
    .subquery()
)
logger.info("Select statement:\n%s", select)

update = (
    sa.update(Tag)
    .values(post_count=select.c.count)
    .where(Tag.tag_id == select.c.tag_id)
)
logger.info("Update statement:\n%s", update)

updated = db.session.execute(update)
logger.info("Updated %d tags", updated.rowcount)
db.session.commit()
```
It also runs in less than 4 seconds, so I can leave it as a startup sync for now while I figure out the triggers.
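If the triggers end up being managed from Python, SQLAlchemy can attach the DDL to table creation. A rough, untested sketch, assuming the PostTag model (with an assumed import path) and the post_count column on tag; note this only fires for freshly created tables, so existing databases would still need the DDL applied by hand or via a migration:

```python
import sqlalchemy as sa
from szurubooru.model import PostTag  # assumed import path

# plpgsql function: bump tag.post_count whenever a post gains or loses a tag.
function_ddl = sa.DDL("""
    CREATE OR REPLACE FUNCTION adjust_tag_post_count() RETURNS TRIGGER AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE tag SET post_count = post_count + 1 WHERE id = NEW.tag_id;
            RETURN NEW;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE tag SET post_count = post_count - 1 WHERE id = OLD.tag_id;
            RETURN OLD;
        END IF;
    END;
    $$ LANGUAGE plpgsql
""")

trigger_ddl = sa.DDL(
    "CREATE TRIGGER adjust_tag_post_count BEFORE INSERT OR DELETE ON post_tag "
    "FOR EACH ROW EXECUTE PROCEDURE adjust_tag_post_count()"
)

# Run the DDL right after the post_tag table is created.
sa.event.listen(PostTag.__table__, "after_create", function_ddl)
sa.event.listen(PostTag.__table__, "after_create", trigger_ddl)
```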
Someone posted this before, but I was not able to find it anymore: slow loading times with massive amounts of posts and tag suggestions. I now have ~600k posts in my szuru instance and finally decided to look into why it takes so long to serve pictures and individual posts. It turns out the DB query for counting the posts of a tag takes ~7 seconds per post, and because it runs for every post on the Posts tab, it takes ages to load one page. Because this statistic is useless to me, I made an edit in posts.py (see the sketch below): instead of calling the counting function, I changed "usages" to 0. The response times improved significantly. Please consider removing this statistic, as it really is bottlenecking the app. If it is really that important, then caching those values or manually keeping track of them in the tags table would be reasonable solutions.

Another edit I made a while back, on an older system, was to disable the file counting, as it was causing the home page to be stuck for 30+ minutes. (This is an approximation, as I have better stuff to do than sit there and wait for it to finish, but I did check every 10 minutes or so. ^^') While the code has a caching function, it doesn't keep the value long enough, and the value is lost after a restart of the app. I personally don't need it anymore, because the new ZFS-based NAS does some magic to provide the storage size faster, but it's still something that needs to be pointed out and addressed. This is also editable, though I did miss the statistic at the time and was happy to see it run fine when moving the installation.
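To illustrate the posts.py edit mentioned above, a hypothetical sketch; the real serializer code in szuru looks different, and the function name here is made up:

```python
# Hypothetical stand-in for the tag serialization in posts.py. The point of
# the workaround is just that "usages" becomes a constant instead of a
# per-tag COUNT query over post_tag.
def serialize_tag_usages(tag) -> int:
    # was: a correlated COUNT over post_tag, ~7 s per tag on a large instance
    return 0
```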