scripting / feedlandInstall

Instructions for setting up a FeedLand server.
GNU General Public License v3.0

Query taking too long #31

Closed: scripting closed this issue 1 year ago

scripting commented 1 year ago

The other day, this query started taking 27 seconds to run on my account, which is longer than the timeout.

select * from items where flDeleted=false and feedurl in (select feedUrl from subscriptions where listName='davewiner') order by pubDate desc limit 175;

It just started happening, and at least one other user is seeing it. It has rendered FeedLand basically useless to me.

Now I'm in over my head as far as MySQL is concerned. Not sure how to debug this, or how to fix the problem. I'm going to grind at it, but if anyone has an idea how to approach this, help would be most welcome.
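For anyone who wants to poke at a query like this, a minimal sketch of how to start debugging with the query planner. It uses an in-memory SQLite database as a stand-in for MySQL (on the real server you'd run MySQL's EXPLAIN against the real tables); the schema here is a guess based only on the column names in the thread.

```python
import sqlite3

# SQLite stand-in for the MySQL tables. Column names come from the
# thread; everything else about the schema is an assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (feedUrl TEXT, pubDate TEXT, flDeleted INTEGER);
    CREATE TABLE subscriptions (feedUrl TEXT, listName TEXT);
""")

# The query plan shows whether the engine scans whole tables or can
# use an index -- a full scan of a large items table is the usual
# suspect when a query suddenly gets slow.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM items
    WHERE flDeleted = 0
      AND feedUrl IN (SELECT feedUrl FROM subscriptions
                      WHERE listName = 'davewiner')
    ORDER BY pubDate DESC LIMIT 175
""").fetchall()
for row in plan:
    print(row[-1])  # e.g. "SCAN items" indicates a full table scan
```

In MySQL the equivalent first step is simply prefixing the slow query with EXPLAIN and looking for full scans and missing keys in the output.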

scripting commented 1 year ago

I tried the part in parens, and it took 0.11 sec and returned 594 rows.

(select feedUrl from subscriptions where listName='davewiner')

scripting commented 1 year ago

Since this performance problem just started, I looked at the server, and found this. I don't know what this means, but it doesn't look very good.

(screenshot)

scripting commented 1 year ago

Is there a missing index?

https://github.com/scripting/feedland/blob/main/database/database.md

scotthansonde commented 1 year ago

@scripting I notice that the flDeleted field in the items table is not indexed. On my instance I tried adding an index for flDeleted:

ALTER TABLE `feedland`.`items` ADD INDEX (`flDeleted`);

That reduced the time of the query from 3.4 seconds to 0.006 seconds.
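For readers following along, a small illustration of what an index on flDeleted changes, again using SQLite as a stand-in for MySQL. Only the table and column names come from the thread; the index name matches the one proposed later in this issue, and the rest is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE items (feedUrl TEXT, pubDate TEXT, flDeleted INTEGER)"
)

query = "SELECT * FROM items WHERE flDeleted = 0"

# Before the index: the planner has no choice but a full table scan.
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

con.execute("CREATE INDEX itemDeleted ON items (flDeleted)")

# After the index: the planner can search the index instead.
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][-1])  # a SCAN of the whole table
print(after[0][-1])   # a SEARCH using the index
```

The same before/after comparison works in MySQL with EXPLAIN: the `key` column goes from NULL to the new index name once the index exists.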

scripting commented 1 year ago

@scotthansonde -- thank you.

It's so funny, you posted this fix just as I was announcing the open source release of The FeedLand Platform, so that we could get help with exactly this kind of thing. ;-)

I'll report on how it affects this issue.

scripting commented 1 year ago

@scotthansonde -- before doing anything on the feedland.org database, I want to see how this will be documented on the database.md page.

At the end we create three indexes:

create index itemFeed on items(feedUrl);
create index itemPubDate on items(pubDate);
create index subscriptionsListname on subscriptions(listname);

I propose to add:

create index itemDeleted on items(flDeleted);

If this is correct, you can just Like this comment.

iolaire commented 1 year ago

Hey Dave, a partial suggestion, since I'm most familiar with Microsoft SQL Server. Using an EXISTS clause instead of an IN clause is usually much faster. According to this post the two behave similarly in MySQL: https://www.aurigait.com/blog/mysql-in-vs-exist/. Therefore I believe if you change your code to something like this it should be much faster:

select * from items as i where flDeleted=false and EXISTS (select * from subscriptions as s where s.listName='davewiner' and i.feedUrl = s.feedUrl) order by pubDate desc limit 175;

iolaire commented 1 year ago

Also, in SQL Server we do something like this in the EXISTS clause, so it doesn't return all the results, it just stops once the first match is hit:

(select top 1 1 from subscriptions as s where s.listName='davewiner' and i.feedUrl = s.feedUrl)

So you might also try something like this, and use EXPLAIN to see if that improves things:

select * from items as i where flDeleted=false and EXISTS (select 1 from subscriptions as s where s.listName='davewiner' and i.feedUrl = s.feedUrl limit 1) order by pubDate desc limit 175;
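As a sanity check on the rewrite, here is a quick demonstration (in SQLite, as a stand-in for MySQL) that the EXISTS form returns the same rows as the original IN form. The sample data and schema are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (feedUrl TEXT, pubDate TEXT, flDeleted INTEGER);
    CREATE TABLE subscriptions (feedUrl TEXT, listName TEXT);
    INSERT INTO items VALUES
        ('http://a.example/rss', '2023-04-24', 0),
        ('http://b.example/rss', '2023-04-23', 0),
        ('http://c.example/rss', '2023-04-22', 1);
    INSERT INTO subscriptions VALUES
        ('http://a.example/rss', 'davewiner'),
        ('http://b.example/rss', 'davewiner');
""")

# The original IN form from the top of the thread.
in_rows = con.execute("""
    SELECT * FROM items
    WHERE flDeleted = 0
      AND feedUrl IN (SELECT feedUrl FROM subscriptions
                      WHERE listName = 'davewiner')
    ORDER BY pubDate DESC LIMIT 175
""").fetchall()

# The correlated EXISTS rewrite suggested above.
exists_rows = con.execute("""
    SELECT * FROM items AS i
    WHERE flDeleted = 0
      AND EXISTS (SELECT 1 FROM subscriptions AS s
                  WHERE s.listName = 'davewiner'
                    AND i.feedUrl = s.feedUrl)
    ORDER BY pubDate DESC LIMIT 175
""").fetchall()

print(in_rows == exists_rows)  # the two forms agree on the result set
```

Whether the EXISTS form is actually faster depends on the optimizer and indexes, which is why pairing the rewrite with EXPLAIN, as suggested, is the right move.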

scripting commented 1 year ago

@scotthansonde -- it didn't make the query faster. Actually, I had tried it without the flDeleted part of the query, as follows:

select * from items where feedurl in (select feedUrl from subscriptions where listName='davewiner') order by pubDate desc limit 175;

And it took 29 seconds.

So I'm getting different results here (and I should have read your initial comment more carefully).

scripting commented 1 year ago

I brought the problem to ChatGPT, because that's how we do it these days, with some interesting results.

http://scripting.com/publicfolder/chatgpt/2023/04/24/mysql-exists-statement.html

I thought this bit was interesting:

Add an index: You can add an index on the feedurl column in the items table and the feedUrl and listName columns in the subscriptions table. This can speed up the search and join operations in the query.

There already is an index on feedurl in the items table, and an index on listName in subscriptions, but there is no index on feedurl in subscriptions.

create index subscriptionsFeedurl on subscriptions(feedUrl);

Result: No difference in performance.

scripting commented 1 year ago

Here's an interesting piece of data.

Somehow I lost my categories in the Settings command, and that's why I was getting the huge timeout.

Once I reinstated my categories, things went back to performing as before.

https://feedland.org/?river=true&screenname=davewiner&catname=all

So this is only affecting people who have no categories, i.e. newbies.

iolaire commented 1 year ago

I've wanted to suggest that doing all those joins on text columns is way less efficient than joining on an integer-based AUTO_INCREMENT ID column from each table. So here is the ChatGPT explanation:

Q: In MySQL, is it faster to join on an integer column or a text column?

In general, joining on an integer column will be faster than joining on a text column in MySQL. This is because integer comparisons are simpler and faster for the database to perform than text comparisons.

When you join on an integer column, the database can use a simple numeric comparison to match the values in the join. This comparison is straightforward and efficient, allowing the database to quickly identify the matching rows.

On the other hand, when you join on a text column, the database needs to perform a string comparison. String comparisons are more complex than integer comparisons, and they can be slower to execute. Additionally, if the text column is indexed, the index size will be larger and more memory-intensive than an index on an integer column, which can also slow down the join performance.

That being said, there may be cases where joining on a text column is necessary or more appropriate. For example, if the values in the column are not easily represented as integers (e.g. names, addresses, or other non-numeric identifiers), joining on a text column may be necessary. Additionally, in some cases, using a text column for a join may allow for more flexibility in how the join is performed.

scripting commented 1 year ago

I'm going to take a fresh look at this tomorrow. To prepare for it, I went through the problem with ChatGPT.

http://scripting.com/publicfolder/chatgpt/2023/04/24/mysql-database-assistance.html

mterenzio commented 1 year ago

There are a number of long-term changes to the modeling that I think would improve performance and ease of maintenance. Some have been mentioned, like indexes and serial columns, but also unique constraints and foreign keys; these would likely require code changes and query rewrites as well.

That said, in my experience with queries that suddenly start performing poorly, a surprising amount of time it also reflects not being able to fit the data set into memory. In other words, the table outgrew the available memory and is needing to go to disk.

Without more information, I can't say if that's the case, but thought it was worth a quick try, especially if you are on a cloud based system where adding RAM is just a few clicks.

Or before doing that, while the query is running, run the free command and note the memory and swap usage. I believe I recall you are on Linux.

scripting commented 1 year ago

Thanks @mterenzio -- yes this is running on Ubuntu, hosted at Digital Ocean.

scripting commented 1 year ago

I'm getting a late start today, but I'm determined to fix this problem now because I have a juicy project I can't wait to get back to.

So here's the query --

select * from items where flDeleted=false and feedurl in (select feedUrl from subscriptions where listName='mrjones') order by pubDate desc limit 175;

That's the query we do for users who have no categories in the Settings dialog. Here's the screen shot to really bring that home.

As soon as you add categories, the problem goes away.

So the problem might have been happening to newbies, who have no categories to start off with, and we just weren't hearing about it because all the users we're in contact with have categories.

I only saw this behavior myself when, due to a glitch, I lost my categories (a completely separate issue).

So the big question is, what's the difference in the query we use when getting the new items for a user with categories?

To be clear, we're only retrieving the category that's selected. You can see the delay if you click on a tab.

scripting commented 1 year ago

This query runs fast. It's what we run when we're displaying a category in a tabbed display. I'm going to study this and see if it's possible to use the same approach for category-less users.

select * from items where flDeleted=false and feedurl in ('http://nytimes.com/timeswire/feeds/','http://rss.nytimes.com/services/xml/rss/nyt/Baseball.xml','http://rss.nytimes.com/services/xml/rss/nyt/ProBasketball.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/bret-stephens/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/charles-m-blow/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/david-brooks/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/ezra-klein/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/farhad-manjoo/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/frank-bruni/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/gail-collins/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/jamelle-bouie/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/maureen-dowd/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/michelle-goldberg/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/nicholas-kristof/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/paul-krugman/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/ross-douthat/rss.xml','http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/thomas-l-friedman/rss.xml','https://donaldgmcneiljr1954.medium.com/feed','https://feeds.simplecast.com/54nAGcIl','https://rss.nytimes.com/services/xml/rss/nyt/Africa.xml','https://rss.nytimes.com/services/xml/rss/nyt/Americas.xml','https://rss.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml','https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml','https://rss.nytimes.com/services/xml/rss/nyt/AsiaPacific.xml','https://rss.nytimes.com/services/xml/rss/nyt/Automobiles.xml','https://rss.nytimes.com/services/xml/rss/nyt/Baseball.xml','https://rss.nytimes.com/services/xml/rss/nyt/Books.xml','https://rss.nytimes.com/services/xml/rss/nyt/Business.xml','https://rss.nytimes.com/services/xml/rss/nyt/Climate.xml','https://rss.nytimes.com/services/xml/rss/nyt/CollegeBasketball.xml','https://rss.nytimes.com/services/xml/rss/nyt/CollegeFootball.xml','https://rss.nytimes.com/services/xml/rss/nyt/Dance.xml','https://rss.nytimes.com/services/xml/rss/nyt/Dealbook.xml','https://rss.nytimes.com/services/xml/rss/nyt/DiningandWine.xml','https://rss.nytimes.com/services/xml/rss/nyt/EnergyEnvironment.xml','https://rss.nytimes.com/services/xml/rss/nyt/Europe.xml','https://rss.nytimes.com/services/xml/rss/nyt/FashionandStyle.xml','https://rss.nytimes.com/services/xml/rss/nyt/Golf.xml','https://rss.nytimes.com/services/xml/rss/nyt/Hockey.xml','https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml','https://rss.nytimes.com/services/xml/rss/nyt/Jobs.xml','https://rss.nytimes.com/services/xml/rss/nyt/Lens.xml','https://rss.nytimes.com/services/xml/rss/nyt/MediaandAdvertising.xml','https://rss.nytimes.com/services/xml/rss/nyt/MiddleEast.xml','https://rss.nytimes.com/services/xml/rss/nyt/MostEmailed.xml','https://rss.nytimes.com/services/xml/rss/nyt/Movies.xml','https://rss.nytimes.com/services/xml/rss/nyt/Music.xml','https://rss.nytimes.com/services/xml/rss/nyt/NYRegion.xml','https://rss.nytimes.com/services/xml/rss/nyt/Obituaries.xml','https://rss.nytimes.com/services/xml/rss/nyt/PersonalTech.xml','https://rss.nytimes.com/services/xml/rss/nyt/ProBasketball.xml','https://rss.nytimes.com/services/xml/rss/nyt/ProFootball.xml','https://rss.nytimes.com/services/xml/rss/nyt/RealEstate.xml','https://rss.nytimes.com/services/xml/rss/nyt/Science.xml','https://rss.nytimes.com/services/xml/rss/nyt/SmallBusiness.xml','https://rss.nytimes.com/services/xml/rss/nyt/Soccer.xml','https://rss.nytimes.com/services/xml/rss/nyt/Space.xml','https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml','https://rss.nytimes.com/services/xml/rss/nyt/sunday-review.xml','https://rss.nytimes.com/services/xml/rss/nyt/SundayBookReview.xml','https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml','https://rss.nytimes.com/services/xml/rss/nyt/Television.xml','https://rss.nytimes.com/services/xml/rss/nyt/Tennis.xml','https://rss.nytimes.com/services/xml/rss/nyt/Theater.xml','https://rss.nytimes.com/services/xml/rss/nyt/tmagazine.xml','https://rss.nytimes.com/services/xml/rss/nyt/Upshot.xml','https://rss.nytimes.com/services/xml/rss/nyt/US.xml','https://rss.nytimes.com/services/xml/rss/nyt/Well.xml','https://rss.nytimes.com/services/xml/rss/nyt/World.xml','https://rss.nytimes.com/services/xml/rss/nyt/YourMoney.xml','https://www.nytimes.com/services/xml/rss/nyt/Health.xml','https://www.nytimes.com/services/xml/rss/nyt/Travel.xml','https://www.nytimes.com/services/xml/rss/nyt/Weddings.xml') order by pubDate desc limit 175;

mterenzio commented 1 year ago

I bet that would be your best bet for now: splitting it up into two queries, first getting the results from the subquery (in the parentheses), then adding the result as an array or comma-separated list of strings to the other part of the query. It may be an issue if the list gets really, really big, so you may need to add a limit of some sort to the first query.
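A sketch of that two-query approach: fetch the subscribed feed urls first, then pass them as a parameterized IN list to the items query. SQLite stands in for MySQL here, and the schema and data are illustrative, not FeedLand's actual code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (feedUrl TEXT, pubDate TEXT, flDeleted INTEGER);
    CREATE TABLE subscriptions (feedUrl TEXT, listName TEXT);
    INSERT INTO items VALUES
        ('http://a.example/rss', '2023-04-24', 0),
        ('http://b.example/rss', '2023-04-23', 0);
    INSERT INTO subscriptions VALUES
        ('http://a.example/rss', 'davewiner'),
        ('http://b.example/rss', 'davewiner');
""")

# Query 1: the cheap subquery, run on its own.
urls = [r[0] for r in con.execute(
    "SELECT feedUrl FROM subscriptions WHERE listName = ?",
    ("davewiner",))]

# Query 2: an explicit IN list, built with placeholders rather than
# by concatenating the strings themselves, to avoid SQL injection.
placeholders = ", ".join("?" for _ in urls)
rows = con.execute(
    "SELECT * FROM items WHERE flDeleted = 0 "
    f"AND feedUrl IN ({placeholders}) "
    "ORDER BY pubDate DESC LIMIT 175",
    urls,
).fetchall()

print(len(rows))  # → 2
```

In production you'd also want to handle an empty url list (an IN list can't be empty) and cap its size, which is the reason for the suggested limit on the first query.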

scripting commented 1 year ago

Dear friends, that was it.

It's basically history. The very first "river" function I wrote did it the way it's spec'd at the top of this thread.

Later, when I added categories, I figured out a faster way.

Never went back and put it in the basic river function because I never saw the slowness myself, and the testers all used categories too.

How to test

  1. Go to Settings.
  2. Empty out the first two strings in the Categories tab and Save.
  3. Choose My news from the first menu.

If you manage to do that before I update the server it will be so slow that it'll time out.

If you use it after I update the server (I'm going to do it as the next thing) it'll be as fast as if it were a category, because it was treated that way.

scripting commented 1 year ago

I just updated the server, ran the test before and after, and the problem appears to be solved.

Hooray! Thanks to everyone who helped, esp @mterenzio.

scripting commented 1 year ago

BTW if anyone wants to get an idea how the pieces fit together, I just updated the FeedLand platform source.

https://github.com/scripting/feedland

Changes were made to

  1. database/database.js
  2. feedland.js