Revise database structure

jackjamieson2 commented 5 years ago

I thought I'd created this issue already, but I guess I was incorrect

Currently, feed items are stored as a hidden post type in the WP_Posts table. A few people have commented that this could be inconvenient, which I agree with.

Specifically:

For blogs that use post IDs in permalinks, this can lead to wildly inflated post counts. e.g. post IDs for my site now number in the thousands, which is far more than I actually have posts.
If exporting the WP_posts table to move to a new site, the Yarns feed items will be included. This is probably not what anyone wants. This can be resolved if the user uninstalls Yarns before exporting (see #71) but that's not ideal.

So, I think at some point I should create a custom table for Yarns items.

The reason for using WP_Posts in the first place was to be able to use WordPress's WP_Query function, so I imagine I'd have to write some new functions to replace it.

Given this will take a bit of time, this is not a top priority for me, and instead I'll plan to wait until I have a bit of extra time to dedicate to this

JasonHoffmann commented 5 years ago

@jackjamieson2 hey there. Been trying this out the last few days (and been liking it a lot, by the way), and this was one of the things I've noticed as well. Since this isn't a super high priority for you, I might be able to take this one, and least get a branch going that gets some of that initial work setup. Let me know if that's something you want some help with and we can talk DB table structure.

dshanske commented 5 years ago

@JasonHoffmann and @jackjamieson2 I think we need to discuss what should be in the database structure by default. As changing it later is hard.

jackjamieson2 commented 5 years ago

@JasonHoffmann, Thanks for offering to help out with this. I agree with @dshanske that we should talk through the db structure carefully so we can minimize the need to make changes in the future.

Current structure for feed items

Feed items are stored as posts with a post type of yarns_microsub_post. These are hidden from the dashboard and rest of the site

These posts have custom taxonomy to store: • Channel (yarns_microsub_post_channel): A string representing the channel into which the post was saved • Type (yarns_microsub_post_type): A string representing post type (article, note, reply, photo, checkin, etc.)

I also registered post statuses to track the read/unread status of each feed item, yarns_unread and yarns_read

And I rely on the following properties stock to wordpress posts: • Date published • Date updated (I don't really do anything with this yet, but I'd like to preserve it to have the option to sort by date updated) • Post ID - a unique identifier for each post • Post Title: This is stored as {channel UID}|{permalink for the post} -- The rationale for this choice was that it provides a simple way to check if a feed item already exists, i.e. if a given channel already contains a feed pointing to that permalink. This should be reconsidered and a custom database can do a better job The content of each post is saved as json in the meta field, yarns_microsub_json

^^ That's all described in class-yarns-microsub-posts.php

Future structure ideas:

ID: A unique identifier for each post. Probably just use an integer, though I wonder if a hashed string could be a better choice

Date Published - We should be able to sort by this column

Date updated - Is this necessary? Should we be able to sort by this column? If not, it can probably be dropped

Permalink: The URL of the feed item

Channel: An array of strings, each representing a channel to which this item was added

Type: A string representing the post type. We must be able to filter by post type when returning timelines.

Unread: A boolean representing if the post is read or unread.

JSON: A JSON representation of the post content to be returned when serving timelines.

My rationale for having a single field with the JSON for each item is that it allows flexibility to add additional properties in the future without having to update the database structure.
The rationale for also having dedicated columns for some properties is that it should be simpler and faster to sort/filter by these properties.
There is a clear downside, which is that duplication will increase the database size.

To be honest, though, I'm not really sure about this structure, which is why I want to take some time to talk/think it through in more detail.

Additional considerations

This would require editing several functions:

Getting timelines
- Filtering by post type
- Sorting by date
- Getting post content
- Getting unread feed item counts per channel
Marking posts read/unread
Adding/deleting feed items
Checking if feed items already exist
Others?

Any changes should consider future goals like #65

JasonHoffmann commented 5 years ago

@jackjamieson2 thanks for the breakdown. I completely agree that this needs to be thought through and if you'd rather come back to this while you focus on other things, I totally get.

I think if you were to use a separate database table, it's an opportunity to take some of that JSON and move it into separate rows. Reasons for doing so being:

It is easier and more efficient to add and retrieve data. This is basically what MySQL tables are setup to do so might as well take advantage. So for instance, with #65 it would be easier to query against the name and summary for each piece of content
If you need to sort by that data later, you have that option
Changing the database structure later can be a bit difficult, but if you plan ahead a bit you can mitigate that. We can, for instance, store an option called yarns_db_version, and if the database ever needs to be updated, just check against that option value, and update the table and version number if that's required.

That being said, storing a JSON blob is exactly what Aperture is doing (https://github.com/aaronpk/Aperture/blob/master/aperture/app/Entry.php#L12) and seeing as how there's a lot of arbitrary data being used, you wouldn't want to lose that altogether.

So maybe it would be best to have three tables.

yarns_posts: Includes the main feed items, and includes:

ID
date_published
permalink
type
read
author: This would be a simple string, see below for where a complex array of data would be stored
name
summary
source_url: Possible substitute for publication or url when necessary. Again, can make querying against this field far easier. There's an argument to be made to break this out into a separate table, but that might be a bit much.
additional_data: This field might give us a best of both worlds with your suggestion about JSON. This is where we can store a chunk of JSON for information that won't be queried against, such as additional author information, org, and can be used to add new fields on the fly without needing a database upgrade

yarns_channels: Includes a list of channels

ID
Name
last_polled
UID
items (anything else needed for this?)

yarns_channels_relationships: Join table to store connections between yarn posts and their associated channel to make querying, for instance, all yarn feed items that belong to X channel, very easy. You mentioned wanting to quickly look up a permalink to see if it's already been added to a channel? Since permalinks are guaranteed to be unique, that can be used as the post_id here so it would be an incredibly efficient and quick lookup.

post_id
channel_uid
order (only include if there's ever a plan to re-sort items across channels)

And then that can be used to create the proper helper functions you'd need.

Anyway, let me know your thoughts! This is completely theoretical and entirely my opinion, so I'd be curious @jackjamieson2 and @dshanske what you think.

jackjamieson2 commented 5 years ago

Thanks @JasonHoffmann, this is really helpful

I think if you were to use a separate database table, it's an opportunity to take some of that JSON and move it into separate rows. ...

* Changing the database structure later can be a bit difficult, but if you plan ahead a bit you can mitigate that. We can, for instance, store an option called `yarns_db_version`, and if the database ever needs to be updated, just check against that option value, and update the table and version number if that's required.

Agreed. I think though I'd want to make sure we're only putting what we have to into separate columns (i.e. properties we'd want to filter or sort against).

That being said, storing a JSON blob is exactly what Aperture is doing (https://github.com/aaronpk/Aperture/blob/master/aperture/app/Entry.php#L12) and seeing as how there's a lot of arbitrary data being used, you wouldn't want to lose that altogether.

Agreed

So maybe it would be best to have three tables.

yarns_posts: Includes the main feed items, and includes:

* `ID`

* `date_published`

* `permalink`

* `type`

* `read`

* `author`: This would be a simple string, see below for where a complex array of data would be stored

* `name`

* `summary`

* `source_url`: Possible substitute for `publication` or `url` when necessary. Again, can make querying against this field far easier. There's an argument to be made to break this out into a separate table, but that might be a bit much.

* `additional_data`: This field might give us a best of both worlds with your suggestion about JSON. This is where we can store a chunk of JSON for information that won't be queried against, such as additional author information, `org`, and can be used to add new fields on the fly without needing a database upgrade

yarns_channels: Includes a list of channels

* `ID`

* `Name`

* `last_polled`

Remove last_polled, since this is done per feed.

* `UID`

* `items`
  (anything else needed for this?)

Each feed would need the following:

URL
_last_polled: timestamp indicating when feed was last polled
_poll_frequency: how often to poll this feed
_empty_poll_count: used to calculate _poll_frequency (less frequent polling if feeds are not updated often) Additionally, it might be nice to have the following:
display_name - A user-editable display name for the feed

Back to yarns_channels One of way to implement #65 and #20 etc. would be to define special channels (and/or add modifying filters to existing channels). I've thought this would require some additional properties for each channel. Since I haven't really decided how to implement filters and related features, I'm not sure how to best store this

yarns_channels_relationships: Join table to store connections between yarn posts and their associated channel to make querying, for instance, all yarn feed items that belong to X channel, very easy. You mentioned wanting to quickly look up a permalink to see if it's already been added to a channel? Since permalinks are guaranteed to be unique, that can be used as the post_id here so it would be an incredibly efficient and quick lookup.
* `post_id`

* `channel_uid`

* `order` (only include if there's ever a plan to re-sort items across channels)
Order is used to sort the channels themselves. In the current structure channels are written as JSON, and the order can be inferred while reading. But using an order column would probably be best if restructuring to a table.

A join table is a good idea, though I wonder if this database is complicated enough to need it. I suspect adding a channels property to each feed item would be sufficient.

For that matter, although a table seems like a good fit for representing channels and feeds, the current solution, JSON stored to the yarns_channels option, is working well. When it comes to the feed items, there are some clear benefits to switching to a custom table. I don't know if I see the same return on investment for time spent restructuring channels and feed data as well

jackjamieson2 commented 5 years ago

Further support for switching to a custom table for feed items:

Mass marking posts as read has a slow execution time. Probably room for optimization even with current storage method, but a better data structure could speed this up.
When deactivating Yarns, it deletes all feed items. If there are a lot of feed items this could cause "Fatal error: Maximum execution time of 30 seconds exceeded" - Pretty poor user experience! Presumably, this would be much faster with a custom table.

dshanske commented 5 years ago

Not necessarily. There are a lot of execution factors with large datasets

jackjamieson2 commented 5 years ago

@dshanske True, but I think worth investigating further. However, operations that affect large portions of the data like the two examples I listed are fairly rare.

If things seem to get too slow, it could also be a good idea to decrease the default retention period for post items. I had it set to 25 days on my install, which is way longer than I actually need

JasonHoffmann commented 5 years ago

@jackjamieson2 okay cool, I think what you made sense. For some reason in my head, feed items and channels had a many to many relationship (feed items could belong to multiple channels). But yea, a join table is likely overkill since that's not the case. So you could have a channel_id row in the yarns_posts table, that just references the single ID of the channel. From there, keeping channels and sources in an option is fine for now, and that can always be upgraded to a separate table in the future.

So that would make it one single table, called yarns_posts, with the fields:

ID
date_published
permalink
type
read
author
summary
source_url
channel_uid
additional_data

@dshanske in reference to:

Not necessarily. There are a lot of execution factors with large datasets

That's somewhat true, but not really what we're dealing with here, since the alternative at the moment is to store everything in posts and postmeta which certainly have greater performance impocts. If we index the read column, and make it a boolean value, it will perform quite a bit better than a meta_query lookup. I'm not sure at what point this would ever be slow, even with tens of thousands (or more) of entries. And in terms of deactivation, a DROP TABLE will be executed more efficiently than trying to delete all of a certain post_type.

jackjamieson2 commented 5 years ago

To clarify, it is possible to subscribe to the same feed in multiple channels, so feed items should be able to handle multip;e channel IDs. But I think it would be simple to store multiple channel uids as a comma-separated list, json-encoded array, or similar. I imagine that might introduce a performance hit compared to a join table, but I imagine it would be minor.

On June 26, 2019 at 7:32:47 PM, Jay Hoffmann (notifications@github.com) wrote:

@jackjamieson2 https://github.com/jackjamieson2 okay cool, I think what you made sense. For some reason in my head, feed items and channels had a many to many relationship (feed items could belong to multiple channels). But yea, a join table is likely overkill since that's not the case. So you could have a channel_id row in the yarns_posts table, that just references the single ID of the channel. From there, keeping channels and sources in an option is fine for now, and that can always be upgraded to a separate table in the future.

So that would make it one single table, called yarns_posts, with the fields:

ID
date_published
permalink
type
read
author
summary
source_url
channel_uid
additional_data

@dshanske https://github.com/dshanske in reference to:

Not necessarily. There are a lot of execution factors with large datasets

That's somewhat true, but not really what we're dealing with here, since the alternative at the moment is to store everything in posts and postmeta which certainly have greater performance impocts. If we index the read column, and make it a boolean value, it will perform quite a bit better than a meta_query lookup. I'm not sure at what point this would ever be slow, even with tens of thousands (or more) of entries. And in terms of deactivation, a DROP TABLE will be executed more efficiently than trying to delete all of a certain post_type.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jackjamieson2/yarns-microsub-server/issues/82?email_source=notifications&email_token=AC5U5R6M23VVUOSL44CBPR3P4P4B5A5CNFSM4H2VYDRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVDDYI#issuecomment-506081761, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5U5R4W7GXZLWZGZ5KO5T3P4P4B5ANCNFSM4H2VYDRA .

JasonHoffmann commented 5 years ago

Oh okay, thanks for the clarification. I think it would be a bit tough to do this without a join table, and I'm not exactly sure how it would be handled. Thinking through the SQL, if the relationship where one to many it would be something like (rough pseudo-code):

SELECT * FROM yarns_posts WHERE channel_id=XXX

But since it's many to many if you store it in, let's say a comma separated list, it would have to be:

SELECT * FROM yarns_posts WHERE channel_id LIKE %XXX%

Or something similar to above. A LIKE query like that is far, far slower than either a direct query, or a query from a JOIN table. And it also can cause some issues, for instance, if there is some ID overlap, so that would be a bit worrisome. We could still use a JOIN table without the separate channels table, and just leave that in the option for now, but I would still recommend using the join table, which is what WordPress does for taxonomies.

All of that being said, if you'd prefer to continue without it, let me know. I'm not sure when I'll have time to work on this, but hopefully can get started over the next few weeks.

jackjamieson2 commented 5 years ago

Thanks Jay, I didn’t realize there was such a performance difference for that kind of query. In which case, I see what you mean. There shouldn’t be any ID overlap though, since IDs are unique hashes.

I’ll be occupied with Indie Web Summit through the weekend (feel free to join in remotely if you’re available!), and then seeing family for a bit, so I’m in no hurry. And since @dshanske has become my de facto WordPress/IndieWeb mentor, I’d like to talk with him about it too. Since it’s a foundational change, I just want to be careful to avoid potential future issues.

My overall sense is that this design makes sense, and I’m happy to have your input and help!

On June 26, 2019 at 8:31:19 PM, Jay Hoffmann (notifications@github.com) wrote:

Oh okay, thanks for the clarification. I think it would be a bit tough to do this without a join table, and I'm not exactly sure how it would be handled. Thinking through the SQL, if the relationship where one to many it would be something like (rough pseudo-code):

SELECT * FROM yarns_posts WHERE channel_id=XXX

But since it's many to many if you store it in, let's say a comma separated list, it would have to be:

SELECT * FROM yarns_posts WHERE channel_id LIKE %XXX%

Or something similar to above. A LIKE query like that is far, far slower than either a direct query, or a query from a JOIN table. And it also can cause some issues, for instance, if there is some ID overlap, so that would be a bit worrisome. We could still use a JOIN table without the separate channels table, and just leave that in the option for now, but I would still recommend using the join table, which is what WordPress does for taxonomies.

All of that being said, if you'd prefer to continue without it, let me know. I'm not sure when I'll have time to work on this, but hopefully can get started over the next few weeks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jackjamieson2/yarns-microsub-server/issues/82?email_source=notifications&email_token=AC5U5RZROGWL5X6IUQL6AS3P4QC5PA5CNFSM4H2VYDRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVFZBQ#issuecomment-506092678, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5U5R6FHGLMBR5Q32QUBHDP4QC5PANCNFSM4H2VYDRA .

JasonHoffmann commented 5 years ago

@jackjamieson2 Sounds good, no rush here either so let's come back to it at some point in the future.

I'll be checking in on the indieweb summit as much as I can, mostly just watching. Curious to see what comes out!

dshanske commented 3 years ago

I've added #110 to the milestone for version 1.1.0, which I'm going to be working on, as a start to this. I want to make it easier to change the storage method when the time comes, and being as I want to add some unit testing, it seems like this would allow for that.

dshanske commented 3 years ago

I'm revisiting this in light of https://indieweb.org/2021/Pop-ups/Microsub

The proposal that is going forward is the introduction of a Source ID. There are a lot of useful reasons for this, one being you won't always have a public permalink. So, your posts table has, instead of a channel ID, a source ID.

You have a channel table, which has the channel data, including which sources are in each channel. And your query ends up being to query the channel table to get the sources in the feed, then query the entries table for posts that have that source ID.

You have a source table, which has the data for the individual sources.

So, that's back to 3 tables. Thoughts?

jackjamieson2 commented 3 years ago

That sounds good to me. It seems like, given the current implementation where posts are stored as a custom post type, and the post details are recorded as a JSON string in a custom field, it would be slow to filter by any properties that are within that JSON.

Of course this could be handled by making source ID a custom field for each post (i.e. instead of in the JSON). And, of course this also lends support to @JasonHoffmann's proposals above which would allow for flexible filtering

bekopharm commented 2 years ago

I was in the process of cleaning up my subscriptions by removing several items from my channels. Turns out the channels were not really deleted despite the backend coming back with HTML 200 each time. A manual refresh re-listed the deleted channels.

I tracked it down to the options table where option_name yarns_channels holding all subbed items grew so huge that my max_allowed_packet of 128MB in my MySQL config killed the connection. Doubling the limit lets me finally delete channels again.

Picture me surprised. BTW I did start cleaning this up after the PHP processes regularly started to run out of max memory when retrieving channels via my reader. Lifted php_admin_value[memory_limit] to kinda insane 1G now.

The current structure is seriously bottle-necked.

dshanske commented 2 years ago

My work on the redesign took this out of the option table...I have been occupied with other things... but will return someday

jackjamieson2 / yarns-microsub-server