Open jackjamieson2 opened 5 years ago
@jackjamieson2 hey there. Been trying this out the last few days (and been liking it a lot, by the way), and this was one of the things I've noticed as well. Since this isn't a super high priority for you, I might be able to take this one, and least get a branch going that gets some of that initial work setup. Let me know if that's something you want some help with and we can talk DB table structure.
@JasonHoffmann and @jackjamieson2 I think we need to discuss what should be in the database structure by default. As changing it later is hard.
@JasonHoffmann, Thanks for offering to help out with this. I agree with @dshanske that we should talk through the db structure carefully so we can minimize the need to make changes in the future.
Feed items are stored as posts with a post type of yarns_microsub_post
. These are hidden from the dashboard and rest of the site
These posts have custom taxonomy to store:
• Channel (yarns_microsub_post_channel
): A string representing the channel into which the post was saved
• Type (yarns_microsub_post_type
): A string representing post type (article, note, reply, photo, checkin, etc.)
I also registered post statuses to track the read/unread status of each feed item, yarns_unread
and yarns_read
And I rely on the following properties stock to wordpress posts:
• Date published
• Date updated (I don't really do anything with this yet, but I'd like to preserve it to have the option to sort by date updated)
• Post ID - a unique identifier for each post
• Post Title: This is stored as {channel UID}|{permalink for the post}
-- The rationale for this choice was that it provides a simple way to check if a feed item already exists, i.e. if a given channel already contains a feed pointing to that permalink. This should be reconsidered and a custom database can do a better job
The content of each post is saved as json in the meta field, yarns_microsub_json
^^ That's all described in class-yarns-microsub-posts.php
ID: A unique identifier for each post. Probably just use an integer, though I wonder if a hashed string could be a better choice
Date Published - We should be able to sort by this column
Date updated - Is this necessary? Should we be able to sort by this column? If not, it can probably be dropped
Permalink: The URL of the feed item
Channel: An array of strings, each representing a channel to which this item was added
Type: A string representing the post type. We must be able to filter by post type when returning timelines.
Unread: A boolean representing if the post is read or unread.
JSON: A JSON representation of the post content to be returned when serving timelines.
To be honest, though, I'm not really sure about this structure, which is why I want to take some time to talk/think it through in more detail.
This would require editing several functions:
Any changes should consider future goals like #65
@jackjamieson2 thanks for the breakdown. I completely agree that this needs to be thought through and if you'd rather come back to this while you focus on other things, I totally get.
I think if you were to use a separate database table, it's an opportunity to take some of that JSON and move it into separate rows. Reasons for doing so being:
yarns_db_version
, and if the database ever needs to be updated, just check against that option value, and update the table and version number if that's required.That being said, storing a JSON blob is exactly what Aperture is doing (https://github.com/aaronpk/Aperture/blob/master/aperture/app/Entry.php#L12) and seeing as how there's a lot of arbitrary data being used, you wouldn't want to lose that altogether.
So maybe it would be best to have three tables.
yarns_posts
: Includes the main feed items, and includes:
ID
date_published
permalink
type
read
author
: This would be a simple string, see below for where a complex array of data would be storedname
summary
source_url
: Possible substitute for publication
or url
when necessary. Again, can make querying against this field far easier. There's an argument to be made to break this out into a separate table, but that might be a bit much.additional_data
: This field might give us a best of both worlds with your suggestion about JSON. This is where we can store a chunk of JSON for information that won't be queried against, such as additional author information, org
, and can be used to add new fields on the fly without needing a database upgradeyarns_channels
: Includes a list of channels
ID
Name
last_polled
UID
items
(anything else needed for this?)yarns_channels_relationships
: Join table to store connections between yarn posts and their associated channel to make querying, for instance, all yarn feed items that belong to X channel, very easy. You mentioned wanting to quickly look up a permalink to see if it's already been added to a channel? Since permalinks are guaranteed to be unique, that can be used as the post_id
here so it would be an incredibly efficient and quick lookup.
post_id
channel_uid
order
(only include if there's ever a plan to re-sort items across channels)And then that can be used to create the proper helper functions you'd need.
Anyway, let me know your thoughts! This is completely theoretical and entirely my opinion, so I'd be curious @jackjamieson2 and @dshanske what you think.
Thanks @JasonHoffmann, this is really helpful
I think if you were to use a separate database table, it's an opportunity to take some of that JSON and move it into separate rows. ...
* Changing the database structure later can be a bit difficult, but if you plan ahead a bit you can mitigate that. We can, for instance, store an option called `yarns_db_version`, and if the database ever needs to be updated, just check against that option value, and update the table and version number if that's required.
Agreed. I think though I'd want to make sure we're only putting what we have to into separate columns (i.e. properties we'd want to filter or sort against).
That being said, storing a JSON blob is exactly what Aperture is doing (https://github.com/aaronpk/Aperture/blob/master/aperture/app/Entry.php#L12) and seeing as how there's a lot of arbitrary data being used, you wouldn't want to lose that altogether.
Agreed
So maybe it would be best to have three tables.
yarns_posts
: Includes the main feed items, and includes:* `ID` * `date_published` * `permalink` * `type` * `read` * `author`: This would be a simple string, see below for where a complex array of data would be stored * `name` * `summary` * `source_url`: Possible substitute for `publication` or `url` when necessary. Again, can make querying against this field far easier. There's an argument to be made to break this out into a separate table, but that might be a bit much. * `additional_data`: This field might give us a best of both worlds with your suggestion about JSON. This is where we can store a chunk of JSON for information that won't be queried against, such as additional author information, `org`, and can be used to add new fields on the fly without needing a database upgrade
yarns_channels
: Includes a list of channels* `ID` * `Name` * `last_polled`
Remove last_polled, since this is done per feed.
* `UID` * `items` (anything else needed for this?)
Each feed would need the following:
URL
_last_polled
: timestamp indicating when feed was last polled_poll_frequency
: how often to poll this feed_empty_poll_count
: used to calculate _poll_frequency (less frequent polling if feeds are not updated often)
Additionally, it might be nice to have the following:display_name
- A user-editable display name for the feedBack to yarns_channels
One of way to implement #65 and #20 etc. would be to define special channels (and/or add modifying filters to existing channels). I've thought this would require some additional properties for each channel.
Since I haven't really decided how to implement filters and related features, I'm not sure how to best store this
yarns_channels_relationships
: Join table to store connections between yarn posts and their associated channel to make querying, for instance, all yarn feed items that belong to X channel, very easy. You mentioned wanting to quickly look up a permalink to see if it's already been added to a channel? Since permalinks are guaranteed to be unique, that can be used as thepost_id
here so it would be an incredibly efficient and quick lookup.* `post_id` * `channel_uid` * `order` (only include if there's ever a plan to re-sort items across channels)
Order is used to sort the channels themselves. In the current structure channels are written as JSON, and the order can be inferred while reading. But using an
order
column would probably be best if restructuring to a table.
A join table is a good idea, though I wonder if this database is complicated enough to need it. I suspect adding a channels
property to each feed item would be sufficient.
For that matter, although a table seems like a good fit for representing channels and feeds, the current solution, JSON stored to the yarns_channels
option, is working well. When it comes to the feed items, there are some clear benefits to switching to a custom table. I don't know if I see the same return on investment for time spent restructuring channels and feed data as well
Further support for switching to a custom table for feed items:
Not necessarily. There are a lot of execution factors with large datasets
@dshanske True, but I think worth investigating further. However, operations that affect large portions of the data like the two examples I listed are fairly rare.
If things seem to get too slow, it could also be a good idea to decrease the default retention period for post items. I had it set to 25 days on my install, which is way longer than I actually need
@jackjamieson2 okay cool, I think what you made sense. For some reason in my head, feed items and channels had a many to many relationship (feed items could belong to multiple channels). But yea, a join table is likely overkill since that's not the case. So you could have a channel_id
row in the yarns_posts
table, that just references the single ID of the channel. From there, keeping channels and sources in an option is fine for now, and that can always be upgraded to a separate table in the future.
So that would make it one single table, called yarns_posts
, with the fields:
ID
date_published
permalink
type
read
author
summary
source_url
channel_uid
additional_data
@dshanske in reference to:
Not necessarily. There are a lot of execution factors with large datasets
That's somewhat true, but not really what we're dealing with here, since the alternative at the moment is to store everything in posts
and postmeta
which certainly have greater performance impocts. If we index the read
column, and make it a boolean value, it will perform quite a bit better than a meta_query
lookup. I'm not sure at what point this would ever be slow, even with tens of thousands (or more) of entries. And in terms of deactivation, a DROP TABLE
will be executed more efficiently than trying to delete all of a certain post_type
.
To clarify, it is possible to subscribe to the same feed in multiple channels, so feed items should be able to handle multip;e channel IDs. But I think it would be simple to store multiple channel uids as a comma-separated list, json-encoded array, or similar. I imagine that might introduce a performance hit compared to a join table, but I imagine it would be minor.
On June 26, 2019 at 7:32:47 PM, Jay Hoffmann (notifications@github.com) wrote:
@jackjamieson2 https://github.com/jackjamieson2 okay cool, I think what you made sense. For some reason in my head, feed items and channels had a many to many relationship (feed items could belong to multiple channels). But yea, a join table is likely overkill since that's not the case. So you could have a channel_id row in the yarns_posts table, that just references the single ID of the channel. From there, keeping channels and sources in an option is fine for now, and that can always be upgraded to a separate table in the future.
So that would make it one single table, called yarns_posts, with the fields:
ID
date_published
permalink
type
read
author
summary
source_url
channel_uid
additional_data
@dshanske https://github.com/dshanske in reference to:
Not necessarily. There are a lot of execution factors with large datasets
That's somewhat true, but not really what we're dealing with here, since the alternative at the moment is to store everything in posts and postmeta which certainly have greater performance impocts. If we index the read column, and make it a boolean value, it will perform quite a bit better than a meta_query lookup. I'm not sure at what point this would ever be slow, even with tens of thousands (or more) of entries. And in terms of deactivation, a DROP TABLE will be executed more efficiently than trying to delete all of a certain post_type.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jackjamieson2/yarns-microsub-server/issues/82?email_source=notifications&email_token=AC5U5R6M23VVUOSL44CBPR3P4P4B5A5CNFSM4H2VYDRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVDDYI#issuecomment-506081761, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5U5R4W7GXZLWZGZ5KO5T3P4P4B5ANCNFSM4H2VYDRA .
Oh okay, thanks for the clarification. I think it would be a bit tough to do this without a join table, and I'm not exactly sure how it would be handled. Thinking through the SQL, if the relationship where one to many it would be something like (rough pseudo-code):
SELECT * FROM yarns_posts WHERE channel_id=XXX
But since it's many to many if you store it in, let's say a comma separated list, it would have to be:
SELECT * FROM yarns_posts WHERE channel_id LIKE %XXX%
Or something similar to above. A LIKE
query like that is far, far slower than either a direct query, or a query from a JOIN table. And it also can cause some issues, for instance, if there is some ID overlap, so that would be a bit worrisome. We could still use a JOIN table without the separate channels
table, and just leave that in the option for now, but I would still recommend using the join table, which is what WordPress does for taxonomies.
All of that being said, if you'd prefer to continue without it, let me know. I'm not sure when I'll have time to work on this, but hopefully can get started over the next few weeks.
Thanks Jay, I didn’t realize there was such a performance difference for that kind of query. In which case, I see what you mean. There shouldn’t be any ID overlap though, since IDs are unique hashes.
I’ll be occupied with Indie Web Summit through the weekend (feel free to join in remotely if you’re available!), and then seeing family for a bit, so I’m in no hurry. And since @dshanske has become my de facto WordPress/IndieWeb mentor, I’d like to talk with him about it too. Since it’s a foundational change, I just want to be careful to avoid potential future issues.
My overall sense is that this design makes sense, and I’m happy to have your input and help!
On June 26, 2019 at 8:31:19 PM, Jay Hoffmann (notifications@github.com) wrote:
Oh okay, thanks for the clarification. I think it would be a bit tough to do this without a join table, and I'm not exactly sure how it would be handled. Thinking through the SQL, if the relationship where one to many it would be something like (rough pseudo-code):
SELECT * FROM yarns_posts WHERE channel_id=XXX
But since it's many to many if you store it in, let's say a comma separated list, it would have to be:
SELECT * FROM yarns_posts WHERE channel_id LIKE %XXX%
Or something similar to above. A LIKE query like that is far, far slower than either a direct query, or a query from a JOIN table. And it also can cause some issues, for instance, if there is some ID overlap, so that would be a bit worrisome. We could still use a JOIN table without the separate channels table, and just leave that in the option for now, but I would still recommend using the join table, which is what WordPress does for taxonomies.
All of that being said, if you'd prefer to continue without it, let me know. I'm not sure when I'll have time to work on this, but hopefully can get started over the next few weeks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jackjamieson2/yarns-microsub-server/issues/82?email_source=notifications&email_token=AC5U5RZROGWL5X6IUQL6AS3P4QC5PA5CNFSM4H2VYDRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVFZBQ#issuecomment-506092678, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5U5R6FHGLMBR5Q32QUBHDP4QC5PANCNFSM4H2VYDRA .
@jackjamieson2 Sounds good, no rush here either so let's come back to it at some point in the future.
I'll be checking in on the indieweb summit as much as I can, mostly just watching. Curious to see what comes out!
I've added #110 to the milestone for version 1.1.0, which I'm going to be working on, as a start to this. I want to make it easier to change the storage method when the time comes, and being as I want to add some unit testing, it seems like this would allow for that.
I'm revisiting this in light of https://indieweb.org/2021/Pop-ups/Microsub
The proposal that is going forward is the introduction of a Source ID. There are a lot of useful reasons for this, one being you won't always have a public permalink. So, your posts table has, instead of a channel ID, a source ID.
You have a channel table, which has the channel data, including which sources are in each channel. And your query ends up being to query the channel table to get the sources in the feed, then query the entries table for posts that have that source ID.
You have a source table, which has the data for the individual sources.
So, that's back to 3 tables. Thoughts?
That sounds good to me. It seems like, given the current implementation where posts are stored as a custom post type, and the post details are recorded as a JSON string in a custom field, it would be slow to filter by any properties that are within that JSON.
Of course this could be handled by making source ID a custom field for each post (i.e. instead of in the JSON). And, of course this also lends support to @JasonHoffmann's proposals above which would allow for flexible filtering
I was in the process of cleaning up my subscriptions by removing several items from my channels. Turns out the channels were not really deleted despite the backend coming back with HTML 200 each time. A manual refresh re-listed the deleted channels.
I tracked it down to the options
table where option_name
yarns_channels
holding all subbed items grew so huge that my max_allowed_packet
of 128MB
in my MySQL config killed the connection. Doubling the limit lets me finally delete channels again.
Picture me surprised. BTW I did start cleaning this up after the PHP processes regularly started to run out of max memory when retrieving channels via my reader. Lifted php_admin_value[memory_limit]
to kinda insane 1G
now.
The current structure is seriously bottle-necked.
My work on the redesign took this out of the option table...I have been occupied with other things... but will return someday
I thought I'd created this issue already, but I guess I was incorrect
Currently, feed items are stored as a hidden post type in the WP_Posts table. A few people have commented that this could be inconvenient, which I agree with.
Specifically:
So, I think at some point I should create a custom table for Yarns items.
The reason for using WP_Posts in the first place was to be able to use WordPress's WP_Query function, so I imagine I'd have to write some new functions to replace it.
Given this will take a bit of time, this is not a top priority for me, and instead I'll plan to wait until I have a bit of extra time to dedicate to this