mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Import case studies from blog #6589

Closed garethrees closed 1 year ago

garethrees commented 2 years ago

In general we feel we should make more use of the case studies we write for our blog (https://github.com/mysociety/alaveteli-project/issues/143).

One way of doing this would be to give them more prominence within the site itself (as a more "manual" version of https://github.com/mysociety/alaveteli/issues/6253).

This could/should be combined with adding more illustration and graphical content (https://github.com/mysociety/alaveteli/issues/6586) by accompanying the posts with a small thumbnail.

As it stands, we only import a limited number of posts via the blog feed (https://github.com/mysociety/alaveteli/issues/3663) and while the link to the blog is in the main navigation, I imagine it's not massively frequented.

MVP

We have some custom code on AskTheEU to import the blog posts to the front page. At a minimum we should build this into Alaveteli.

Screenshot 2021-10-20 at 16 35 39

Blog posts everywhere

I also want to be able to dynamically pull in blog posts in appropriate places. For example, on /pro/pages/researchers I might want to import case studies tagged #researchers.

On authority pages, we might key the imported posts based on the public body category or tags. For example, when visiting a "school" body, the case study about asbestos in schools might be pulled in.

Screenshot 2022-12-13 at 11 24 31

Pulling in tags

We should build a generic helper that imports posts with given tags. This should only attempt to fetch content if the BLOG_FEED is configured.

blog_posts(tagged: %w[case-study campaigning])

Let's start with fetching the blog title, url, thumbnail and an excerpt of the body content.
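A minimal sketch of what such a helper might look like, written in plain Ruby so it runs outside Rails. The `feed_url:` and `fetcher:` keyword arguments, the `BlogPost` struct and the 200-character excerpt length are all assumptions for illustration. In Alaveteli the URL would presumably come from the BLOG_FEED config, and tag filtering is assumed to be handled by whatever URL the fetcher requests. Thumbnails are left out.

```ruby
require 'rexml/document'

# Hypothetical value object for the fields we want from each post.
BlogPost = Struct.new(:title, :url, :excerpt, keyword_init: true)

# Parse RSS 2.0 XML into BlogPost objects, stripping markup from the excerpt.
def parse_blog_feed(xml, excerpt_length: 200)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item').map do |item|
    text_of = ->(name) { (item.elements[name]&.text).to_s.strip }
    BlogPost.new(
      title: text_of.call('title'),
      url: text_of.call('link'),
      excerpt: text_of.call('description').gsub(/<[^>]+>/, '').strip[0, excerpt_length]
    )
  end
end

# Returns [] without fetching anything when no feed is configured.
# `fetcher` is any callable taking (feed_url, tags) and returning feed XML.
def blog_posts(tagged: [], feed_url: nil, fetcher: nil)
  return [] if feed_url.to_s.empty? || fetcher.nil?

  parse_blog_feed(fetcher.call(feed_url, Array(tagged)))
end
```

Injecting the fetcher keeps the sketch testable without network access; the real helper would more likely call an HTTP client directly.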

In the school example above, the code might be like:

<!-- a random post that matches the body category -->
<div class="sidebar">
  <% blog_posts(tagged: @public_body.categories.map(&:category_tag)).sample(1).each do |post| %>
    <%= render partial: 'blog/post_sidebar', locals: { post: post } %>
  <% end %>
</div>

<!-- 3 most recent posts tagged "case-study" that also match any of the body tags  -->
<div class="sidebar">
  <% blog_posts(tagged: @public_body.tags.map(&:name) + ['case-study']).take(3).each do |post| %>
    <%= render partial: 'blog/post_sidebar', locals: { post: post } %>
  <% end %>
</div>

The same principle would apply for request tags.

Rabbit holes to be careful of

I don't know how possible this is with WordPress – worst case scenario is that we only allow a single tag as an argument. Seems like creating an RSS feed for posts that match all given tags is possible in WordPress, but to dynamically pull in content based on record tags (see above) we might need some options like:

blog_posts(match_all_tags: %w[case-study whatdotheyknow], match_any_tag: %w[school wales climate])

I think you can separate tags by commas in WordPress to do an OR: https://www.mysociety.org/tag/components,councils/ and with a + to do an AND: https://www.mysociety.org/tag/westminster+councils/
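As a rough sketch of that URL scheme, assuming the tag archive also exposes a feed at a `/feed/` suffix and that tag slugs are already URL-safe. The `tag_feed_url` name and its `match:` option are hypothetical:

```ruby
# Build a WordPress tag feed URL.
# Tags joined with "," match ANY tag; tags joined with "+" match ALL tags.
def tag_feed_url(site_url, tags, match: :any)
  separator = match == :all ? '+' : ','
  "#{site_url.chomp('/')}/tag/#{tags.join(separator)}/feed/"
end

tag_feed_url('https://www.mysociety.org', %w[components councils])
tag_feed_url('https://www.mysociety.org', %w[westminster councils], match: :all)
```

This only covers one mode per URL; combining "match all" and "match any" in a single request is not something the URLs above demonstrate.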

https://blogs.mysociety.org/internal/2022/03/02/foi-case-studies/#comment-2096

The key aim is getting content that's as relevant as possible without outsized effort. It's not the end of the world if the content isn't quite as related as we'd like. It's a huge improvement that it's there in the first place.

Locations

The key places for rendering these case study snippets are:

Some other places we could consider, but not critical, are:

Dealing with failure

The imported content should be cached for a few days – these don't have to be "fresh" content all the time. We should also have a fallback mechanism that allows the page to render even if there's some connectivity issue with the blog.

If there's no related content, we don't have to render anything. Treat this more like "progressive enhancement". It's great if they're there, but no problem if there's nothing relevant to show.
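One way to picture the caching and fallback behaviour is the sketch below. The `CachedBlogPosts` class, its three-day default and the injectable clock are assumptions for illustration; inside Rails this would more likely be built on `Rails.cache.fetch` with an `expires_in` option.

```ruby
# Hypothetical in-memory cache with a time-to-live and a failure fallback.
class CachedBlogPosts
  THREE_DAYS = 3 * 24 * 60 * 60

  def initialize(ttl: THREE_DAYS, clock: -> { Time.now })
    @ttl = ttl
    @clock = clock
    @store = {}
  end

  # Returns the cached posts for `key`, refreshing via the block when stale.
  # If the block raises (e.g. the blog is unreachable) we fall back to the
  # stale copy, or to an empty list so the page can still render.
  def fetch(key)
    entry = @store[key]
    return entry[:posts] if entry && @clock.call - entry[:at] < @ttl

    posts = yield
    @store[key] = { posts: posts, at: @clock.call }
    posts
  rescue StandardError
    entry ? entry[:posts] : []
  end
end
```

Returning an empty list on failure fits the "progressive enhancement" framing: the view simply renders nothing when there are no posts.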

garethrees commented 2 years ago

Seems like creating an RSS feed for posts that match all given tags is possible in Wordpress.

gbp commented 1 year ago

Missing from the default WordPress feed is:

Pagination

There is no way to fetch older articles. The current feed contains 10 items. We could increase this in the WordPress admin, but given the number of posts we already have we probably shouldn't. It could mean people who use our feeds get a deluge of posts.

Images

The default WordPress feed doesn't include links to the header image, although we have prior art customising the RSS item and adding images to the feed. It's not realistic to ask re-users to make the same changes.

Re-users might be using different software so we should look at a more generic process.

Realistically, we're not going to be able to load all the content needed dynamically, and we will need to look at fetching multiple pages to get everything we need. This means we will need to cache the content outside the web request flow. This is different to how the custom AskTheEU code currently works.

Possible approach

  1. Fetch the blog feed
  2. Loop through each feed post
  3. Fetch post HTML from the post link
  4. Extract data (link, title, last updated at, tags, image, ...) from OpenGraph meta tags if available
  5. Fallback to other post HTML tags or feed post tags if possible
  6. Periodically recheck post links for updates or 404 and update cached content or remove completely.
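Steps 4 and 5 could be sketched roughly as follows. This is a deliberately naive regex scan that assumes double-quoted attributes with `property` before `content`; a real implementation would want a proper HTML parser. The `extract_post_data` name and the shape of `feed_item` are hypothetical.

```ruby
# Naive match for <meta property="og:..." content="..."> tags.
OG_META = /<meta\s[^>]*?property="og:([\w:]+)"[^>]*?content="([^"]*)"/i

# Prefer OpenGraph values, then other HTML tags, then the feed item itself.
def extract_post_data(html, feed_item = {})
  og = html.scan(OG_META).to_h { |key, value| [key.downcase, value] }
  html_title = html[%r{<title[^>]*>(.*?)</title>}im, 1]

  {
    url: og['url'] || feed_item[:url],
    title: og['title'] || html_title&.strip || feed_item[:title],
    image: og['image'] # nil when the page has no og:image
  }
end
```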

I'm envisioning storing this in a Rails model rather than temporarily caching it in the Rails cache. Why?

  1. The amount of data to be aggregated from multiple links
  2. Easier retrieval of the data, e.g. ExternalContent.with_tags etc.
  3. Storing images in ActiveStorage and being able to resize them for the appropriate layout
  4. Possible future integration with improving how Citations are displayed

garethrees commented 1 year ago

There is no way to fetch older articles

Should be paginatable with ?paged=2 (via https://blogs.mysociety.org/internal/2022/03/02/foi-case-studies/#comment-2096).
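A minimal sketch of walking the feed with that parameter, assuming a page with no items signals the end. The `paged_feed_url` and `fetch_all_pages` names and the `max_pages` cap are hypothetical; the block stands in for whatever fetches and parses a single page.

```ruby
require 'uri'

# Add (or replace) the "paged" query parameter on a feed URL.
def paged_feed_url(feed_url, page)
  uri = URI.parse(feed_url)
  params = URI.decode_www_form(uri.query.to_s).reject { |key, _| key == 'paged' }
  uri.query = URI.encode_www_form(params + [['paged', page.to_s]])
  uri.to_s
end

# Collect items page by page. The block receives each page URL and must
# return an array of items; we stop at the first empty page or at max_pages.
def fetch_all_pages(feed_url, max_pages: 5)
  items = []
  (1..max_pages).each do |page|
    page_items = yield paged_feed_url(feed_url, page)
    break if page_items.empty?

    items.concat(page_items)
  end
  items
end
```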

The default WordPress feed doesn't include links to the header image

Ah, annoying. Let's not worry about images for now then. We can either find a generic blog/article icon, or use other UI/styling to indicate that these are articles rather than FOI requests.

Possible approach…

I appreciate the downside in pulling in the feed content dynamically, but the advantage it does have is that we don't need to worry about keeping it up to date. We just pluck from the recent articles that roughly match, and cache it for a while so that other similar requests are quick. If we can't show anything (feed failure, no relevant content), we just don't.

That said, I don't feel strongly about how we get and store the content, and I appreciate that having a database record would make it easier to link up (e.g. @info_request.related_content). Now that we have background jobs a lot of this potentially becomes easier? Perhaps we should create a BlogFeed model (given that's the current config name) that handles the fetching, and then a BlogFeed::Item for the cached content if we take that approach?

I don't think we need to store the entire HTML content/images locally – we're only linking out to the blog post's URL, not rendering the whole article within Alaveteli. If it's easier to just dump the entire <item> XML into the DB and create some accessors to pull out the few bits we're interested in, then that's okay.
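To illustrate the "dump the XML and add accessors" idea, here is a plain-Ruby stand-in for such a record. The `FeedItem` class and its accessor names are hypothetical; in Alaveteli this would presumably be an ActiveRecord model with the raw XML in a text column.

```ruby
require 'rexml/document'
require 'time'

# Hypothetical wrapper whose only stored data is the raw <item> XML.
class FeedItem
  def initialize(raw_xml)
    @item = REXML::Document.new(raw_xml).root
  end

  def title
    text_at('title')
  end

  def url
    text_at('link')
  end

  # RSS 2.0 dates use the RFC 822 format; returns nil when absent.
  def published_at
    value = text_at('pubDate')
    Time.rfc2822(value) unless value.empty?
  end

  private

  def text_at(name)
    (@item.elements[name]&.text).to_s.strip
  end
end
```

One caveat: a real feed item is likely to contain namespaced elements, so a stored fragment may need its namespace declarations kept alongside it to parse cleanly on its own.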

Not super bothered about periodic checks for changed content or 404s – certainly something to leave until the end if we have time. If we spot a problem we can just update or delete from the console. I imagine this will be quite rare. A periodic import of new records sounds fine.

gbp commented 1 year ago

There is no way to fetch older articles

Should be paginatable with ?paged=2 (via https://blogs.mysociety.org/internal/2022/03/02/foi-case-studies/#comment-2096).

Ah. It would be great if that was documented - I couldn't find anything! Other blog platforms do pagination in a different way, and I'm not sure we can rely on them supporting the paged param. I was expecting to see next, prev, first and last meta tags. Maybe I should focus on WordPress and WDTK by just implementing this in our theme for now.

The default WordPress feed doesn't include links to the header image

Ah, annoying. Let's not worry about images for now then. We can either find a generic blog/article icon, or use other UI/styling to indicate that these are articles rather than FOI requests.

Okay. No image for now will let the designers come up with the best way to display these.

garethrees commented 1 year ago

Ah. It would be great if that was documented

Ah, is this not a standard WordPress thing?

Maybe I should focus on WordPress and WDTK by just implementing this in our theme for now

I think implement in core – but yeah, assume WordPress for now as that's the only "officially supported" blog feed.

gbp commented 1 year ago

Even if we just support WordPress I'm not sure how we're going to implement this without needing to override the theme.

How we've got the mySociety WP blog set up might not be how others use theirs.

For example, we're using both categories and tags (yes, WordPress has both). For WDTK we would need to filter by category=whatdotheyknow and then tag=case-studies or whatever.

If you look at other WP blogs - say the AskTheEU one - they only seem to be using categories. And I bet there will be others that just use tags.

Somewhere we're going to have to specify how to get the correct feed for each location where we want blog content to appear. The BLOG_FEED configuration option isn't going to be enough if we want to do this dynamically.

This again makes me lean towards fetching the entire blog feed... oh wait no, the WordPress feed doesn't include tags.

Ugh.

garethrees commented 1 year ago

For WDTK we would need to filter by category=whatdotheyknow and then tag=case-studies or whatever…

We already filter by category by only pulling in posts from "whatdotheyknow":

BLOG_FEED: https://www.mysociety.org/category/projects/whatdotheyknow/feed/

the WordPress feed doesn't include tags

Not pulling in the tags is annoying. There are feeds per tag – e.g. https://www.mysociety.org/tag/components,councils/feed/, https://www.mysociety.org/tag/westminster+councils/feed/ but IDK if that's custom to us?

If it's a standard feature, then maybe we could async fetch it with JavaScript (or fire off a job & cache some results in the controller; use JS to load the cached copy?)

This again makes me lean towards fetching the entire blog feed

An alternative approach would be to just pull out the title & URL of posts, save them in the db, make the model taggable (and while we're at it, prominenceable?) and allow admins to tag up the content?

Screenshot 2023-03-08 at 09 59 48

gbp commented 1 year ago

For WordPress we can do:

category and tag query: https://www.mysociety.org/feed/?category=whatdotheyknow&tag=case-studies
category, tag_1 OR tag_2 query: https://www.mysociety.org/feed/?category=whatdotheyknow&tag=case-studies%2Cgovernment
category, tag_1 AND tag_2 query: https://www.mysociety.org/feed/?category=whatdotheyknow&tag=case-studies%20cycling

These also work with the paged param.

We can also use the pretty URL we already use for the BLOG_FEED config with the tag and paged params, e.g. https://www.mysociety.org/category/projects/whatdotheyknow/feed/?tag=case-studies&paged=2
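A sketch of building those URLs from the configured feed, assuming the `tag` and `paged` query parameters behave as described above. The `tagged_feed_url` name and its options are hypothetical. Note that `URI.encode_www_form` writes the space in an AND query as `+`, which is the form-encoded equivalent of `%20`.

```ruby
require 'uri'

# Append tag (and optionally page) parameters to the configured BLOG_FEED URL.
# match: :any joins tags with "," (OR); match: :all joins with a space (AND).
def tagged_feed_url(feed_url, tags, match: :any, page: nil)
  uri = URI.parse(feed_url)
  params = URI.decode_www_form(uri.query.to_s)
              .reject { |key, _| %w[tag paged].include?(key) }
  params << ['tag', tags.join(match == :all ? ' ' : ',')]
  params << ['paged', page.to_s] if page
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

blog_feed = 'https://www.mysociety.org/category/projects/whatdotheyknow/feed/'
tagged_feed_url(blog_feed, %w[case-studies], page: 2)
```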

gbp commented 1 year ago

Using the non-pretty URLs seems to be the way to go. On some other WP blogs the pretty feed URLs are sometimes redirected to other non-feed pretty URLs - probably depending on how pretty URLs are set up.

We have to use the pretty URLs; the non-pretty URLs do a category OR tag query.