Option to pull social images from content

sybrew / the-seo-framework

The SEO Framework WordPress plugin.

https://theseoframework.com/

GNU General Public License v3.0

420 stars 47 forks source link

Option to pull social images from content #268

Closed sybrew closed 5 years ago

sybrew commented 6 years ago

From marchino65's concern:

Images aren't pulled from the content. This is the default and intended behavior.

We could pull images from the content (WordPress has functions for it), but we only want to implement this optionally.

I'm not sure if Twitter supports this behavior. Facebook (through Open Graph) certainly does.

sybrew commented 6 years ago

Proposed method:

/**
 * Returns the social image URL from the post content.
 *
 * @since 3.1.0
 *
 * @param int $id The post ID. Required.
 * @param array $args The image args.
 * @param bool $set_og_dimensions Whether to set Open Graph image dimensions.
 * @return string The attachment URL.
 */
public function get_social_image_url_from_post_content( $id, $args = [], $set_og_dimensions = false ) {

    $images = \get_attached_media( 'image', $id ) ?: [];
    $_id = key( $images );
    $src = '';

    if ( isset( $images[ $_id ]->ID ) ) {

        $args = $this->reparse_image_args( $args );
        $args['get_the_real_ID'] = true;

        $src = $this->parse_og_image( $images[ $_id ]->ID, $args, $set_og_dimensions );

        if ( $src && $this->matches_this_domain( $src ) )
            $src = $this->set_preferred_url_scheme( $src );
    }

    return $src;
}

Now, get_attached_media() seriously harms performance. So I'm hesitant on adding it like this.

It'd be better scraping the images from content via regex, where we obtain the first image (or more, in the future). This is at least a thousand times faster, as we don't have to fetch (all) the attachment posts from the database.

With this, we must be careful about pulling smilies.

lifeforceinst commented 6 years ago

You also need to consider that the URL from a post could be absolute or relative URL - remember the golden rule of programming is not to assume.

Here is an example of how I retrieved images in one of my themes, which determines if there is a featured image, if not, then tried to extract image from post using preg_match and if no image can then apply a default image (such as the fallback image if set).

 $thumbnail    = wp_get_attachment_image_src (get_post_thumbnail_id( $post->ID), 'thumbnail');
   if (empty($thumbnail)) {
    // There is no featured image, so check for embedded image
    $theimage = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches);
    $firstimg = $matches[1][0];
    if (empty($firstimg)) {
      $thumbnail = array ('0' => $the_fallback_image);
    } else {
      // Need to handle relative or absolute image URI, by using parse_url to get the image path
     // Then add this to the home URI
      $fileimg  = parse_url ($firstimg, PHP_URL_PATH);
      $firsturl = home_url() . $fileimg;
      $thumbnail = array ('0' => $firsturl);
    }

}

sybrew commented 6 years ago

I think that the golden rule is to assume this: There are users out there doing weird stuff 😅 It's what's called defensive programming, and it's applied throughout the plugin.

Thank you for the regex and the example 👍 It's in the direction of what I had in mind!

Now, the XML standard allows omission of the '" tags, so this is a perfectly valid img tag: <img src=example.com/image.jpg />

I won't add this enhancement/feature to v3.1, so you might find the following snippet useful.

It'll only pass URLs that are matching the same domain (or are relative). Now, it won't detect broken HTML (missing closing quotes or having extraneous spaces in the URL)--it's not a DOM parser, after all.

Nevertheless, here it is:

add_filter( 'the_seo_framework_og_image_after_featured', function( $image, $post_id ) {
    // Someone else already supplied an image in this filter.
    if ( $image ) return $image;

    $post = get_post( $post_id );
    if ( isset( $post->post_content ) ) {
        preg_match_all(
            '/<img[^>]+src=(\"|\')?([^\"\'>\s]+)\1?.*?>/mi',
            $post->post_content,
            $matches,
            PREG_SET_ORDER
        );

        $tsf = the_seo_framework();
        foreach ( $matches as $match ) {
            if ( empty( $match[2] ) ) continue;

            if ( $tsf->matches_this_domain( $match[2] ) ) {
                $image = $tsf->set_url_scheme( $match[2] );
                break;
            } elseif ( 0 === strpos( $match[2], '/' ) && 0 !== strpos( $match[2], '//' ) ) {
                $url   = trailingslashit( $tsf->get_home_host() ) . ltrim( $match[2], ' /' );
                $image = $tsf->set_url_scheme( $url );
                break;
            }
        }
    }

    return $image;
}, 10, 2 );

I tested it, and it works great. It adds 0.00009s±~200% of load time, which is negligible for a feature like this. The prior proposed function added roughly 0.003s of load time, which is twice the plugin's runtime.

Let me know if this works for you 😄, and I'll implement it for v3.2.

lifeforceinst commented 6 years ago

Hi thanks for the suggestion, I have two issues though:

the_seo_framework_og_image_after_featured does not fire if SEO fallback image is set. I have spent a few hours trying to get this to work and found that in no circumstance would the filter fire. I then searched through the SEO framework code to find the use of the_seo_framework_og_image_after_featured.

This only appears to be used in /autodescription/inc/classes/generate-image.class.php, but when looking through this code there seems to be huge amounts of the code which is commented out which seem to be caused by some statements starting with //* possibly instead of // (for which there were 36 instances) . Is this correct as this is what is in 3.0.6? Then after doing a lot of debugging, I found that fallback_1 never fires if the SEO fallback image is set. This means if you have a fallback image configured (which should be the last resort), then this takes precedence over any options later on the list. Mu recommendation at a mimimum would be to swap Fetch 4 & 5 (as example below). Doing so means that the_seo_framework_og_image_after_featured will be processed even if there is a fallback SEO image defined.

The logic did not pick up a relative image URL. The logic you have included works in most cases, but I found that it did not work when a relative URL was included which did not include the protocol portion. As an example if the post contained the text , then the preg_match_all returns no $matches. Then even if I quickly changed the pre_match_all to something quick and dirty, then following parts of the logic do not seem to work. Thanks again for the collaboration. If you want to conduct a more direct email then you can use [sybrew: moderated email address].

[sybrew: moderated email footer] [sybrew: removed unintentional profile mentions] [sybrew: removed unreadable code copies, use tag/blob links instead]

sybrew commented 6 years ago

Hi @lifeforceinst,

Could you use GitHub in the browser to reply to issues? Your email client doesn't seem to support HTML (and in extend Markdown) and has odd breaking points, which are making your replies unreadable.

The filter's meant for testing purposes. The placement of this feature is to be determined, and it'll be before the fallback filter fires. Thanks for highlighting that 👍

There's no commented out code. It's PHPDOC, which helps me (and others) understand the code better in the future.

The code doesn't pick up on <img src="/uploads/whiteblue-150x150.png"> because it's not valid XHTML, it should have a closing tag: <img src="/uploads/whiteblue-150x150.png" />

However, since we're in the HTML5 era, the closing tag and space may be omitted. Thanks for looking into that 😄 I'll update the snippet to support HTML5, too.

sybrew commented 6 years ago

Updated the regex.

From:

<img.*?\bsrc=(\'|\")?(.*?)(\1|\s).*?\/>

To:

<img[^>]+src=(\"|\')?([^\"\'>\s]+)\1?.*?>

Now, the PHP method will always look for a domain or a slash at the start of each match. Because WordPress doesn't upload files where posts are located, this is fine.

lifeforceinst commented 6 years ago

Thanks for the advice about the email client,i will instead add directly here.

_The new preg_matchall rules works like a charm, I created about 200 different combinations of embedded images and it worked for all permutations. So yes using preg_match_all ('/<img[^>]+src=(\"|\')?([^\">\s]+)\1?.*?>/mi', $post->post_content, $matches, PREG_SET_ORDER); is a good way to go with trying to deal with all the variations of supporting new HTML5 tags and also dealing with things which may be legacy HTML.

Can you confirm that the file /autodescription/inc/classes/generate-image.class.php is meant to have //* instead of // as the 36 instances of this seemed to result in a large amount of code being unused?

Would you be looking to change the order of the fallback filter firing so that fallback_1 fires as match 4 ahead of 5 Fetch image from SEO settings. Example follows

Last item I am looking at the final logic elseif ( 0 === strpos( $match[2], '/' ) && 0 !== strpos( $match[2], '//' ) ) { this does not seem to be working root relative URLs, if I shortened this to just elseif ( 0 === strpos( $match[2], '/' ) ) { then that worked.

lifeforceinst commented 6 years ago

FYI here is the latest filter which I am using and which is working once the fallback firing order is addressed.

// Adds a filter which places og metadata for non-features images.
add_filter ('the_seo_framework_og_image_after_featured', function ($image, $post_id) {
  if ($image) { return $image; }
  // No image is set, so try to find one in the post content.
  $post = get_post ($post_id);
  if ( isset ($post->post_content) ) {
    preg_match_all ('/<img[^>]+src=(\"|\')?([^\">\s]+)\1?.*?>/mi', $post->post_content, $matches, PREG_SET_ORDER);
    $tsf = the_seo_framework();
    foreach ($matches as $match) {
      if ( empty( $match[2] ) ) continue;
      if ( $tsf->matches_this_domain( $match[2] ) ) {
        $image = $tsf->set_url_scheme( $match[2] );
        break;
      } elseif ( 0 === strpos( $match[2], '/' ) ) {
        $url   = trailingslashit( $tsf->get_home_host() ) . ltrim( $match[2], ' /' );
        $image = $tsf->set_url_scheme( $url );
        break;
        }
      }
/* Alternative code for matching images.
    foreach ($matches as $postimg) {
      if ( empty ($postimg [2]) ) continue;
      $image = home_url() . parse_url ($postimg [2], PHP_URL_PATH);
      break;
    }
*/
  }
//  echo '<pre>THE Image is: ' . $image . '</pre>';
  return $image;
}, 10, 2 );

sybrew commented 6 years ago

I'm glad everything's in working (intended) order 😄

The root relative URLs are ignored, and no condition is necessary to be added. I commented on this at the end of https://github.com/sybrew/the-seo-framework/issues/268#issuecomment-410283832

To explain: When you add an image to the root location of a post, the post won't show up, as WP_Rewrite can't reach it.

Off topic:

Regarding the filters:
1. I'm not going to change the order of the filters. It might destroy someone's site.
2. I'm not planning on adding new filters, either. In fact, I'm moving towards removing API reliance: everything should be possible via options or extensions. My goal is that there's no longer a need to fiddle with code.

Whenever a line starts with # or //, it's an inline comment, regardless of the appended characters. I use various forms of documenting code:

// general inline comment
//* comment relating to the code below. Used as an alternative to a codeblock: /* ... */
//! caution/should-change/planned-change: probable bug, difficult code that looks like a bug, or inconsistency ahead
//? explanatory: this is odd, this is why we did this anyway...

p.s. I hid your latest comment because it was a copy of my snippet, with extra debugging output added. This will confuse users that are stumbling upon it.

bbbjames commented 2 years ago

👋 Hello

Not sure this is the right place to post; I'm new to the seo framework so please forgit me... (legit I said forgit).

I can use > get_the_post_thumbnail_url( get_the_ID(), 'original' );

Is there such a function I can use for the social url?

I see things like this thread, and; $data[ $g ] = current( $tsf->get_generated_image_details( $_generator_args, true, 'social', true ) )['url'] ?? '';

I havent yet found a simple function to pull the social image url.

Does it exist? Please tell me. Thank you kindly. Bob 👍

sybrew commented 2 years ago

Hi @init-bobJames 👋,

I'm rummaging my backlog and found your comment. Sorry for having gitnored it.

I havent yet found a simple function to pull the social image URL.

That's the simplest method -- but I can simplify it further for you:

$image = tsf()->get_image_details_from_cache( true )['url'] ?? '';

Cheers :)

If you wish to receive support more easily, please see https://tsf.fyi/support. I normally do not handle support on GitHub and let comments just idle for future reference.