podlove / podlove-publisher

Podlove Podcast Publisher for WordPress
https://wordpress.org/plugins/podlove-podcasting-plugin-for-wordpress/
MIT License

Errors extracting shownotes from HTML #1149

Open gpapp opened 4 years ago

gpapp commented 4 years ago

Multiple errors occur when trying to process longer articles like: https://podcast.itworks.hu/jesus-caesar-napoleon-a-zartosztalyon/

Expected behavior

All links in the article are added to the shownotes, with each link's text used as its description.

Actual behavior

Processing stops at the "more" tag (<!--more-->). It looks like the excerpt, not the full text, is used for processing. If the tag is removed, the links are extracted from the text, but the UTF-8 encoded strings are treated as 8-bit strings, which turns all non-ASCII (local) characters into garbage.

System information (see Podlove > Support menu)

Website                    https://podcast.itworks.hu
PHP Version                7.3.22
WordPress Version          5.5.1
WordPress Theme            myPortfolio v1.0.6
Active Plugins             
           - Akismet Anti-Spam v4.1.6
           - Download Free Images v1.3.0
           - Site Kit by Google v1.16.0
           - Jetpack by WordPress.com v8.9.1
           - Limit Login Attempts Reloaded v2.15.2
           - Podlove Podcast Publisher v3.0.4
           - Podlove Web Player v5.2.9
           - NextScripts: Social Networks Auto-Poster v4.3.18
           - Pexels: Free Stock Photos v1.2.2
WordPress Database Charset utf8
WordPress Database Collate 
Publisher Version          3.0.4
Web Player Version         player_v5
Twig Version               2.12.5
Monolog Version            1
open_basedir               /var/www/vhosts/itworks.hu/:/tmp/
curl Version               7.47.0
iconv                      available
simplexml                  ok
max_execution_time         30
upload_max_filesize        2M
memory_limit               256M
disable_classes            
disable_functions          eval, exec, passthru, shell_exec, system, proc_open, popen, curl_multi_exec, parse_ini_file, show_source
permalinks                 ok (/%postname%/)
podlove_permalinks         ok
podcast_settings           ok
web_player                 ok
podlove_cache              on
assets                     
  - mp3    audio/mpeg       https://podcast.itworks.hu/feed/mp3/
  - ogg    audio/ogg        https://podcast.itworks.hu/feed/ogg/
  - chapters.txt  text/plain  no feed
cron                       ok
duplicate_guids            ok

0 errors
1 NOTICE (no dealbreaker, but should be fixed if possible): 
- The PHP setting "open_basedir" is not empty. This is incompatible with curl, a library required by Podlove Publisher. We have a workaround in place but it is preferred to fix the issue. Please ask your hoster to unset "open_basedir".

gpapp commented 3 years ago

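To get the full content of the post, it has to be read from the post object returned by get_post() rather than through get_the_content(). To fix the character encoding issues, the HTML handed to the parser needs a declaration of the encoding used (a meta tag or, as in the hack below, an XML declaration).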

Instead of this line in the import_html($request) function in rest_api.php:

                $html = $request['html'] ?? get_the_content(null, false, $post_id);

this dirty hack can be used:

        $query = get_post($post_id); 
        $html = '<?xml encoding="utf-8" ?>'.apply_filters('the_content', $query->post_content);

N.B. This hack knowingly does NOT care about the encoding of your site and forces UTF-8.
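
The encoding part of the hack can be tried outside WordPress. The following is a minimal sketch, assuming the shownotes import parses the HTML with PHP's DOMDocument (suggested by the libxml messages, but not confirmed here). The sample markup and the link_texts() helper are made up for illustration.

```php
<?php
// Hypothetical sample input: one link whose text contains non-ASCII characters.
$html = '<p><a href="https://example.com/">Zártosztály</a></p>';

// Hypothetical helper: parse an HTML fragment and return the text of every link.
function link_texts(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $texts[] = $a->textContent;
    }
    return $texts;
}

// Without an encoding hint the parser may assume a single-byte encoding,
// which is what garbles the accented characters.
$without = link_texts($html);

// With the XML declaration prepended, the input is read as UTF-8.
$with = link_texts('<?xml encoding="utf-8" ?>' . $html);

var_dump($without, $with);
```

Comparing the two dumps on the affected server should show whether the missing encoding hint is the cause. The result without the hint can differ between libxml versions.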

gpapp commented 3 years ago

The encoding issue is fixed in v3.5.0, but the original issue remains: "more" tags (<!--more-->) break the retrieval of the post content.

There are also bogus error messages caused by the libxml validation. They can be avoided with the following, as suggested here:

// fix html5/svg errors
libxml_use_internal_errors(true);
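
For context, a minimal sketch of how that call fits around the parsing step, assuming the HTML is parsed with DOMDocument. The sample markup and the url/title array keys are illustrative only and are not taken from the plugin.

```php
<?php
// Hypothetical sample input with HTML5 and SVG elements that the HTML4-era
// libxml parser may report as invalid tags.
$html = '<article><p><a href="https://example.com/a">First</a></p>'
      . '<svg><path d="M0 0"/></svg></article>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Inspect the collected errors if needed, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling mode.
libxml_use_internal_errors($previous);

// The links are still extracted even if the parser complained.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = ['url' => $a->getAttribute('href'), 'title' => $a->textContent];
}

var_dump(count($errors), $links);
```

Restoring the previous mode afterwards keeps the change local to this one parsing step.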

The entire post cannot be fetched with

$html = $request['html'] ?? get_the_content(null, false, $post_id);

if it contains the "more" tag. Since the post is already loaded in the $episode variable, it is better to use:

$html = $request['html'] ?? apply_filters('the_content', $episode->post()->post_content);
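
Why the raw post_content matters can be illustrated without WordPress. The sketch below is a simplified simulation, not WordPress code: teaser_only() and count_links() are hypothetical helpers, and cutting the content at the first <!--more--> marker is an assumption about the effective behaviour described in this issue.

```php
<?php
// Hypothetical stand-in for teaser-only output: keep what precedes the more tag.
function teaser_only(string $content): string
{
    $parts = preg_split('/<!--more(.*?)?-->/', $content, 2);
    return $parts[0];
}

// Hypothetical helper: count anchor tags that carry an href attribute.
function count_links(string $html): int
{
    return preg_match_all('/<a\s[^>]*href=/i', $html);
}

// Made-up post body with one link before and one after the more tag.
$post_content = '<p><a href="https://example.com/1">One</a></p>'
    . '<!--more-->'
    . '<p><a href="https://example.com/2">Two</a></p>';

$teaser_links = count_links(teaser_only($post_content)); // link before the tag only
$full_links   = count_links($post_content);              // links of the whole post

var_dump($teaser_links, $full_links);
```

Any link placed after the marker is lost when only the teaser is processed, which matches the behaviour reported above.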

gpapp commented 2 years ago

This bug still persists in 3.8.1. Quite sad, because otherwise this is an excellent feature!