Closed techexo closed 5 years ago
Actually, someone might want to check that it is not a very bad idea!
In the present form, it will also delete <script src="..."></script>
instructions in header (but I think it is stripped anyway by php-readability). As far as I know, other instructions in header are most of the time "mono-markup" (e.g. <meta .../>
, <link ... />
)
It seems that my code breaks quite a lot of tests, I will have a look and propose a modification if I can find what went wrong.
First rebase against the master. I might have fixed some of failing tests (which might not be related to your changes)
I rebased, but it was a me issue, not a test issue ;)
I was allowing for >
in extra characters, so consecutive markups were squashed together... It should be OK now with the new commit!
Edit: not that this would also work
$re = '/<(?!iframe)([^>\s]+)[^>]*>(?:\s*(?:<br \/>| | | | | | | )\s*)*<\/\1>/m';
@j0k3r I analyzed a bit more the original regex, and I am pretty confident that it doesn't discriminate at all the iframe
with the group exclusion [^iframe|>]
, but with the fact that iframes markups always have a src=""
following.
I have commited the simpler version mentionned above that allows for the deletion of all empty nodes except iframe. The only side-effect I can see is due to the deletion of script markups, that are in any case deleted by php-readability, so that shouldn't be an issue.
Travis tests not passing because of a deprecation notice with php7.1, not because of the commits.
I propose this regex modification to actually clean empty nodes before using php-readability and/or Tidy. This regex specifically exclude the first group to be an
iframe
(which was the intention according to the comment), but allows the deletion of empty attribute-containing nodes.With this modification, will be detected and deleted the following examples:
but not
You can test it here: https://regex101.com/r/uWjsOc/1
It fixes https://github.com/wallabag/wallabag/issues/3730 and make https://github.com/j0k3r/php-readability/issues/39 closable (as I had no proof it was a php-readability issue).