j0k3r / graby

Graby helps you extract article content from web pages
MIT License
367 stars 74 forks source link

Update Graby.php (regex modification) #178

Closed techexo closed 5 years ago

techexo commented 5 years ago

I propose this regex modification to actually clean empty nodes before using php-readability and/or Tidy. This regex specifically exclude the first group to be an iframe (which was the intention according to the comment), but allows the deletion of empty attribute-containing nodes.

With this modification, will be detected and deleted the following examples:

<h2></h2>
<p></p>
<h2 class="align"></h2>
<head></head>

but not

<iframe></iframe>
<iframe src="xxx.html"></iframe>

You can test it here: https://regex101.com/r/uWjsOc/1

It fixes https://github.com/wallabag/wallabag/issues/3730 and make https://github.com/j0k3r/php-readability/issues/39 closable (as I had no proof it was a php-readability issue).

coveralls commented 5 years ago

Coverage Status

Coverage remained the same at 98.627% when pulling 4e711d2e09421d7e90b5a3d28ac1330f6abf0fe3 on techexo:patch-1 into 907e1f774dfec5f22f2b6130f2dce80531df58da on j0k3r:master.

techexo commented 5 years ago

Actually, someone might want to check that it is not a very bad idea! In the present form, it will also delete <script src="..."></script> instructions in header (but I think it is stripped anyway by php-readability). As far as I know, other instructions in header are most of the time "mono-markup" (e.g. <meta .../>, <link ... />)

techexo commented 5 years ago

It seems that my code breaks quite a lot of tests, I will have a look and propose a modification if I can find what went wrong.

j0k3r commented 5 years ago

First rebase against the master. I might have fixed some of failing tests (which might not be related to your changes)

techexo commented 5 years ago

I rebased, but it was a me issue, not a test issue ;) I was allowing for > in extra characters, so consecutive markups were squashed together... It should be OK now with the new commit!

Edit: not that this would also work

$re = '/<(?!iframe)([^>\s]+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>/m';
techexo commented 5 years ago

@j0k3r I analyzed a bit more the original regex, and I am pretty confident that it doesn't discriminate at all the iframe with the group exclusion [^iframe|>], but with the fact that iframes markups always have a src="" following. I have commited the simpler version mentionned above that allows for the deletion of all empty nodes except iframe. The only side-effect I can see is due to the deletion of script markups, that are in any case deleted by php-readability, so that shouldn't be an issue.

techexo commented 5 years ago

Travis tests not passing because of a deprecation notice with php7.1, not because of the commits.