modxcms / revolution

MODX Revolution - Content Management Framework
https://modx.com/
GNU General Public License v2.0
Import HTML regex breaks if newlines in title #12937

Status: Closed (closed by ghost 8 years ago)

ghost commented 8 years ago

Summary

When importing an HTML document, the regex that parses the title tag fails to match if the title contains newlines.

Steps to reproduce

Add some line breaks inside the title tag in an HTML page's head, then import the HTML document.

Observed behavior

The title is not used as the resource's pagetitle; it defaults to the base filename instead.

Expected behavior

It should use the HTML title. To do so, the preg_match call on line 104 of core/model/modx/import/modstaticimport.class.php needs the s modifier, which makes the dot also match newlines:

if (preg_match("/<title>(.*)<\/title>/siU", $file, $matches)) {

With this change the HTML title is used as expected, even when it spans several lines.
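
The difference can be checked in isolation with a minimal sketch. The sample markup in $file is made up for illustration, and the pattern without the s modifier is only an assumption about what the current code looks like; the pattern with s is the one proposed above.

```php
<?php
// Hypothetical sample input: a title that spans several lines.
$file = "<html><head><title>\nMy Page\n</title></head><body></body></html>";

// Assumed current behaviour: without the s modifier the dot stops at
// newlines, so the pattern cannot span the multi-line title.
$withoutS = preg_match("/<title>(.*)<\/title>/iU", $file, $m1);

// Proposed pattern: with the s modifier the dot also matches newlines.
$withS = preg_match("/<title>(.*)<\/title>/siU", $file, $m2);

var_dump($withoutS); // int(0): no match
var_dump($withS);    // int(1): match found
```

Note that the captured group in $m2[1] still contains the newlines around the title text, so the value may need trimming before it is used as a pagetitle.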

Environment

N/A

Jako commented 8 years ago

It would be better to change the regex to trim the whitespace away.

if (preg_match("/<title>\s*(.*)\s*<\/title>/siU", $file, $matches)) {
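
One detail may be worth checking with this pattern: the U modifier makes every quantifier lazy, including the leading \s*, so the leading whitespace can still end up inside the capture while the trailing whitespace is stripped. The sketch below uses a made-up $file value, and the extra trim() call is a suggestion, not necessarily what the actual fix does.

```php
<?php
// Hypothetical sample input with whitespace on both sides of the title text.
$file = "<html><head><title>\n  My Page  \n</title></head></html>";

if (preg_match("/<title>\s*(.*)\s*<\/title>/siU", $file, $matches)) {
    // The lazy leading \s* matches nothing, so the capture keeps the
    // leading newline and spaces; only the trailing whitespace is dropped.
    $raw = $matches[1];

    // Trimming the capture removes whitespace on both sides regardless
    // of how the quantifiers behave.
    $pagetitle = trim($matches[1]);

    var_dump($raw === "\n  My Page"); // bool(true)
    var_dump($pagetitle);             // string(7) "My Page"
}
```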

ghost commented 8 years ago

Whatever works. Regular expressions are not my strong point.

Mark-H commented 8 years ago

Fixed in #12941