humanmade / WordPress-Importer

In-development rewrite of the WordPress (WXR) Importer

Some random error that happened once out of importing 30 files. #54

Open setroot opened 8 years ago

setroot commented 8 years ago

Warning: XMLReader::expand(): /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/uploads/2016/05/something.wordpress.2016-05-17.008.xml_-1.txt:4783: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215

Warning: XMLReader::expand(): /chantalmichel_2002.jpg"><img class="alignnone size-full wp-image-13392" title=" in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215

Warning: XMLReader::expand(): ^ in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215

Warning: XMLReader::expand(): An Error Occurred while expanding in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215

Warning: Invalid argument supplied for foreach() in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 603

swissspidy commented 8 years ago

Could you share the file you tried to import with us by chance if it doesn't contain confidential data?

setroot commented 8 years ago

I installed it, tried to run it, and then the error showed.

Even with the error, I was able to import all data without any problem.

I had to import nearly 20 XML files. I would say that error showed for about 1/3 of them, but it doesn't affect the data being imported.

swissspidy commented 8 years ago

I understand, thanks for the additional information. Still, having an actual test file would make things easier for testing.

Just in case I need it, which WordPress & PHP versions are you running?

setroot commented 8 years ago

WordPress 4.5.2, PHP 5.6

Again, all I did was upload the plugin, activate it, and try to launch it. That's when that error appeared.

Does the plugin keep logs anywhere? I can retrieve those if it does.

franz-josef-kaiser commented 8 years ago

The following error message comes from this call. The $reader is fetched via \WXR_Importer::get_reader( $file ), meaning it's an instance of \XMLReader.

Warning: XMLReader::expand(): /…/example.xml_-1.txt:4783: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /…/WordPress-Importer-master/class-wxr-importer.php on line 215

Looking at the reported bytes, the leading 0x05 is a control character, which is definitely outside the range of characters XML supports.

[…] any Unicode character, excluding the surrogate blocks, FFFE, and FFFF

and

Document authors are encouraged to avoid "compatibility characters" […] characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

In short, this is either a DB related problem or some copy/paste related problem.

The real problem is a missing error check in \WXR_Importer::parse_post_node( $node ). Not only is there no type hinting in the function, so it accepts anything instead of only \DOMNode instances, but there's also no check for whether the return value is a boolean FALSE, which would indicate that there was an error. I'd suggest changing the \WXR_Importer::parse_post_node( $node ) method to the following, to avoid the type hinting and stay in line with WP core crap:

protected function parse_post_node( $node ) {
    $error = '';
    // …
    if ( ! empty( $error ) ) {
        return new \WP_Error( 'wxr_importer.cannot_parse', __( 'Imported XML contained invalid characters', 'wordpress-importer' ) );
    }
}
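
A minimal, standalone sketch of the missing check described above, not the plugin's actual code. The function name `wxr_sketch_read_items()` and the returned array shape are made up for illustration, and a plain array stands in for `\WP_Error` so the example runs without WordPress. It only relies on `XMLReader::expand()` returning `false` on failure and on libxml's internal error buffer.

```php
<?php
/**
 * Hypothetical helper (not the plugin's API): read every <item> from a
 * WXR-like string and report a parse failure instead of handing `false`
 * to later code such as a foreach loop.
 */
function wxr_sketch_read_items( $xml ) {
    $previous = libxml_use_internal_errors( true ); // collect libxml errors quietly
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML( $xml );

    $items  = array();
    $failed = false;

    // "@" silences the PHP-level warning; failures are detected explicitly below.
    while ( @$reader->read() ) {
        if ( XMLReader::ELEMENT !== $reader->nodeType || 'item' !== $reader->name ) {
            continue;
        }

        $node = @$reader->expand();
        if ( ! $node instanceof DOMNode ) {
            // expand() returned false: stop instead of iterating over a non-node.
            $failed = true;
            break;
        }

        $items[] = trim( $node->textContent );
    }

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    $reader->close();

    if ( $failed || ! empty( $errors ) ) {
        $message = empty( $errors ) ? 'expand() failed' : trim( $errors[0]->message );
        return array( 'ok' => false, 'error' => $message, 'items' => $items );
    }

    return array( 'ok' => true, 'error' => null, 'items' => $items );
}

// Assumed sample inputs: one clean document, one containing the 0x05 control byte.
$good = wxr_sketch_read_items( '<rss><item>first</item><item>second</item></rss>' );
$bad  = wxr_sketch_read_items( "<rss><item>broken \x05 text</item></rss>" );
```

Depending on how libxml buffers the input, the failure may surface in `read()` rather than in `expand()`, which is why the sketch checks both the return value and the collected libxml errors.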

And when someone already is taking on this, the following 100+ lines in the switch could drastically be simplified:

$key = $child->tagName;
// Handle special cases:
if ( in_array( $key, [
    'dc:creator',
    'content:encoded',
    // …
], true ) ) {
    // Handle reformatting of key
}

$data[ $key ] = $child->textContent;

Sidenotes:

rmccue commented 8 years ago

And when someone already is taking on this, the following 100+ lines in the switch could drastically be simplified:

FWIW, the performance degradation here was significant last time I checked. Having a switch (and hence, calculated at compile time) improved performance significantly.

franz-josef-kaiser commented 8 years ago

[…] the performance degradation here was significant last time I checked

Point taken. While I do not get why there should be a difference during compile time with a switch, it's by far the most unimportant part of my comment, hence the Sidenote flag.

Any idea on which route to take with invalid characters?

rmccue commented 8 years ago

Haven't had a chance to review the code again yet, just wanted to note that there :) I'll try and take a look in the morning.

franz-josef-kaiser commented 8 years ago

Thinking out loud: Is there an XSD schema link attached to an exported XML? If so, maybe the schema should be set and the parser configured to validate against it using \XMLReader::VALIDATE. Once that is set via \XMLReader::setParserProperty(), \XMLReader::isValid() can be used to check the contents. Another option would be to validate the complete file up front.
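
A rough sketch of validating while reading, under one stated assumption: `XMLReader::VALIDATE` turns on DTD validation, so the example uses a tiny inline DTD rather than an XSD (an XSD would go through `XMLReader::setSchema()` instead). The helper name `wxr_sketch_is_valid()` and the sample documents are invented for illustration. Whether a real WXR export references any schema at all is left open here, as in the comment above.

```php
<?php
/**
 * Hypothetical helper: returns true when $xml parses cleanly and is valid
 * against the DTD it declares inline.
 */
function wxr_sketch_is_valid( $xml ) {
    $previous = libxml_use_internal_errors( true ); // keep validity errors out of the output
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML( $xml );
    // Must be set after the source is attached and before the first read().
    $reader->setParserProperty( XMLReader::VALIDATE, true );

    $valid = true;
    while ( @$reader->read() ) {
        if ( ! $reader->isValid() ) {
            $valid = false;
            break;
        }
    }

    // A fatal parse error ends the loop early, so treat any collected error as invalid too.
    if ( count( libxml_get_errors() ) > 0 ) {
        $valid = false;
    }

    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    $reader->close();

    return $valid;
}

// Assumed toy DTD: an <rss> root holding one or more text-only <item> elements.
$dtd = '<!DOCTYPE rss [<!ELEMENT rss (item+)><!ELEMENT item (#PCDATA)>]>';

$valid_doc   = wxr_sketch_is_valid( $dtd . '<rss><item>ok</item></rss>' );
$invalid_doc = wxr_sketch_is_valid( $dtd . '<rss><post>not declared</post></rss>' );
```

Because validity is checked node by node during the normal read loop, this variant does not need a separate pass over the file.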

rmccue commented 8 years ago

Parsing the schema sounds potentially expensive, as it'd require a full runthrough of the file for validation. We should be able to pick this up during the preliminary stage, I'd think; my suspicion is that the reader is in a lax parsing mode right now.

franz-josef-kaiser commented 8 years ago

as it'd require a full run-through of the file for validation

Maybe that should be just a separate step up front, saving both the user and the parser some time.

Also, @rmccue have you noticed the file extension? something.wordpress.2016-05-17.008.xml_-1.txt. Not sure whether this is a temp file or really a .txt file that @whosjose has actually fiddled with in a Windows text editor…

rmccue commented 8 years ago

The UI already does a preliminary parsing stage for this, so that'd require 3 runthroughs of the file. I think that's a little too expensive, but willing to be proven wrong if you can benchmark this with a largish file (my test is ~30MB of XML). :)

franz-josef-kaiser commented 8 years ago

[…] if you can benchmark this with a largish file

Sorry, but my help here can be taken as neighbourhood help. I am not even using the plugin and I really do not intend to do so. Building up test cases and running benchmarks is far beyond what I am able and willing to invest here. Hope you understand my reasoning and can live with what I am able to offer to you guys :)

setroot commented 8 years ago

@franz-josef-kaiser I didn't fiddle with the file at all. I've confirmed that the file name does indeed end with .xml.

simos commented 8 years ago

The error message claims

 parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215

however, the bytes 0x32 0x2E 0x30 correspond to the valid string "2.0" (without quotes), so only the leading 0x05, a control character, could be the offending byte.
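
To check the reported bytes one by one, here is a small sketch that tests each value against the XML 1.0 `Char` production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The function name `wxr_sketch_is_xml10_char()` is made up for this example.

```php
<?php
/**
 * Hypothetical helper: true if the code point is allowed by the
 * XML 1.0 "Char" production.
 */
function wxr_sketch_is_xml10_char( $cp ) {
    return 0x9 === $cp
        || 0xA === $cp
        || 0xD === $cp
        || ( $cp >= 0x20 && $cp <= 0xD7FF )
        || ( $cp >= 0xE000 && $cp <= 0xFFFD )
        || ( $cp >= 0x10000 && $cp <= 0x10FFFF );
}

// The four bytes quoted in the warning.
$bytes = array( 0x05, 0x32, 0x2E, 0x30 );

$report = array();
foreach ( $bytes as $byte ) {
    $printable = ( $byte >= 0x20 && $byte < 0x7F ) ? chr( $byte ) : '(control)';
    $report[]  = sprintf(
        '0x%02X %s %s',
        $byte,
        $printable,
        wxr_sketch_is_xml10_char( $byte ) ? 'allowed' : 'not allowed'
    );
}

echo implode( "\n", $report ), "\n";
```

Only the first byte fails the check; the remaining three are the ASCII characters "2", "." and "0".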

simos commented 8 years ago

I got the same error with Ubuntu 16.04 (PHP7), failing to import the first post with an encoding error.

I then tried with Ubuntu 14.04 (PHP5) and the import (a big one) completed just fine.

I suppose these encoding errors relate to the different versions of the server software, probably PHP7.

Tailzip commented 7 years ago

Same issue with Ubuntu 16.04 (PHP 7.0). Downgrading my PHP version also did the trick! (Ubuntu 14.04 + PHP 5.5)