Open setroot opened 8 years ago
Could you share the file you tried to import with us by chance if it doesn't contain confidential data?
I installed it and then tried to run it, and that's when the error showed up.
Even with the error, I was able to import all data without any problem.
I had to import nearly 20 XML files. I would say the error showed for about 1/3 of them, but it doesn't affect the data being imported.
I understand, thanks for the additional information. Still, having an actual test file would make things easier for testing.
Just in case I need it, which WordPress & PHP versions are you running?
WordPress 4.5.2, PHP 5.6
Again, all I did was upload the plugin, activate it, and try to launch it. That's when that error appeared.
Does the plugin keep logs anywhere? I can retrieve those if it does.
The following error message comes from this call. The $reader is fetched via \WXR_Importer::get_reader( $file ), meaning it's an instance of \XMLReader.
Warning: XMLReader::expand(): /…/example.xml_-1.txt:4783: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /…/WordPress-Importer-master/class-wxr-importer.php on line 215
Looking at the range of characters, those are definitely outside the XML supported scope and range.
[…] any Unicode character, excluding the surrogate blocks, FFFE, and FFFF
and
Document authors are encouraged to avoid "compatibility characters" […] characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
In short, this is either a DB related problem or some copy/paste related problem.
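The character range quoted above can be checked mechanically before an import. A minimal sketch, assuming the export is small enough to load as a string; `sketch_find_invalid_xml_chars()` is a hypothetical helper, not part of the importer, and the character class follows the XML 1.0 `Char` production:

```php
<?php
/**
 * Hypothetical helper: list characters that fall outside the XML 1.0 Char
 * production (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF).
 *
 * @param string $text Raw export contents.
 * @return array|false Map of byte offset => hex of the offending character,
 *                     or false if $text is not valid UTF-8 at all.
 */
function sketch_find_invalid_xml_chars( $text ) {
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // In /u mode PCRE refuses malformed UTF-8 and preg_match_all() returns false.
    $count = preg_match_all( $pattern, $text, $matches, PREG_OFFSET_CAPTURE );
    if ( false === $count ) {
        return false;
    }

    $found = array();
    foreach ( $matches[0] as $match ) {
        list( $char, $offset ) = $match;
        $found[ $offset ] = strtoupper( bin2hex( $char ) );
    }
    return $found;
}

// A 0x05 byte placed in front of "2.0", as in the reported error.
$report = sketch_find_invalid_xml_chars( "version=\x052.0" );
```

Run over the file contents, a non-empty result would point at the byte offsets that need cleaning in the database or in the export itself.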
The real problem is a missing error check in \WXR_Importer::parse_post_node( $node ). Not only is there no typehinting in the function, so it takes everything instead of \DOMNode instances, but there's no check if the return value is a boolean FALSE to indicate that there was an error. I'd suggest changing the \WXR_Importer::parse_post_node( $node ) method to the following to avoid the typehinting and stay in line with WP core crap:
protected function parse_post_node( $node ) {
    $error = '';
    // …
    if ( ! empty( $error ) ) {
        return new \WP_Error( 'wxr_importer.cannot_parse', __( 'Imported XML contained invalid characters', 'wordpress-importer' ) );
    }
}
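To make the missing check concrete, here is a minimal sketch of a guard against the boolean FALSE that \XMLReader::expand() returns on failure. It is written to run outside WordPress, so `Sketch_Error` is a hypothetical stand-in for \WP_Error and `sketch_parse_post_node()` is not the plugin's method; the `instanceof` test takes the place of a typehint.

```php
<?php
// Hypothetical stand-in for WP_Error so the sketch runs without WordPress.
final class Sketch_Error {
    public $code;
    public $message;

    public function __construct( $code, $message ) {
        $this->code    = $code;
        $this->message = $message;
    }
}

/**
 * Reject anything that is not a DOMNode, including the boolean false
 * that XMLReader::expand() hands back when it cannot expand a node.
 */
function sketch_parse_post_node( $node ) {
    if ( ! $node instanceof DOMNode ) {
        return new Sketch_Error(
            'wxr_importer.cannot_parse',
            'Imported XML contained invalid characters'
        );
    }

    // … the existing parsing would continue here; the sketch only reports the tag.
    return array( 'tag' => $node->nodeName );
}

$doc = new DOMDocument();
$doc->loadXML( '<item><title>Hello</title></item>' );

$ok     = sketch_parse_post_node( $doc->documentElement ); // a real node
$failed = sketch_parse_post_node( false );                 // what a failed expand() yields
```

With a guard like this in place, the caller never reaches a foreach over a non-node value.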
And when someone already is taking on this, the following 100+ lines in the switch could drastically be simplified:
$key = $child->tagName;
// Handle special cases:
if ( in_array( $key, [
    'dc:creator',
    'content:encoded',
    // …
] ) ) {
    // Handle reformatting of key
}
$data[ $key ] = $child->textContent;
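The special-case handling above could be driven by a lookup table. This is only a sketch: `sketch_collect_children()` and the target keys `post_author` and `post_content` are assumptions chosen for illustration, not the importer's actual mapping.

```php
<?php
/**
 * Hypothetical helper: copy the text of each child element into an array,
 * renaming the tags listed in $special.
 */
function sketch_collect_children( DOMNode $node ) {
    // Assumed mapping from tag name to result key; extend as needed.
    $special = array(
        'dc:creator'      => 'post_author',
        'content:encoded' => 'post_content',
    );

    $data = array();
    foreach ( $node->childNodes as $child ) {
        if ( XML_ELEMENT_NODE !== $child->nodeType ) {
            continue; // skip whitespace and other non-element nodes
        }
        $key = $child->tagName;
        if ( isset( $special[ $key ] ) ) {
            $key = $special[ $key ];
        }
        $data[ $key ] = $child->textContent;
    }
    return $data;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/"'
    . ' xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    . '<title>Hello</title>'
    . '<dc:creator>admin</dc:creator>'
    . '<content:encoded>Body</content:encoded>'
    . '</item>'
);
$item = sketch_collect_children( $doc->documentElement );
```

The sketch makes no claim about how this compares to a switch in speed; it only shows the shape of the simplification.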
Sidenotes:
And when someone already is taking on this, the following 100+ lines in the switch could drastically be simplified:
FWIW, the performance degradation here was significant last time I checked. Having a switch (and hence, calculated at compile time) improved performance significantly.
[…] the performance degradation here was significant last time I checked
Point taken. While I do not get why there should be a difference during compile time with a switch, it's by far the most unimportant part of my comment, hence the Sidenote flag.
Any idea on which route to take with invalid characters?
Haven't had a chance to review the code again yet, just wanted to note that there :) I'll try and take a look in the morning.
Thinking out loud: Is there an XSD schema link attached to an exported XML? If yes, maybe the schema should get set and the parser can be set to validate against it using \XMLReader::VALIDATE. Then \XMLReader::setParserProperty() can be set and \XMLReader::isValid() can be used to check the contents. Another option would be to validate the complete file up front.
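A rough sketch of what streaming validation could look like. Note that `setParserProperty( XMLReader::VALIDATE, true )` switches on DTD validation; for an XSD the corresponding call is `XMLReader::setSchema()`, which is what the sketch uses. The tiny schema and the helper name `sketch_validates_against_xsd()` are assumptions for illustration; whether exports actually reference a schema is the open question above.

```php
<?php
/**
 * Hypothetical helper: stream $xml through XMLReader with an XSD attached
 * and report whether it parsed and validated cleanly.
 */
function sketch_validates_against_xsd( $xml, $xsd_path ) {
    $previous = libxml_use_internal_errors( true ); // collect errors instead of warnings
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML( $xml );
    $valid = $reader->setSchema( $xsd_path ); // has to happen before the first read()

    while ( $valid && @$reader->read() ) {
        if ( ! $reader->isValid() ) {
            $valid = false;
        }
    }
    // read() also returns false on a parse error, so consult the error buffer too.
    if ( count( libxml_get_errors() ) > 0 ) {
        $valid = false;
    }

    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    return $valid;
}

// Assumed toy schema: an <rss> element containing exactly one <title>.
$xsd_path = tempnam( sys_get_temp_dir(), 'xsd' );
file_put_contents(
    $xsd_path,
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    . '<xs:element name="rss"><xs:complexType><xs:sequence>'
    . '<xs:element name="title" type="xs:string"/>'
    . '</xs:sequence></xs:complexType></xs:element>'
    . '</xs:schema>'
);

$good = sketch_validates_against_xsd( '<rss><title>x</title></rss>', $xsd_path );
$bad  = sketch_validates_against_xsd( '<rss><bogus/></rss>', $xsd_path );
unlink( $xsd_path );
```

Because the reader validates while it streams, this costs one pass over the file rather than loading it into memory first.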
Parsing the schema sounds potentially expensive, as it'd require a full runthrough of the file for validation. We should be able to pick this up during the preliminary stage, I'd think; my suspicion is that the reader is in a lax parsing mode right now.
as it'd require a full run-through of the file for validation
Maybe that should just be a separate step up front, saving both the user and the parser some time.
Also, @rmccue have you noticed the file extension? something.wordpress.2016-05-17.008.xml_-1.txt. Not sure if this is a temp file or really a txt file and @whosjose actually has fiddled with it in a Windows text editor…
The UI already does a preliminary parsing stage for this, so that'd require 3 runthroughs of the file. I think that's a little too expensive, but willing to be proven wrong if you can benchmark this with a largish file (my test is ~30MB of XML). :)
[…] if you can benchmark this with a largish file
Sorry, but my help here can be taken as neighbourhood help. I am not even using the plugin and I really do not intend to do so. Building up test cases and running benchmarks is far beyond what I am able and willing to invest here. Hope you understand my reasoning and can live with what I am able to offer to you guys :)
@franz-josef-kaiser I didn't fiddle with the file at all. I've confirmed that the file name does indeed end with .xml.
The error message claims
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215
however the bytes 0x05 0x32 0x2E 0x30 correspond to a valid string "2.0" (without quotes).
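To check this claim, the four reported bytes can be decoded directly. A minimal sketch using only core PHP functions; the variable names are illustrative:

```php
<?php
// The four bytes reported by the parser, as a binary string.
$bytes = hex2bin( '05322E30' );

$first_byte = ord( $bytes[0] );     // numeric value of the leading byte
$rest       = substr( $bytes, 1 );  // the remaining three bytes

// An empty pattern in /u mode matches only if the subject is well-formed UTF-8.
$is_valid_utf8 = ( 1 === preg_match( '//u', $bytes ) );

// Control characters other than tab, LF and CR are not allowed in XML 1.0.
$is_forbidden_control = $first_byte < 0x20
    && ! in_array( $first_byte, array( 0x09, 0x0A, 0x0D ), true );
```

Only the last three bytes spell "2.0". The first one, 0x05, is a control character: the sequence is well-formed UTF-8, but 0x05 is not a legal character in XML 1.0, which would explain why the parser rejects it despite the wording of the message.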
I got the same error with Ubuntu 16.04 (PHP7), failing to import the first post with an encoding error.
I then tried with Ubuntu 14.04 (PHP5) and the import (a big one) completed just fine.
I suppose these encoding errors relate to the different versions of the server software, probably PHP7.
Same issue with Ubuntu 16.04 (PHP 7.0). Downgrading my PHP version also did the trick! (Ubuntu 14.04 + PHP 5.5)
Warning: XMLReader::expand(): /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/uploads/2016/05/something.wordpress.2016-05-17.008.xml_-1.txt:4783: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x32 0x2E 0x30 in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215
Warning: XMLReader::expand(): /chantalmichel_2002.jpg"><img class="alignnone size-full wp-image-13392" title=" in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215
Warning: XMLReader::expand(): ^ in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215
Warning: XMLReader::expand(): An Error Occurred while expanding in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 215
Warning: Invalid argument supplied for foreach() in /homepages/20/d627224091/htdocs/clickandbuilds/something/wp-content/plugins/WordPress-Importer-master/class-wxr-importer.php on line 603