lonekorean / wordpress-export-to-markdown

Converts a WordPress export XML file into Markdown files.
MIT License
1.07k stars, 216 forks

URI malformed Error #120

Closed: h2dcc closed this issue 2 months ago

h2dcc commented 3 months ago
Something went wrong, execution halted early.
URIError: URI malformed
    at decodeURIComponent (<anonymous>)
    at getPostSlug (D:\wordpress-export-to-markdown-2.3.7\src\parser.js:100:9)
    at D:\wordpress-export-to-markdown-2.3.7\src\parser.js:71:12
    at Array.map (<anonymous>)
    at D:\wordpress-export-to-markdown-2.3.7\src\parser.js:64:5
    at Array.forEach (<anonymous>)
    at collectPosts (D:\wordpress-export-to-markdown-2.3.7\src\parser.js:61:12)
    at Object.parseFilePromise (D:\wordpress-export-to-markdown-2.3.7\src\parser.js:22:16)
    at async D:\wordpress-export-to-markdown-2.3.7\index.js:15:16

Hello, I recently encountered an error while using this tool. I suspect the cause is a category in my WordPress database holding several hundred articles that were published in a Twitter-style format, with titles longer than 200 characters and many special symbols. Is there a solution to this issue?

lonekorean commented 3 months ago

If your export file doesn't have sensitive info in it and you're comfortable sharing it, could you email it to me? That would help me find which URLs are causing problems. My email is on my GitHub profile page.

h2dcc commented 3 months ago

Thanks. I have tested all the articles in WordPress. Through a process of elimination, I identified the specific data that was causing the "URI malformed" error.

test7.zip
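
For anyone hitting the same error, the elimination step could probably be automated. The following is only a sketch: `findMalformedPostNames` is a hypothetical helper, not part of wordpress-export-to-markdown, and it assumes each `post_name` sits on one line inside a `<wp:post_name>` element, either bare or wrapped in CDATA.

```javascript
// Hypothetical helper: list every <wp:post_name> value in a WordPress
// export (WXR) string that decodeURIComponent cannot decode.
function findMalformedPostNames(xml) {
  // Assumption: the value is either bare or wrapped in <![CDATA[ ... ]]>.
  const pattern =
    /<wp:post_name>(?:<!\[CDATA\[)?(.*?)(?:\]\]>)?<\/wp:post_name>/g;
  const malformed = [];
  for (const match of xml.matchAll(pattern)) {
    try {
      decodeURIComponent(match[1]);
    } catch (err) {
      // Only malformed percent-encoding is expected here; rethrow anything else.
      if (!(err instanceof URIError)) throw err;
      malformed.push(match[1]);
    }
  }
  return malformed;
}
```

It could be run against an export with something like `findMalformedPostNames(require('fs').readFileSync('export.xml', 'utf8'))`, where `export.xml` is a placeholder file name.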

lonekorean commented 2 months ago

Thank you, I was able to see the issue clearly with your test7.xml file.

It seems to be an encoding + truncation issue. The post_name is being truncated in the middle of a percent-encoded multi-byte character, which breaks the encoding.

Specifically, JavaScript doesn't like this:

qq%e4%b8%8a%e5%a5%bd%e5%8f%8b%e5%8f%8a%e7%be%a4%e5%a4%aa%e5%a4%9a%ef%bc%8c%e5%b7%a8%e8%80%97%e6%89%8b%e6%9c%ba%e6%b5%81%e9%87%8f%e3%80%81%e7%94%b5%e8%84%91%e5%86%85%e5%ad%98%e7%ad%89%e5%90%84%e7%a7

but if I add %8d to the end, JavaScript will happily decode it to qq上好友及群太多，巨耗手机流量、电脑内存等各种, which matches (part of) the title.
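
A minimal sketch that reproduces this in Node.js. The string is the truncated post_name quoted above; `truncated` and `tryDecode` are illustrative names only, not part of the tool.

```javascript
// The truncated post_name: it ends with %e7%a7, the first two bytes of a
// three-byte UTF-8 sequence whose final byte (%8d) is missing.
const truncated =
  'qq%e4%b8%8a%e5%a5%bd%e5%8f%8b%e5%8f%8a%e7%be%a4%e5%a4%aa%e5%a4%9a%ef%bc%8c' +
  '%e5%b7%a8%e8%80%97%e6%89%8b%e6%9c%ba%e6%b5%81%e9%87%8f%e3%80%81' +
  '%e7%94%b5%e8%84%91%e5%86%85%e5%ad%98%e7%ad%89%e5%90%84%e7%a7';

// Return the decoded string, or a description of the error if decoding fails.
function tryDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    return `${err.name}: ${err.message}`;
  }
}

console.log(tryDecode(truncated));         // reports a URIError
console.log(tryDecode(truncated + '%8d')); // decodes once the missing byte is restored
```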

I am not sure how your WordPress data got into that state, but it doesn't seem right that it would blindly truncate an encoded string like that.

Anyway, I hope that because you know what the bad data was, you were able to manually edit and move past it. I am unsure if this is something I will fix on my side, since the problem seems to be in the malformed data itself and this is the first and only report I've seen of this.
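
If a tolerant fallback were ever wanted, one possible approach is sketched below. This is not the tool's current behavior; `decodeSlugLeniently` is a hypothetical helper, and silently dropping the cut-off trailing character is an assumption about what would be acceptable.

```javascript
// Hypothetical helper: decode a percent-encoded slug, dropping a trailing
// partial UTF-8 sequence if truncation left one behind.
function decodeSlugLeniently(encoded) {
  let candidate = encoded;
  // A truncated UTF-8 character leaves at most 3 complete %xx escapes plus
  // one cut-off escape at the end, so four trims are enough.
  for (let attempt = 0; attempt <= 4; attempt++) {
    try {
      return decodeURIComponent(candidate);
    } catch (err) {
      if (!(err instanceof URIError)) throw err;
      // Remove one trailing escape ("%", "%x", or "%xx") and retry.
      const trimmed = candidate.replace(/%[0-9a-f]{0,2}$/i, '');
      if (trimmed === candidate) break; // the problem is not at the end
      candidate = trimmed;
    }
  }
  // Still undecodable: return the raw string instead of halting.
  return encoded;
}
```

With the post_name above, this would decode everything up to 各 and discard the incomplete final character rather than throwing.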

Regardless, thank you for letting me know so I am aware!

h2dcc commented 2 months ago

Thank you for helping to identify the root cause of the issue.

My WordPress database is an old one, dating back to around 2010, when I was using the P2 theme created by Automattic (https://wordpress.org/support/theme/p2/reviews/), which allowed posts to be directly published on the blog homepage just like on Twitter and could synchronize with Twitter.

The problem is that all the content was stored in the "post title" field in the database, and there were no formatting requirements at the time, which led to this peculiar situation.

Upon discovering that this abnormal data was preventing the conversion, I deleted it from the WordPress dashboard, and subsequently managed to convert the remaining content into MD files.

In fact, around 2014, I had considered saving the content of this old website as a visual file, but there didn't seem to be a convenient method available then.

In any case, I am extremely grateful for your work, which has enabled me to import my old articles into Obsidian by converting them into MD files.