libertysoft3 / reddit-html-archiver

archive reddit data as offline friendly web pages
MIT License
167 stars, 27 forks

invalid continuation byte #26

Open MattPeterson0 opened 3 years ago

MattPeterson0 commented 3 years ago

Hello. This program is great. Setting it all up on Windows with no previous Python experience was an adventure, but once I got everything in place, it's fantastic. Thank you very much for making this.

Recently I've been getting a problem with write_html.py with a particular subreddit capture. Here's the error:

Traceback (most recent call last):
  File "write_html.py", line 774, in <module>
    generate_html(args.min_score, args.min_comments, hide_deleted_comments)
  File "write_html.py", line 119, in generate_html
    write_link_page(subs, l, sub, hide_deleted_comments)
  File "write_html.py", line 288, in write_link_page
    '###BODY###': snudown.markdown(c['body'].replace('&gt;','>')),

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 62: invalid continuation byte

It actually creates the posts in /r/ right up to where it crashes, and with a little trial and error I was able to isolate the problem to a specific line in a specific .csv file: a comment that used a "U+2019 Right Single Quotation Mark" (UTF-8 encoding: 0xE2 0x80 0x99) as an apostrophe. When I replaced that character with a normal straight single quotation mark in the .csv file, it parsed the file fine. (I don't quite get "position 62" though; the apostrophe was the 45th character on the line.) The really puzzling thing is that other comments from the same user use the same character elsewhere in the same .csv file, and those don't cause a problem.
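
For reference, a minimal sketch of how this error message arises and what "position" counts. The byte strings here are made up for illustration and are not taken from the actual .csv file. The decoder reports a byte offset inside whatever buffer it was handed, which need not be the whole CSV line, so the number can differ from the character position seen in an editor.

```python
# U+2019 (right single quotation mark) is three bytes in UTF-8.
encoded = "\u2019".encode("utf-8")
print(encoded.hex(" "))  # e2 80 99

# Hypothetical damaged data: the 0xE2 lead byte is followed by "t",
# which is not a continuation byte (those are 0x80 through 0xBF).
prefix = "\u201cquoted\u201d ".encode("utf-8")  # 9 characters, 13 bytes
damaged = prefix + b"don\xe2t care"

try:
    damaged.decode("utf-8")
    error = None
except UnicodeDecodeError as err:
    error = err  # keep a reference; "err" is cleared after the except block

print(error.reason)  # invalid continuation byte
print(error.start)   # 16: byte offset of 0xE2 within the decoded buffer

# The same spot counted in characters is smaller, because each curly
# quote in the prefix is one character but three bytes.
char_index = len(prefix.decode("utf-8")) + len("don")
print(char_index)    # 12
```

Note that a complete 0xE2 0x80 0x99 sequence decodes without error, so this message suggests the three bytes were split or altered somewhere before the decode step.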

Well, it crashed out again after I fixed that, but in a different place, on an entry from eight months later, and at "position 159". I guess I have another buggy character to hunt down. I don't have time right now, but I will update later if this second one reveals any further clues.
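
Rather than hunting by trial and error, the raw bytes of each .csv file can be checked directly. The following is a rough sketch, not part of reddit-html-archiver; the function name `find_invalid_utf8` is hypothetical. It reads a file in binary mode and reports every line that is not valid UTF-8, with the byte offset of the first bad byte on that line.

```python
def find_invalid_utf8(path):
    """Return (line_number, byte_offset, hex_of_next_bytes) for each bad line."""
    problems = []
    # Binary mode avoids any implicit decoding by open().
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # Show up to four bytes starting at the offending one.
                snippet = raw[err.start:err.start + 4].hex()
                problems.append((line_number, err.start, snippet))
    return problems
```

If this reports nothing for the file that triggers the crash, the file itself is valid UTF-8 and the bytes are presumably being damaged at a later step.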

libertysoft3 commented 3 years ago

did you try the windows thing here? https://github.com/libertysoft3/reddit-html-archiver#install

libertysoft3 commented 3 years ago

potential duplicate of https://github.com/libertysoft3/reddit-html-archiver/issues/23

MattPeterson0 commented 3 years ago

Oh, this bit?

chcp 65001
set PYTHONIOENCODING=utf-8

I should have said that. Yes, with or without doing that, same issue. I even re-fetch_links.py'd the entire thing because I hadn't done the 65001/utf-8 thing the first time. Didn't help.
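
One possible reason the setting makes no difference: `chcp 65001` changes the console code page and `PYTHONIOENCODING` only overrides the encoding of stdin, stdout, and stderr. Neither changes the default that `open()` uses when no `encoding=` argument is given. Whether that matters here depends on how write_html.py opens its files, which is an assumption not verified in this thread. A small sketch for checking what a given Python session is actually using (the helper name `describe_encodings` is hypothetical):

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses for console output and files."""
    return {
        # Affected by PYTHONIOENCODING.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used for file names, not file contents.
        "filesystem": sys.getfilesystemencoding(),
        # Default for open() when encoding= is omitted (outside UTF-8 mode).
        "open() default": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

On many Windows installations the last entry is a legacy code page such as cp1252 rather than UTF-8, even after the two commands above have been run.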

I also came back here and grabbed the current copy of write_html.py in case some update since my original download changed things. Nope, same problem.

It's very mysterious! Haven't had time to poke at it more, maybe next week.