kasramp / InternetWayBackMachine

Submit URLs to archive.org easily
https://madadipouya.com/portfolio/the-internet-wayback-machine/
MIT License

Errors out when URLs have percent encoding. #2

Closed: GhbSmwc closed this 4 years ago

GhbSmwc commented 5 years ago

Example:

java -jar InternetWaybackMachine.jar https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/%E5%AD%A6%E8%80%85.jpeg?1426006855

URL to save:

https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/学者.jpeg?1426006855

This fails to save unless you convert the percent-encoded sequences back to their original characters and run chcp 65001 first. Tested on Windows 10.

GhbSmwc commented 5 years ago

Did some additional testing using only the Internet Archive itself, and this is the result:

  1. Enter any of these addresses:

into the Internet Archive's home page. Regardless of which one you choose, you end up with this:

image

  2. Then go to the address bar on that page and enter the percent-encoded version: https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/021/892/original/%E5%AD%A6%E8%80%85.jpeg?1426006855 and it now works.

It turns out that the Internet Archive only converts the URL into non-UTF-8 characters when you enter the address from the home page. I believe the problem is on the IA side and not in this tool, because a while ago it did work when using chcp 65001 with the Japanese URLs rather than the percent-encoded form.

Important to note: most browsers (Chrome and Firefox) automatically convert URLs to percent encoding when you copy them from the address bar. This does not happen if you don't copy all of the characters (for example, if you leave out the “h” in “https” or “http”).

Side note: documentation on percent encoding for non-ASCII characters: https://www.w3.org/International/articles/idn-and-iri/
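For anyone who wants to switch between the two forms of the URL by hand, here is a minimal Java sketch using only `java.net.URI` from the standard library. It is not taken from this tool's source; the class and method names (`PercentEncodingSketch`, `toUnicodeForm`, `toPercentEncodedForm`) are made up for illustration, and it assumes a plain http(s) URL with no fragment.

```java
import java.net.URI;

// Hypothetical helper, not part of InternetWaybackMachine.
final class PercentEncodingSketch {

    // Percent-encoded form -> original characters.
    // URI.getPath() and URI.getQuery() return the decoded components,
    // interpreting the escaped octets as UTF-8.
    // Caveat: an escaped delimiter such as %2F would also be decoded, which can
    // change the meaning of a URL, so this form is meant for display only.
    static String toUnicodeForm(String url) {
        URI uri = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://")
          .append(uri.getAuthority())
          .append(uri.getPath());
        if (uri.getQuery() != null) {
            sb.append('?').append(uri.getQuery());
        }
        return sb.toString();
    }

    // Original characters -> percent-encoded form.
    // URI.toASCIIString() encodes every non-ASCII character as UTF-8
    // percent escapes and leaves existing escapes untouched.
    static String toPercentEncodedForm(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        String encoded = "https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images"
                + "/pictures/000/021/892/original/%E5%AD%A6%E8%80%85.jpeg?1426006855";
        String unicode = toUnicodeForm(encoded);
        // On Windows the console may need chcp 65001 to display this line correctly.
        System.out.println(unicode);
        System.out.println(toPercentEncodedForm(unicode));
    }
}
```

Passing the percent-encoded form through `toPercentEncodedForm` leaves it unchanged, so it should be safe to apply to a URL in either form before submitting it.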

kasramp commented 4 years ago

See #3