internetarchive / wayback-machine-webextension

A web browser extension for Chrome, Firefox, Edge, and Safari 14.
GNU Affero General Public License v3.0

Failure to save a page when its URL contains Unicode characters #350

Open twlz0ne opened 5 years ago

twlz0ne commented 5 years ago

Summary

Saving a page fails when its URL contains Unicode characters.


Actual Behavior

The page is not saved.

Expected Behavior

The page is saved successfully.

Steps to Reproduce:

  1. Visit https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs字节码内部说明.org
  2. Click Save Page Now
  3. A new tab opens automatically with this URL in the address bar: https://web.archive.org/web/20190416080840/https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs%25E5%25AD%2597%25E8%258A%2582%25E7%25A0%2581%25E5%2586%2585%25E9%2583%25A8%25E8%25AF%25B4%25E6%2598%258E.org
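
For illustration: the URL in step 3 contains `%25E5` where `%E5` would be expected, which is what percent-encoding an already-encoded path produces. The sketch below only reproduces that symptom with Python's standard library; it is not the extension's code (which is JavaScript), and the variable names are made up for this example.

```python
from urllib.parse import quote

# File name from the URL in step 1.
name = "Emacs字节码内部说明.org"

# Encoding once gives the UTF-8 bytes as %XX escapes.
once = quote(name)

# Encoding the result again escapes each "%" as "%25",
# which matches the file name in the URL from step 3.
twice = quote(once)

print(once)
print(twice)
```

If the extension (or something between it and the archive) encodes the URL a second time, the Wayback Machine would look up a path that does not exist on GitHub.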

Environment

sr6033 commented 5 years ago

It is working. The page is getting saved. My environment is the same as yours. This issue has been resolved in the new release, which is under development. It hasn't reached the Chrome Web Store production version yet. You can use the current version by cloning this repo.


Saved page: https://web.archive.org/web/20190416092945/https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs%E5%AD%97%E8%8A%82%E7%A0%81%E5%86%85%E9%83%A8%E8%AF%B4%E6%98%8E.org

twlz0ne commented 5 years ago

I also tried submitting the same URL with the savepagenow command, and got the expected result:

⋊> savepagenow 'https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs字节码内部说明.org'

savepagenow.api.CachedPage: archive.org returned a cached version of this page: https://web.archive.org/web/20190416081703/https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs%E5%AD%97%E8%8A%82%E7%A0%81%E5%86%85%E9%83%A8%E8%AF%B4%E6%98%8E.org
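
The archived URL above is encoded exactly once. As a rough sketch of how a client could avoid encoding a path twice, one option is to decode the path before encoding it, so that raw and already-encoded input give the same result. `encode_once` is a hypothetical helper written for this example, not part of savepagenow or the extension, and it assumes the path contains no literal `%` characters that are meant to be kept.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_once(url: str) -> str:
    """Percent-encode the path of `url`, whether or not it is already encoded."""
    parts = urlsplit(url)
    # Decode first so an already-encoded path is not escaped a second time.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


raw = "https://github.com/lujun9972/emacs-document/blob/master/elisp-common/Emacs字节码内部说明.org"
encoded = encode_once(raw)

print(encoded)
print(encode_once(encoded) == encoded)
```

This prevents a second round of encoding; it does not repair a URL that has already been encoded twice.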


twlz0ne commented 5 years ago

@sr6033 Thank you for noticing that

I checked the development repo, and it does indeed work. But it returned a beta test URL:

wayback-machine-chrome-beta

If I click the URL directly, then it tells me:

You reached Wayback Machine Closed Beta Test Site

To use the public Wayback Machine » click here (http://web.archive.org/) «

So I need to edit the URL manually to get the correct one.

sr6033 commented 5 years ago

Yes. The beta site is still under development, I suppose.

cgorringe commented 4 years ago

I noticed that these "404" pages happen with GitHub pages in particular, and not necessarily because of Unicode characters. I just tested with the latest master branch and got the same 404 page posted above. The save API call appeared to work fine. I wonder if GitHub is intentionally feeding these pages to the Archive's bots?