internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.1k stars 1.33k forks source link

Some URLs containing a readable slug and a trailing `/` character cause unhandled internal errors when visited #8990

Open jimchamp opened 5 months ago

jimchamp commented 5 months ago

Problem

Some URLs, when visited, cause the site to serve an unstyled "Internal Server Error" page. This is happening for author, edition, and work URLs that have a trailing / character following a readable slug that contains non-latin characters.

Evidence / Screenshot

Relevant URL(s)

Working URL: https://openlibrary.org/books/OL4210801M/Das_schwerho%CC%88rige_Kind

Broken URL: https://openlibrary.org/books/OL4210801M/Das_schwerho%CC%88rige_Kind/

Reproducing the bug

  1. Go to https://openlibrary.org/books/OL4210801M/Das_schwerho%CC%88rige_Kind/

Context

Related to this Sentry issue

Notes from this Issue's Lead

Proposal & constraints

We suspect that this error may have something to do with how our ReadableUrlProcessor is encoding the readable slug and redirecting.

A similar issue was solved by creating a safe_seeother function, which encodes a URL using encode_url_path before passing it to web.seeother.

Maybe encode_url_path can be used by the ReadableUrlProcessor before any redirects?

Related files

Code changes likely only needed in this method: https://github.com/internetarchive/openlibrary/blob/bee8ab4e369a500e1e31f73d3f06fb88e7c5896f/openlibrary/core/processors/readableurls.py#L40

Stakeholders

@cdrini

mekarpeles commented 5 months ago

@jimchamp recommends P2 since we're hitting sentry errors of a similar form and getting this class of errors removed will give us more clarity. Ty!

KareemDream commented 4 months ago

Hi, I'm a college student interested in working on open-source projects. Could I be assigned this issue? Thanks.

jimchamp commented 4 months ago

@KareemDream, you've been assigned, thanks! I've updated the original message in this thread to include the function that needs to be updated. The solution may be as simple as removing the trailing / character, if one exists, from the path.

kren333 commented 4 months ago

Hi there, I am a classmate of @KareemDream who has been working with him to solve this issue!

In short, we have recently found that the underlying bug behind this issue seems to be larger-scoped than initially thought. We say this after some local testing and logging, and finding that the error message shows up before the call function has even been called.

To begin, we reproduce below a screenshot of the bug, from the logger's perspective:

image

As you can see no print statements are called before the bug occurs. However, we have added a print statement as soon as the call function is entered:

image

Due to this, we believe that the UnicodeEncodeError that we are getting stems from some upstream issue; however, the stack trace was not very informative and we are currently unable to determine where it comes from.

Because of the end of the spring semester at our university we unfortunately do not anticipate having time to finish out this issue. As a result we have 2 requests: 1) do our concerns seem valid/would it be reasonable to rescope this issue, and 2) would it be possible to reassign this issue altogether?

Thank you so much!

jimchamp commented 4 months ago

Thanks for looking into this! I'll have to do some more investigation into where this error is occurring.