internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

As a data consumer I want the /check-url endpoint to accurately cache results #875

Open mojomonger opened 1 year ago

mojomonger commented 1 year ago

When I run the check-url endpoint with the following URL:

https://archive.org/services/context/iari/v2/check-url?url=https://web.archive.org/web/20170726234423/https://minnesotastreetproject.com/exhibitions/1275-minnesota-st/internet-archive%E2%80%99s-2017-artist-residence-exhibition

it does not have the "testdeadlink_status_code" property in the returned results. This indicates that something is wrong with the caching process, as a previous fetch with /check-url was done with the "refresh=true" flag set.

first_level_domain: "archive.org",
fld_is_ip: false,
url: "https://web.archive.org/web/20170726234423/https://minnesotastreetproject.com/exhibitions/1275-minnesota-st/internet-archive’s-2017-artist-residence-exhibition",
fixed_url: "",
scheme: "https",
netloc: "web.archive.org",
tld: "org",
unrecognized_tld_length: false,
added_http_scheme_worked: false,
malformed_url: false,
malformed_url_details: null,
request_error: false,
request_error_details: "",
dns_record_found: true,
dns_no_answer: false,
dns_error: false,
status_code: 200,
timeout: 60,
dns_error_details: "",
response_headers: {},
timestamp: 1682018676,
isodate: "2023-04-20T19:24:36.221801",
id: "bbfeb6dd",
served_from_cache: true
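
A quick way to spot this symptom is to check a parsed response for the expected key. This is only a minimal sketch: it assumes the response has already been parsed into a Python dict and that the property is spelled `testdeadlink_status_code`. The helper name `missing_keys` is hypothetical and not part of IARI.

```python
def missing_keys(response: dict, expected: tuple[str, ...]) -> list[str]:
    """Return the expected keys that are absent from a parsed check-url response."""
    return [key for key in expected if key not in response]


# Abbreviated copy of the cached response shown above.
cached = {
    "status_code": 200,
    "timestamp": 1682018676,
    "served_from_cache": True,
}

print(missing_keys(cached, ("status_code", "testdeadlink_status_code")))
# prints ['testdeadlink_status_code']
```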

When check-url is run with refresh=true, a 500 error occurs:

https://archive.org/services/context/iari/v2/check-url?refresh=true&url=https://web.archive.org/web/20170726234423/https://minnesotastreetproject.com/exhibitions/1275-minnesota-st/internet-archive%E2%80%99s-2017-artist-residence-exhibition

returns:

Internal Server Error The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

It appears something is going wrong with the processing of this URL when refresh=true.

dpriskorn commented 1 year ago

I suggest we dedicate time to check-urls instead and add cache testing to that new endpoint.

Then, once it is merged, we deprecate the check-url endpoint and close all stories related to it, like this one.

mojomonger commented 1 year ago

I do not agree. I think we should keep the check-url (singular) endpoint. When you do implement the check-urls (plural) endpoint, you can use the same code internally when only one URL is checked.

This, I think, is a better API, as people will sometimes want to check only one URL. Fixing this bug NOW also makes the demo version of our software, IARE, look OK and reliable. The way it is now, we get a horrible error message because the data is corrupt (see screenshot).
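
The reuse suggested here could look roughly like the following. This is only a sketch of the idea, not IARI's code: the function names `check_urls`, `check_url`, and `fake_checker` are hypothetical, and the network check is replaced by a stand-in so the example runs offline.

```python
from typing import Callable


def check_urls(urls: list[str], checker: Callable[[str], dict]) -> list[dict]:
    """Hypothetical plural entry point: check every URL, preserving order."""
    return [checker(url) for url in urls]


def check_url(url: str, checker: Callable[[str], dict]) -> dict:
    """Hypothetical singular entry point: a one-element call into check_urls."""
    return check_urls([url], checker)[0]


def fake_checker(url: str) -> dict:
    # Stand-in for the real network check, so the sketch needs no connection.
    return {"url": url, "status_code": 200}


print(check_url("https://example.com", fake_checker))
```

With this shape, both endpoints share one code path, so a caching fix made for one applies to the other.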

markjohngraham commented 1 year ago

Right!

Please keep it.


mojomonger commented 1 year ago

@dpriskorn - is there a way to test this in your debug environment?

I would try:

If the new cache file does not replace the old, existing cache file, then the bug lies within this logic.
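
One way to make sure a fresh result always replaces a stale cache file is to write to a temporary file and then swap it in. The sketch below assumes the cache is stored as one JSON file per URL on disk; the thread mentions cache files but not their format, and `write_cache` is a hypothetical name.

```python
import json
import os
import tempfile


def write_cache(path: str, data: dict) -> None:
    """Write data as JSON to a temp file, then replace the cache file in one step."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # ensure_ascii=False keeps non-ASCII characters; UTF-8 can encode them.
            json.dump(data, handle, ensure_ascii=False)
        os.replace(tmp_path, path)  # overwrites any existing file at path
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temp file behind
        raise
```

If the write fails partway, the old cache file is left untouched and the error propagates, which makes a failed refresh visible instead of silently serving old data.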

Also, the following:

When you run this /check-url, the "old, cached" version is returned:

https://archive.org/services/context/iari/v2/check-url?url=https://minnesotastreetproject.com/exhibitions/1275-minnesota-st/internet-archive%E2%80%99s-2017-artist-residence-exhibition

But this gives a 500 error when refresh=true is added:

https://archive.org/services/context/iari/v2/check-url?url=https://minnesotastreetproject.com/exhibitions/1275-minnesota-st/internet-archive%E2%80%99s-2017-artist-residence-exhibition&refresh=true

That should help you in that you should be able to see why it is breaking on that URL.

dpriskorn commented 1 year ago

Thanks for the examples. I'll look into it soon.

mojomonger commented 1 year ago

Please do. This is a very glaring example of IARE showing incorrect information based on IARI data.

dpriskorn commented 1 year ago

This bug is related to URL encoding and weird characters (see image). It causes a UnicodeEncodeError in the gunicorn worker, so the content is never saved to disk.
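
The general failure mode can be reproduced in isolation. The thread does not say where in IARI the encoding happens, so this sketch only shows that the right single quotation mark (U+2019) in the failing URL cannot be encoded as ASCII, while its percent-encoded form can. The helper `is_ascii_safe` is hypothetical.

```python
from urllib.parse import quote, unquote

# Path fragment taken from the failing URL; U+2019 is the non-ASCII character.
PATH = "/internet-archive\u2019s-2017-artist-residence-exhibition"


def is_ascii_safe(text: str) -> bool:
    """Return True if text can be encoded as ASCII without raising."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


percent_encoded = quote(PATH)  # encodes U+2019 as its UTF-8 bytes: %E2%80%99

print(is_ascii_safe(PATH))               # False
print(is_ascii_safe(percent_encoded))    # True
print(unquote(percent_encoded) == PATH)  # True
```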

dpriskorn commented 1 year ago

Fixed now (see image).

mojomonger commented 1 year ago

Great! Glad it is fixed. Could you add a (short) description of what the fix entailed, and which modules were affected? Thanks.

dpriskorn commented 1 year ago

Please see the commits in the PR linked and ask questions there if anything is unclear.