internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

Bug: check-url returns status_code of 0 when it is a 404 #877

Closed mojomonger closed 1 year ago

mojomonger commented 1 year ago

Using the check-url endpoint for the URL "http://www.uri.edu/artsci/ecn/starkey/ECN398 -Ecology, Economy, Society/RAPANUI.pdf", the status_code and testdeadlink_status_code fields are both set to 0.

Could this be because of the spaces in the URL?
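One way to test the spaces hypothesis is to percent-encode the path before handing the URL to any checker. The following is a minimal sketch using only the Python standard library; `normalize_url` is a hypothetical helper for illustration and is not part of IARI.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Percent-encode unsafe characters (such as spaces) in the path of a URL.

    '%' is treated as safe so that escapes already present are not
    encoded a second time; ',' is kept as-is to match the URL in this issue.
    """
    parts = urlsplit(raw)
    path = quote(parts.path, safe="/,%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


raw_url = "http://www.uri.edu/artsci/ecn/starkey/ECN398 -Ecology, Economy, Society/RAPANUI.pdf"
print(normalize_url(raw_url))
```

If the encoded form yields a 404 while the raw form yields 0, that would point at the handling of spaces rather than at the remote server.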

The IABOT and CORENTIN methods both correctly return 404.

IARI:

https://archive.org/services/context/iari/v2/check-url?url=http://www.uri.edu/artsci/ecn/starkey/ECN398%20-Ecology,%20Economy,%20Society/RAPANUI.pdf&refresh=true

{
first_level_domain: "uri.edu",
fld_is_ip: false,
url: "http://www.uri.edu/artsci/ecn/starkey/ECN398 -Ecology, Economy, Society/RAPANUI.pdf",
scheme: "",
netloc: "",
tld: "",
malformed_url: false,
malformed_url_details: null,
archived_url: "",
wayback_machine_timestamp: "",
is_valid: false,
request_error: false,
request_error_details: "",
dns_record_found: false,
dns_no_answer: false,
dns_error: false,
status_code: 0,
testdeadlink_status_code: 0,
timeout: 2,
dns_error_details: "",
response_headers: { },
detected_language: "",
detected_language_error: false,
detected_language_error_details: "",
timestamp: 1687232230,
isodate: "2023-06-20T03:37:10.651098",
id: "1b6413b4"
}

IABOT:

% curl -XPOST https://iabot-api.archive.org/testdeadlink.php \
-d $'urls=http://www.uri.edu/artsci/ecn/starkey/ECN398%20-Ecology,%20Economy,%20Society/RAPANUI.pdf' \
-d "authcode=579331d2dc3f96739b7c622ed248a7d3" \
-d "returncodes=1"

{
    "results": {
        "http:\/\/www.uri.edu\/artsci\/ecn\/starkey\/ECN398 -Ecology, Economy, Society\/RAPANUI.pdf": 404,
        "errors": {
            "http:\/\/www.uri.edu\/artsci\/ecn\/starkey\/ECN398 -Ecology, Economy, Society\/RAPANUI.pdf": "RESPONSE CODE: 404"
        }
    },
    "servetime": 1.0557
}

CORENTIN:

% curl -XPOST https://iabot-api.archive.org/undertaker/check \
-d '{ "urls": [ "http://www.uri.edu/artsci/ecn/starkey/ECN398%20-Ecology,%20Economy,%20Society/RAPANUI.pdf" ] }'

[{"url":"http://www.uri.edu/artsci/ecn/starkey/ECN398%20-Ecology,%20Economy,%20Society/RAPANUI.pdf","http_status_code":404,"http_status_message":"404 Not Found"}]

dpriskorn commented 1 year ago

Yet another reason to just scrap the whole thing and make a wrapper for the two others 🤷‍♂️

mojomonger commented 1 year ago

  1. What do you mean by "the two others"? Please be specific. If you mean Corentin vs IABot, then yes, we can make this a wrapper for those.
  2. HOWEVER, this IS breaking for IABot, in that a testdeadlink_status_code of 0 is being returned when it should be 404.
  3. We need this report to be ACCURATE.
  4. For some reason, the check-url logic is dismissing this as an error of some sort and returning a status_code of 0. If it is a parsing error, then malformed_url should be true, which it is not.
  5. I suspect it is returning before even trying to get the status via IABOT, which does return 404, as you can see from the IABOT curl call. This leads me to believe it is returning because of a URL parsing error.
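The inconsistency described in point 4 can be checked mechanically. This is a minimal sketch that assumes the field names from the check-url response quoted in this issue; `is_unexplained_zero` and `ERROR_FLAGS` are hypothetical names, and the choice of which flags count as an explanation is an assumption.

```python
from typing import Any, Mapping

# Assumption: these are the response flags that could explain a missing status code.
ERROR_FLAGS = ("malformed_url", "request_error", "dns_error", "dns_no_answer")


def is_unexplained_zero(response: Mapping[str, Any]) -> bool:
    """Return True when status_code is 0 but no error flag says why."""
    if response.get("status_code", 0) != 0:
        return False
    return not any(response.get(flag, False) for flag in ERROR_FLAGS)


# Subset of the fields returned by check-url for the URL in this issue.
reported = {
    "malformed_url": False,
    "is_valid": False,
    "request_error": False,
    "dns_record_found": False,
    "dns_no_answer": False,
    "dns_error": False,
    "status_code": 0,
    "testdeadlink_status_code": 0,
}
print(is_unexplained_zero(reported))
```

For the response above, every error flag is false while status_code is 0, so the zero is not explained by any reported error.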

mojomonger commented 1 year ago

This URL also really returns 200, but IARI is returning 0:

https://archive.org/search.php?query=%22easter%20island%22&and%5b%5d=mediatype%3A%22texts%22

dpriskorn commented 1 year ago

Could it be timeout related? Did you try to increase the timeout? The default is very short, 2 seconds if I remember correctly.

dpriskorn commented 1 year ago

I tried increasing the timeout myself: https://archive.org/services/context/iari/v2/check-url?url=http://www.uri.edu/artsci/ecn/starkey/ECN398%20-Ecology,%20Economy,%20Society/RAPANUI.pdf&refresh=true&timeout=60

I got a response immediately, so this really is a bug.
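For repeating this test with other URLs or timeouts, the request can be built programmatically. This is a sketch only: the endpoint and the `url`, `refresh`, and `timeout` parameters are taken from the requests quoted in this thread, while `build_check_url` is a hypothetical helper. Note that `urlencode` escapes the target URL fully, so the query string looks different from the hand-written one above but decodes to the same values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CHECK_URL_ENDPOINT = "https://archive.org/services/context/iari/v2/check-url"


def build_check_url(target: str, timeout: int = 2) -> str:
    """Build a check-url request that forces a refresh and sets the timeout in seconds."""
    params = {"url": target, "refresh": "true", "timeout": timeout}
    return f"{CHECK_URL_ENDPOINT}?{urlencode(params)}"


target = "http://www.uri.edu/artsci/ecn/starkey/ECN398 -Ecology, Economy, Society/RAPANUI.pdf"
request_url = build_check_url(target, timeout=60)

# Decode the query string again to confirm the parameters round-trip.
query = parse_qs(urlsplit(request_url).query)
print(request_url)
print(query["timeout"], query["url"])
```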

mojomonger commented 1 year ago

Yes, it is :) @dpriskorn What module/source file do you think this is in, where it is sending back a status code of 0?

dpriskorn commented 1 year ago

This class https://github.com/internetarchive/iari/blob/main/src/models/identifiers_checking/url.py

dpriskorn commented 1 year ago

I investigated. This is caused by the validation returning false (is_valid: False).

We only check status codes on URLs that our URL checker considers valid (the checker seems buggy and, IMO, is no longer needed). See https://github.com/internetarchive/iari/blob/main/src/models/wikimedia/wikipedia/url.py#L160 for the check. This is thus expected behavior under the current design, so I'm closing this as it is not a bug.

If you want me to remove the url validation code and just relay whatever the user is sending to the endpoint to testdeadlink, please open a new issue.
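The difference between the current design and the proposed one can be sketched as follows. This is not the IARI implementation: `UrlCheck`, `check_url`, and the two stub functions are hypothetical, and the stub validator that rejects spaces is only an assumption about why validation fails for this URL.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UrlCheck:
    url: str
    is_valid: bool = False
    status_code: int = 0  # stays 0 when no status check is performed


def check_url(
    url: str,
    validate: Callable[[str], bool],
    fetch_status: Callable[[str], int],
    gate_on_validation: bool = True,
) -> UrlCheck:
    """Run validation, then fetch the status code unless the gate blocks it."""
    result = UrlCheck(url=url, is_valid=validate(url))
    if result.is_valid or not gate_on_validation:
        result.status_code = fetch_status(url)
    return result


def reject_spaces(url: str) -> bool:
    # Hypothetical validator: treats any URL containing a space as invalid.
    return " " not in url


def always_404(url: str) -> int:
    # Stub standing in for a real status check such as testdeadlink.
    return 404


url = "http://www.uri.edu/artsci/ecn/starkey/ECN398 -Ecology, Economy, Society/RAPANUI.pdf"
gated = check_url(url, reject_spaces, always_404)
relayed = check_url(url, reject_spaces, always_404, gate_on_validation=False)
print(gated.status_code, relayed.status_code)
```

With the gate on, a failed validation leaves status_code at its default of 0, which matches the behavior reported here; with the gate off, the status from the underlying checker is relayed regardless of validation.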