internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

As a data consumer, I want check-url to return a status code of 200 for a URL that resolves to an .xlsx file #879

Open mojomonger opened 1 year ago

mojomonger commented 1 year ago

For the URL:

https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile%20(5).xlsx

Both IABOT and CORENTIN return a status code of 200, but IARI returns a status code of 0.

IABOT:

curl -XPOST https://iabot-api.archive.org/testdeadlink.php \
-d $'urls=https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile%20(5).xlsx' \
-d "authcode=579331d2dc3f96739b7c622ed248a7d3" \
-d "returncodes=1"

{
    "results": {
        "https:\/\/www.ncei.noaa.gov\/pub\/data\/normals\/WMO\/1981-2010\/RA-III\/Chile\/WMO_Normals_Chile (5).xlsx": 200
    },
    "servetime": 0.5358
}

CORENTIN:

curl -XPOST https://iabot-api.archive.org/undertaker/check \
-d '{ "urls": [ "https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile%20(5).xlsx" ] }'

[{"url":"https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile%20(5).xlsx","http_status_code":200,"http_status_message":"200 OK"}]

IARI: https://archive.org/services/context/iari/v2/check-url?refresh=true&url=https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile%20(5).xlsx

{
first_level_domain: "noaa.gov",
fld_is_ip: false,
url: "https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/RA-III/Chile/WMO_Normals_Chile (5).xlsx",
scheme: "",
netloc: "",
tld: "",
malformed_url: false,
malformed_url_details: null,
archived_url: "",
wayback_machine_timestamp: "",
is_valid: false,
request_error: false,
request_error_details: "",
dns_record_found: false,
dns_no_answer: false,
dns_error: false,
status_code: 0,
testdeadlink_status_code: 0,
timeout: 2,
dns_error_details: "",
response_headers: { },
detected_language: "",
detected_language_error: false,
detected_language_error_details: "",
timestamp: 1687299624,
isodate: "2023-06-20T22:20:24.094718",
id: "077e9685"
}
mojomonger commented 1 year ago

If the solution is for IARI to wrap the IABOT testdeadlink code, then please do so. :) Having a working /check-url endpoint is crucial at this point.
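
To make the wrapping idea concrete, here is a minimal sketch of how the testdeadlink response shown above could be turned into a status code. It is not IARI's actual code: the function name `extract_status_code` and the fallback value of 0 are assumptions for illustration. One detail taken from the output above is that testdeadlink keys its results by the decoded URL (with a space instead of `%20`), so the lookup tries both forms.

```python
import json
from urllib.parse import unquote


def extract_status_code(response_text: str, url: str) -> int:
    """Return the status code testdeadlink reported for `url`.

    Hypothetical helper: the name and the 0 fallback are assumptions,
    chosen to mirror the `status_code: 0` that IARI currently reports.
    """
    results = json.loads(response_text).get("results", {})
    # testdeadlink echoes the URL percent-decoded ("%20" becomes " "),
    # so try the URL as given first and then its decoded form.
    for key in (url, unquote(url)):
        code = results.get(key)
        if isinstance(code, int) and not isinstance(code, bool):
            return code
    return 0  # no usable result for this URL


if __name__ == "__main__":
    requested = (
        "https://www.ncei.noaa.gov/pub/data/normals/WMO/1981-2010/"
        "RA-III/Chile/WMO_Normals_Chile%20(5).xlsx"
    )
    # Response shaped like the IABOT output quoted in this issue.
    body = json.dumps({"results": {unquote(requested): 200}, "servetime": 0.5358})
    print(extract_status_code(body, requested))
```

A wrapper along these lines would also need to POST the `urls`, `authcode` and `returncodes` fields as in the curl call above; that part is left out here because it depends on how IARI makes its HTTP requests.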