internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

As a data consumer I want the /check-url endpoint to not produce the 'text' property in its output #867

Closed: mojomonger closed this 1 year ago

mojomonger commented 1 year ago

Running check-url for:

https://archive.org/services/context/iari/v2/check-url?refresh=true&url=https://web.archive.org/web/20140403193826/http://travel.nationalgeographic.com/travel/world-heritage/easter-island/

produces JSON like this:

{
first_level_domain: "nationalgeographic.com",
. . .
status_code: 200,
testdeadlink_status_code: 200,
. . .
text: "

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml"> ..."

    <head><script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type=

(a whole lot of text describing the page)

I do not think this field is needed, especially as it enlarges the JSON output tremendously.
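
Until the endpoint changes, a consumer could drop the property on their side. The following is only a minimal sketch of that workaround, not anything from the IARI codebase: the helper name `drop_text_field` is hypothetical, and the sample payload is a made-up stand-in shaped like the output above, with a placeholder string in place of the real page HTML.

```python
import json


def drop_text_field(payload: dict) -> dict:
    """Return a copy of a check-url response without the bulky 'text' property.

    Hypothetical helper for illustration; the original dict is left unchanged.
    """
    return {key: value for key, value in payload.items() if key != "text"}


# Made-up sample shaped like the output in this issue. The 'text' value is a
# placeholder, not the real HTML returned by the endpoint.
sample = {
    "first_level_domain": "nationalgeographic.com",
    "status_code": 200,
    "testdeadlink_status_code": 200,
    "text": "<!DOCTYPE html ..." + "x" * 10_000,
}

slim = drop_text_field(sample)

# Compare the serialized sizes to see how much the 'text' property adds.
print(len(json.dumps(sample)), "->", len(json.dumps(slim)))
```

This only trims the payload after it has been received; the full response is still transferred, so it does not replace removing the field server-side.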

dpriskorn commented 1 year ago

It's debug output. Since we are about to deprecate this endpoint, I suggest we close this story as abandoned.