petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/

Internal Server Error on html2story, html2text #43

Closed by eddof13 2 years ago

eddof13 commented 10 years ago

Submitting the HTML of a news article to html2story or html2text returns an Internal Server Error, both on datasciencetoolkit.org and on my Amazon AMI. It seems to happen with any legitimate news article; I've tested it on CNN and Reuters. The request only succeeds if I cut it down to approximately 3k in length.

Test article http://www.reuters.com/article/2014/05/19/us-usa-security-imam-idUSBREA4I0NL20140519?feedType=RSS&feedName=domesticNews

HTTP/1.1 500 Internal Server Error
Date: Mon, 19 May 2014 21:45:16 GMT
Server: Apache/2.2.22 (Ubuntu)
Status: 500 Internal Server Error
Vary: Accept-Encoding
Content-Length: 630
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator,
 [no address given] and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.2.22 (Ubuntu) Server at www.datasciencetoolkit.org Port 80</address>
</body></html>
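
To pin down the "approximately 3k" threshold more precisely, one option is to bisect on the request size. The following is only a rough sketch: `largest_accepted_size` and `post_ok` are hypothetical helper names, the endpoint URL in the usage comment is an assumption based on the issue title, and the bisection assumes that if a prefix of a given length fails, every longer prefix fails too.

```python
import urllib.error
import urllib.request
from typing import Callable


def largest_accepted_size(payload: bytes, accepts: Callable[[bytes], bool]) -> int:
    """Return the largest n for which accepts(payload[:n]) is True.

    Assumes acceptance is monotonic in length and that the empty
    prefix is accepted (it is never actually sent).
    """
    if accepts(payload):
        return len(payload)
    lo, hi = 0, len(payload)  # prefix of length lo accepted, length hi rejected
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if accepts(payload[:mid]):
            lo = mid
        else:
            hi = mid
    return lo


def post_ok(url: str, body: bytes, timeout: float = 30.0) -> bool:
    """POST body to url; True on a 2xx response, False on an HTTP error such as 500."""
    req = urllib.request.Request(url, data=body, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False


# Example usage (assumed endpoint; article.html is a locally saved copy of the page):
#   html = open("article.html", "rb").read()
#   url = "http://www.datasciencetoolkit.org/html2text"
#   print(largest_accepted_size(html, lambda b: post_ok(url, b)))
```

If the reported size comes out the same for different articles, that would point to a request-size limit rather than to something in the article content, though the server error log is still the authoritative place to look.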