mediacloud / news-search-api

Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
https://mediacloud.org
GNU Affero General Public License v3.0

story list query fails with "header too long error" #75

Closed rahulbot closed 1 month ago

rahulbot commented 2 months ago

If I run a query that includes lots of domains, I get an error:

requests.exceptions.ConnectionError: ('Connection aborted.', LineTooLong('got more than 65536 bytes when reading header line'))
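
For context, the `LineTooLong` in that traceback comes from Python's standard `http.client` module, which `requests` relies on (via urllib3) to read response headers. The following is only a minimal sketch to reproduce the limit locally without a server; the helper name `header_line_is_parseable` is hypothetical, and `http.client.parse_headers` is used here as a convenient stand-in for what the client does when it reads a real response.

```python
import http.client
import io


def header_line_is_parseable(name: str, value: str) -> bool:
    """Return True if http.client can read a single header line of this size.

    Hypothetical helper for illustration only; it is not part of
    news-search-api or mediacloud-news-client.
    """
    # Build a raw header block: one header line followed by the blank
    # line that terminates HTTP headers.
    raw = f"{name}: {value}\r\n\r\n".encode("latin-1")
    try:
        http.client.parse_headers(io.BytesIO(raw))
    except http.client.LineTooLong:
        # Raised when a single header line exceeds http.client's limit.
        return False
    return True
```

A short value parses fine, while a value of roughly 70,000 characters (comparable to a URL carrying hundreds of domains) is rejected before the response body is ever read.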

This prevents me from downloading the story list CSV in the search tools, and even making API queries doesn't work. Looking at this code, I believe the critical line causing the error is this one: https://github.com/mediacloud/news-search-api/blob/0a25568dafafc5fbb53706e7d20d23ffad6350e1/api.py#L512

Note how it attempts to include the entire qurl in the Link header. I believe this value is far too long because of all the domains in my giant query, so requests throws an error while parsing the response headers. We don't actually use that Link header for paging; in all the code I've seen we've used the x-resume-token that is included as a separate header just above. Since this is an internal API, we could just remove this header, or add code to ensure that it is not too long.
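
As a rough sketch of the second option (keeping the header but guarding its length), something like the following could work. Everything here is an assumption for illustration: the function name `build_paging_headers`, the `rel="next"` link format, and the 8192-byte threshold are hypothetical and not taken from `api.py`. The only names from this issue are the `x-resume-token` header and the Link header itself.

```python
from typing import Dict, Optional

# Assumption: a conservative per-line budget, well under the 65536-byte
# limit that the Python client enforces on a single header line.
MAX_HEADER_LINE_BYTES = 8192


def build_paging_headers(
    resume_token: Optional[str],
    next_url: Optional[str],
    max_line_bytes: int = MAX_HEADER_LINE_BYTES,
) -> Dict[str, str]:
    """Build paging headers, omitting the Link header when it is too long.

    Hypothetical helper; the real api.py may construct its headers differently.
    """
    headers: Dict[str, str] = {}
    if resume_token:
        # The resume token is what clients actually use for paging.
        headers["x-resume-token"] = resume_token
    if next_url:
        link_value = f'<{next_url}>; rel="next"'
        # Measure the full header line as it would appear on the wire.
        line = f"link: {link_value}\r\n".encode("latin-1", errors="replace")
        if len(line) <= max_line_bytes:
            headers["link"] = link_value
    return headers
```

With this guard, a query containing hundreds of domains would still return the `x-resume-token` header and simply omit the oversized Link header, so paging clients keep working.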

I suggest making a change either way and deploying it to the staging release. It can be tested by creating a tunnel to the staging deployment and using the following proposed test case for mediacloud-news-client locally on a dev machine (which is how I triggered the error):

    # Assumes module-level imports: csv, os, and datetime as dt.
    def test_header_too_long(self):
        # Raise the client timeout, since the query below is very large
        self._api.TIMEOUT_SECS = 500
        csv_path = os.path.join(os.path.dirname(__file__), 'data', 'Collection-38379429-sources-20240611133536.csv')
        with open(csv_path) as f:
            reader = csv.DictReader(f)
            all_domains = {row['name'] for row in reader}
        # Sort the domains so the generated query is identical on every run
        query = '"new trial" AND canonical_domain:({})'.format(" OR ".join(sorted(all_domains)))
        start_date = dt.datetime(2024, 1, 1)
        end_date = dt.datetime(2024, 6, 16)
        page1, next_token1 = self._api.paged_articles(query, start_date, end_date)
        assert len(page1) > 0

(Collection-38379429-sources-20240611133536.csv)

rahulbot commented 2 months ago

Works for me on staging 👍🏽 If all unit tests from mc-providers/api-client pass I'd say safe to release.

pgulley commented 1 month ago

Deployed on prod last week!