pyinat / pyinaturalist

Python client for iNaturalist
https://pyinaturalist.readthedocs.io
MIT License

HTTP 429 Rate Limit error on reading observations #551

Closed: nigelcharman closed this issue 5 months ago

nigelcharman commented 6 months ago

The problem

We have a one-off process pulling data from iNaturalist. After running for a minute or two, we see the following HTTP 429 error.

2024-03-26 18:22:47 INFO ----------------------
2024-03-26 18:22:47 INFO Request:
GET https://api.inaturalist.org/v1/observations?id=2869994&only_id=false
User-Agent: python-requests/2.31.0 pyinaturalist/0.18.0
Accept-Encoding: gzip, deflate
Accept: application/json
Connection: keep-alive

2024-03-26 18:22:48 INFO Rate limit exceeded for https://api.inaturalist.org/v1/observations?id=2869994&only_id=false; filling limiter bucket
Traceback (most recent call last):
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/mainMigrate.py", line 37, in <module>
    main(param1)
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/mainMigrate.py", line 28, in main
    copy_count = copier.copyiNatLocations_to_existing_CAMS_features(how_many_records_to_migrate)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/migration/migrate.py", line 58, in copyiNatLocations_to_existing_CAMS_features
    observation = self.get_observation_from_id(observationID)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/migration/migrate.py", line 36, in get_observation_from_id
    observation = pyinaturalist.get_observation(observation_id)       
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/v1/observations.py", line 583, in get_observation
    response = get_observations(id=observation_id, access_token=access_token, **params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/forge/_revision.py", line 328, in inner
    return callable(*mapped.args, **mapped.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/v1/observations.py", line 81, in get_observations
    observations = get(f'{API_V1}/observations', **params).json()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/session.py", line 358, in get
    return session.request('GET', url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/session.py", line 271, in request
    response.raise_for_status()
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.inaturalist.org/v1/observations?id=2869994&only_id=false
Error: Process completed with exit code 1.

Expected behavior

Our understanding was that pyinaturalist would apply rate limiting to stay within iNaturalist's limits, and we are only making about one request per second.

Steps to reproduce the behavior

Create a script that repeatedly calls pyinaturalist.get_observation(observation_id).
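
A minimal sketch (the observation IDs here are placeholders, not the actual IDs from our dataset):

import pyinaturalist

# Fetch single observations in a tight loop; each call is one API request
for observation_id in range(2869994, 2870194):
    observation = pyinaturalist.get_observation(observation_id)
    print(observation['id'])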

Workarounds

We could add a wait to our code

Environment

JWCook commented 6 months ago

I will probably need some more info about how this process is running. Is it running from a CI system or a cloud provider with ephemeral storage? Does it use multiprocessing? Is it connecting to iNat from an IP address shared with other services that also connect to iNat?

The API has per-second, per-minute, and per-day rate limits, tracked per IP address (some more details here). To track these limits on the client side, a small persistent SQLite table records when recent requests were made (via requests-ratelimiter + pyrate-limiter). That's sufficient for a single process, and for multithreading with persistent storage, but some extra work is needed to handle other scenarios like the ones mentioned above.
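
For reference, here's roughly what that client-side pattern looks like using requests-ratelimiter directly (the limit values below are illustrative, not iNat's exact published limits):

from requests_ratelimiter import LimiterSession

# Throttle outgoing requests on the client before they reach the API
session = LimiterSession(per_second=1, per_minute=60)

response = session.get(
    'https://api.inaturalist.org/v1/observations',
    params={'id': 2869994, 'only_id': 'false'},
)
response.raise_for_status()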

nigelcharman commented 6 months ago

The process is running as a GitHub Action. It is single-threaded. It's possible that GitHub shares the IP with other processes; however, we have another job running hourly on the same infrastructure that has never seen these issues.

@amazing-will - We could also test it from a local machine with the same parameters to see if we get the same results?

The thing that is different about this process is that it retrieves one iNaturalist record at a time, whereas our other process pages 200 at a time. This is mostly because we are working from a list of observations and processing them one at a time. Since it is a one-off process that only needs to handle about 4,000 observations, this was a shortcut to get it working quickly. We could modify the code to read more records per request, as in the sketch below, but it would probably be easier to just add a delay, since speed of execution isn't an issue.
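
For comparison, reading in pages would look roughly like this (observation_ids and the print call are stand-ins for our migration code):

from pyinaturalist import get_observations

observation_ids = [2869994, 2869995]  # placeholder for the ~4,000 IDs we're migrating

for start in range(0, len(observation_ids), 200):
    batch = observation_ids[start:start + 200]
    page = get_observations(id=batch, per_page=200)  # one request per 200 records
    for observation in page['results']:
        print(observation['id'])  # stand-in for our per-record processing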

nigelcharman commented 5 months ago

We implemented a workaround of adding a 1-second delay to the processing of each record. It's not ideal, but it is OK for a one-off job. We might rework our code if we need to run something similar on an ongoing basis.
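
The workaround is essentially this (the ID list and processing are placeholders):

import time
import pyinaturalist

observation_ids = [2869994, 2869995]  # placeholder IDs

for observation_id in observation_ids:
    observation = pyinaturalist.get_observation(observation_id)
    # ... process the observation ...
    time.sleep(1)  # fixed pause keeps us at roughly one request per second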

Looking through the logs, it appears we were right on the cusp of what iNaturalist allows, processing 60 records in 60 seconds. I wonder if it could be explained by network latency of a few milliseconds causing iNaturalist to receive 60 requests in slightly less than 60 seconds?

Anyway, I'll close this for now, since we no longer see this issue with the workaround in place.