[Perhaps enhance] Would 'curl-impersonate' help better avoid or reduce the 429 issue?

simonsww commented 6 days ago

Currently, the network is accessed via HTTP. Can it be changed to use curl-impersonate, which supports fingerprint modification? Would this help better avoid or reduce the 429 issue?

https://github.com/sdil87/trendspy/blob/9def6904dd2dd93364a1720e686b515df764ca6d/src/trendspy/client.py#L206

Looking forward to your reply, thank you!

sdil87 commented 5 days ago

I attempted to work around the 429 error by using alternatives like undetected_chromedriver and selenium_stealth. However, the issue persists during aggressive scraping. Based on my observations, the error is primarily dependent on the IP address and the number of requests, regardless of the scraping method used.

To clarify, the 429 error is what’s shown in the image—it occasionally even triggers a CAPTCHA challenge.

In my experience, the most reliable way to avoid 429 errors is to implement pauses between requests. For instance:

I successfully made 500 interest_over_time requests (with 5 keywords each) by including a 5-second pause between them.
Similarly, I tested 600 related_topics requests with a 1-second pause and did not encounter any 429 errors in either case.

Additionally, if the list of keywords remains unchanged and we only need to "refresh" the data, it is possible to cache the token for future requests for the same query. Here's an example:

from trendspy import Trends
from trendspy.converter import TrendsDataConverter
from datetime import datetime, timedelta, timezone
from time import sleep

now = datetime.now(timezone.utc)
before = now - timedelta(hours=2)
after = now + timedelta(minutes=115)

# Form a timeframe of less than 4 hours to retrieve minute-level data
# This includes a future timeframe (115 minutes ahead)
timeframe = f"{before.strftime('%Y-%m-%dT%H')} {after.strftime('%Y-%m-%dT%H')}"

token, data = tr.interest_over_time('python', timeframe=timeframe, return_raw=True)
df = TrendsDataConverter.interest_over_time(data, ['python'])
print(df.tail())

sleep(120)
new_data = tr._token_to_data(token)
df = TrendsDataConverter.interest_over_time(new_data, ['python'])
print(df.tail())

Alternatively, we could use trending_now_showcase_timeline to fetch additional data as needed (tested with 500+ keywords per request).

Hope this helps!

simonsww commented 5 days ago

This is very helpful, thank you for your efforts.

sdil87 commented 4 days ago

In version v0.1.5, I optimized the request handling logic. For each interest_over_time call, two requests are required (one to fetch the token and another to fetch the data using the token). In earlier versions, if the second request (data fetching) encountered a 429 error, it would fail immediately without retrying. This has been fixed in v0.1.5, where retries are now implemented for such cases.

Additionally, I introduced a native delay feature in the library (Trends(request_delay=1)), which can help mitigate 429 errors. This delay is primarily designed for users who are new to the library to prevent them from unintentionally blocking themselves (which is very easy to do). It is worth noting that request_delay can be disabled by setting request_delay=0 for advanced users who prefer more control over their request rate. Without a delay, at my rate, the system quickly hit permanent 429 errors (long-term rate limiting, unclear when it resets).

I also tested bypassing the 429 error using curl-impersonate (via curl_cffi), but unfortunately, it didn’t help. However, I had some moderate success with httpx and tls_client, though these approaches often escalated to 302 errors (Captcha). If you’d like to test this yourself, you can use the following snippet:

import httpx

httpx_proxies = {"http://": proxy, "https://": proxy}
tr = Trends()
tr.session = httpx.Client(proxies=httpx_proxies)

While this method might allow slightly more data retrieval, the restrictions become more severe (Captcha) as a result.

Overall, with the default settings in v0.1.5, I’m now able to consistently make 100+ requests without any errors.

sdil87 / trendspy

[Perhaps enhance] Would 'curl-impersonate' help better avoid or reduce the 429 issue? #4