Courvoisier13 opened this issue 1 year ago
Why didn't you use the "Bug report" issue template?
It's not really a bug. I can repost it as a bug report?
Fair point, leave this issue up. But your code does not work.
I forgot tickers_items = tickers.tickers.items(). Correcting, sorry!
Probably this can be multithreaded, but it sounds like you need to scrape smarter, with caching.
Wouldn't that make the call slower, by introducing limits?
btw, I parallelized with:
import time
import yfinance as yf
from concurrent.futures import ThreadPoolExecutor

def _filter_dict(bigdict, sub_keys):
    # keep only the requested keys from the full info dict
    return {k: bigdict.get(k, None) for k in sub_keys}

def process_ticker(ticker_item, fields):
    key, value = ticker_item  # (symbol, yf.Ticker) pair
    return _filter_dict(value.info, fields)

def retry_process_ticker(ticker_item, fields, max_retries=3, retry_interval=5):
    retries = 0
    while retries < max_retries:
        try:
            return process_ticker(ticker_item, fields)
        except Exception as e:
            print(f"Failed processing {ticker_item[0]}: {e}")
            retries += 1
            if retries < max_retries:
                print(f"Retrying in {retry_interval} seconds...")
                time.sleep(retry_interval)
    # all retries exhausted: return an empty record for this symbol
    error_dict = {f: '' for f in fields}
    error_dict['symbol'] = ticker_item[0]
    return error_dict

fields = ['symbol', 'quoteType', 'shortName', 'longName', 'annualReportExpenseRatio']
ticker_list = ['QQQ', 'SPY']
tickers = yf.Tickers(ticker_list)
tickers_items = tickers.tickers.items()

num_threads = 6
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    fund_info_list_dict = list(
        executor.map(retry_process_ticker, tickers_items, [fields] * len(tickers_items))
    )
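For what it's worth, a slightly tidier way to pass the constant fields argument is functools.partial; a sketch, assuming the same functions and variables as above:

from functools import partial

with ThreadPoolExecutor(max_workers=num_threads) as executor:
    # each item of tickers_items becomes the first argument; fields is fixed
    fund_info_list_dict = list(
        executor.map(partial(retry_process_ticker, fields=fields), tickers_items)
    )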
Then don't rate-limit, just cache 🤷
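For illustration, the simplest cache-only setup would be a plain requests_cache session handed to yfinance; a sketch, assuming requests_cache is installed (the cache file name is arbitrary):

import yfinance as yf
import requests_cache

# cache GET responses on disk; no rate limiter at all
session = requests_cache.CachedSession("yfinance.cache")
ticker = yf.Ticker("SPY", session=session)
info = ticker.info  # re-running this later is served from the cache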
I have a related issue and I am wondering why downloading is not rate-limited for me (and I am worried that it may end up in a block). (Sorry, I am not that experienced in Python and pretty new to this library.) I created a cache according to the guidelines at https://github.com/ranaroussi/yfinance#smarter-scraping:
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
from pyrate_limiter import Duration, RequestRate, Limiter

class CachedLimiterSession(CacheMixin, LimiterMixin, Session): pass

session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND*5)),  # max 2 requests per 5 seconds
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache(full_cache_path),  # alternative: SQLiteCache(use_memory=True)
)
When I do the initial download, everything is fine; it is obviously slowed down. (Getting the initial ticker + full history + 2m history + 1m history takes over four minutes for ~30 tickers.)
However, upon redownload, when I only fetch the 1m data (for updates), all of this goes through in about 8 seconds, i.e., ca. 3.75 requests/sec (everything without threading). As these are true downloads (I checked the code and it indeed downloads 2720 lines of data), this is well above the rate limit of 2 requests per 5 seconds, even though I am using tickers created with this session. Am I doing anything wrong, or does rate limiting only apply during initial ticker creation and not during data download? Or does this somehow get lost when I reuse cached information from a previous run (the cache is on disk)?
(If everything is OK and this is just how it is supposed to work, it would also show how to solve the problem of the original poster, but then I would be even more worried about the danger of getting blocked...)
> the second 1m download is much faster than the first

Obviously it is using the cached data. Responses served from the cache never reach the rate limiter (CacheMixin sits before LimiterMixin in the session class), so only true network requests are throttled.
My bad... OK, I should check again after the weekend, when there have been new trades in between (and it should then redownload to be complete). Thanks!
> when there have been new trades in between

requests_cache is a dumb cache; it only caches GET requests.
Thanks for pointing this out. I didn't have to deal with caching before. I am a bit concerned that the info on this GitHub does not include this disclaimer.
I checked out yfinance-cache, but it was significantly slower initially and then crashed with an internal error where the existing code works (I will post more on their GitHub). I resorted to trying to make two sessions and giving yfinance the corresponding session (altering:
ticker.session = non_cached_session
ticker._data._session = non_cached_session
but I couldn't get this to work). It would be really cool if yfinance gave more control over its caching (also for identifying whether a request was served from the cache or not). I finally resorted to the simplest approach: setting more of the cache-control flags, and this apparently provides the expected behavior: some requests are served from the cache and some from Yahoo. Not sure this is exactly as hoped; I will have to check. (But it is now redoing 1/4 of the requests, which is exactly what I expected for my code.)
I am now using:
class CachedLimiterSession(CacheMixin, LimiterMixin, Session): pass

session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND*5)),  # max 2 requests per 5 seconds
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache(full_cache_path),  # alternative: SQLiteCache(use_memory=True)
    cache_control=True,  # use Cache-Control response headers for expiration, if available
    expire_after=Duration.SECOND*60*60*24*7,  # cache expires after 7 days
)
i.e., adding the last two flags at the end. I hope this works as expected, but of course it is unclear whether the response headers carry correct expirations. I also set expire_after to ensure that after some time the base data is redownloaded.
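As an aside on identifying cache hits: requests_cache marks every response it returns with a from_cache attribute, so a response hook on the session can log hits and misses even while yfinance is driving the requests. A sketch, assuming the session defined above:

def log_cache_status(response, *args, **kwargs):
    # requests_cache sets from_cache=True on responses served from the backend
    hit = getattr(response, "from_cache", False)
    print(f"{'HIT ' if hit else 'MISS'} {response.url}")

session.hooks["response"].append(log_cache_status)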
> It would be really cool if yfinance would give more control over its caching.

I don't like this, given it can be solved more easily via direct interaction with requests_cache.
Maybe I am not deeply enough into this stuff. I tried as much as I could get from the requests_cache documentation, but this is not easy: without a deep analysis of the yfinance code and a significant understanding of requests_cache (which also is not easy), it is pretty unclear how to achieve any of the things above, e.g., making individual calls bypass the cache, or having specific cache expirations for specific kinds of calls (let alone identifying whether a request was served from the cache, except by its speed). If there is an easy way to control this, pointers are welcome.
> making individual calls bypass the cache, or having specific cache expirations for specific kinds of calls

You did not specify that you wanted this. I had a quick chat with ChatGPT: you can achieve a URL-specific bypass, but not URL-specific expiry without separate cache sessions. But I sense this discussion is beyond the scope of the original issue, and maybe it should continue in #1662.
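For illustration, one way to get a URL-specific bypass with requests_cache is its urls_expire_after option with the DO_NOT_CACHE sentinel; a sketch, reusing the session class from above (the URL pattern is illustrative, not confirmed):

from requests_cache import DO_NOT_CACHE

session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND*5)),
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache(full_cache_path),
    urls_expire_after={
        # hypothetical pattern: never cache responses from this endpoint
        "*.finance.yahoo.com/v8/finance/chart/*": DO_NOT_CACHE,
    },
)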
We use the same tools ;-) You can do URL-specific bypasses with caching, but this does not help: as a user of yfinance I don't know (and should not care) which URLs are used. This observation was what switched me to session mangling, but even with code reading (and ChatGPT ;-) ) I could not come up with a solution that works reasonably (hence the other issue). May I ask what you meant by "direct interaction with requests_cache"? Apparently I can no longer directly interact with it once yfinance takes the session and copies it around (and I wouldn't know what kind of interaction would help).
> May I ask what you meant by "direct interaction with requests_cache"

I mean you configure the requests_cache session object before passing it to yfinance.
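For example, the session's own requests_cache controls keep working after it is handed over, assuming yfinance uses the session object it was given; a sketch with the cache_disabled() context manager forcing a fresh download:

import yfinance as yf

# temporarily bypass the cache for a fresh 1m download, assuming `session`
# is the CachedLimiterSession configured earlier
with session.cache_disabled():
    fresh = yf.Ticker("SPY", session=session).history(period="1d", interval="1m")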
Ah OK, thanks. Yeah, this is what I am doing (and it serves some of my needs; perhaps my needs are somewhat specialised).
yf.download downloads only prices, and it has a threads option that makes the download very fast. However, if I want to download info for multiple symbols, I have to use Tickers, and when I loop through the results it seems to download the entire API results (including prices), which is very slow, and there is no parallelization option. Minimal example:
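A sketch of what that minimal example presumably looks like, reconstructed from the snippets elsewhere in this thread:

import yfinance as yf

fields = ['symbol', 'quoteType', 'shortName', 'longName', 'annualReportExpenseRatio']
tickers = yf.Tickers(['QQQ', 'SPY'])
tickers_items = tickers.tickers.items()

# each .info access triggers a full scrape for that symbol
fund_info = [{k: value.info.get(k, None) for k in fields} for key, value in tickers_items]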
The loop for key, value in tickers_items is triggering a full download of all the assets in tickers_items, including prices, which I am already downloading using yf.download in a different place. Any ideas on how to improve the efficiency and speed of this? Thanks!