ranaroussi / yfinance

Download market data from Yahoo! Finance's API
https://aroussi.com/post/python-yahoo-finance
Apache License 2.0

efficiently download info for multiple symbols #1647

Open · Courvoisier13 opened 1 year ago

Courvoisier13 commented 1 year ago

yf.download fetches only prices, and its threads option makes the download very fast.

However, to download info for multiple symbols I have to use Tickers, and when I loop through the results it seems to download the entire API payload for each symbol (including the prices), which is very slow, and there is no parallelization option.

minimal example:

import yfinance as yf

def __filter_dict(bigdict, sub_key):
    # keep only the requested keys, defaulting to None when absent
    return {k: bigdict.get(k, None) for k in sub_key}

fields = ['symbol', 'quoteType', 'shortName', 'longName', 'annualReportExpenseRatio']
ticker_list = ['QQQ', 'SPY']
tickers = yf.Tickers(ticker_list)

tickers_items = tickers.tickers.items()
list_dict = [__filter_dict(value.info, fields) for key, value in tickers_items]

The loop for key, value in tickers_items triggers a full download for every asset in tickers_items, including prices, which I already download elsewhere with yf.download.

Any ideas on how to improve the efficiency and speed of this?

thanks

ValueRaider commented 1 year ago

Why didn't you use the "Bug report" issue template?

Courvoisier13 commented 1 year ago

It's not really a bug. Should I repost it as a bug report?

ValueRaider commented 1 year ago

Fair point, leave this issue up. But your code does not work.

Courvoisier13 commented 1 year ago

I forgot tickers_items = tickers.tickers.items(). Corrected, sorry!

ValueRaider commented 1 year ago

Probably this can be multithreaded, but it sounds like you need to scrape smarter, with caching.

Courvoisier13 commented 1 year ago

Wouldn't that make the call slower, by introducing limits?

btw, I parallelized with:

import time
import yfinance as yf
from concurrent.futures import ThreadPoolExecutor

def __filter_dict(bigdict, sub_key):
    return {k: bigdict.get(k, None) for k in sub_key}

def process_ticker(ticker_item, fields):
    key, value = ticker_item
    return __filter_dict(value.info, fields)

def retry_process_ticker(ticker, fields, max_retries=3, retry_interval=5):
    retries = 0
    while retries < max_retries:
        try:
            return process_ticker(ticker, fields)
        except Exception as e:
            print(f"Failed processing {ticker}: {e}")
            retries += 1
            if retries < max_retries:
                print(f"Retrying in {retry_interval} seconds...")
                time.sleep(retry_interval)
            else:
                # give up: return an empty record carrying only the symbol
                error_dict = {f: '' for f in fields}
                error_dict['symbol'] = ticker[0]
                return error_dict

fields = ['symbol', 'quoteType', 'shortName', 'longName', 'annualReportExpenseRatio']
ticker_list = ['QQQ', 'SPY']
tickers = yf.Tickers(ticker_list)
tickers_items = tickers.tickers.items()
num_threads = 6
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    fund_info_list_dict = list(executor.map(retry_process_ticker, tickers_items, [fields] * len(tickers_items)))

ValueRaider commented 1 year ago

Then don't rate-limit, just cache 🤷
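
A cache-only sketch of that suggestion, following the README's "smarter scraping" pattern; the cache filename and user-agent string below are arbitrary:

import requests_cache
import yfinance as yf

session = requests_cache.CachedSession("yfinance.cache")
session.headers["User-agent"] = "my-program/1.0"  # identify yourself to Yahoo, per the README

ticker = yf.Ticker("QQQ", session=session)
info = ticker.info  # first call hits Yahoo; repeats are served from the on-disk cache until expiry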

kschmid commented 1 year ago

I have a related issue, and I am wondering why download is not being rate-limited for me (I am worried that it may end up in a block). (Sorry, I am not that experienced in Python and pretty new to this library.) I did create a cache according to the guidelines at https://github.com/ranaroussi/yfinance#smarter-scraping:

from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
from pyrate_limiter import Duration, RequestRate, Limiter

class CachedLimiterSession(CacheMixin, LimiterMixin, Session): pass

session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND*5)),  # max 2 requests per 5 seconds
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache(full_cache_path),  # full_cache_path defined elsewhere; alt: SQLiteCache(use_memory=True)
)
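
The session is then handed to yfinance when creating tickers, along these lines ("QQQ" is just an example symbol):

import yfinance as yf
ticker = yf.Ticker("QQQ", session=session)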

When I do the initial download, everything is fine, and it is visibly slowed down: getting the initial ticker plus full history plus 2m history plus 1m history takes over four minutes for ~30 tickers.

However, upon redownload, when I only fetch the 1m data (for updates), everything goes through in about 8 seconds, i.e., ca. 3.75 requests/sec (all without threading). As these are real downloads (I checked the code, and it indeed downloads 2720 lines of data), this would be well above the rate limit of 2 requests per 5 seconds, even though I am using the ticker created with this session. Am I doing anything wrong, or does rate limiting only work during initial ticker creation and not while downloading data? Or does this somehow get lost when I reuse cached information from a previous run (the cache is on disk)?

(If everything is OK and this is just how it is supposed to work, it would also show how to solve the problem of the thread's original poster, but then I would be even more worried about the danger of getting blocked.)

ValueRaider commented 1 year ago

the second 1m download is much faster than first

Obviously it is using the cached data.

kschmid commented 1 year ago

My bad. OK, I should check again after the weekend, when there have been new trades in between (it should then redownload to be complete). Thanks!

ValueRaider commented 1 year ago

when there have been new trades in between

requests_cache is a dumb cache: it only caches GET requests and replays them until they expire; it knows nothing about new trades upstream.
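
A sketch of what that implies in practice: forcing genuinely fresh data means bypassing or clearing the cache yourself, e.g. with requests_cache's cache_disabled() context manager (cache filename arbitrary):

import requests_cache
import yfinance as yf

session = requests_cache.CachedSession("yfinance.cache")
dat = yf.Ticker("QQQ", session=session)
dat.history(period="1d", interval="1m")       # cached on first call
with session.cache_disabled():
    dat.history(period="1d", interval="1m")   # bypasses the cache, always re-downloads
session.cache.clear()                         # or wipe the stored entries outright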

kschmid commented 1 year ago

Thanks for pointing this out; I haven't had to deal with caching before. I am a bit concerned that the README on this GitHub does not include this disclaimer.

I checked out yfinance-cache, but it was significantly slower initially and then crashed with an internal error where the existing code works. (I will post more on their GitHub.) I then resorted to trying to create two sessions and give yfinance the corresponding one, altering:

ticker.session = non_cached_session
ticker._data._session = non_cached_session  # poking yfinance internals

but I couldn't get this to work. It would be really cool if yfinance gave more control over its caching (also for identifying whether a request was served from the cache or not). I finally resorted to the simplest approach, setting more of the cache-control flags, and this apparently provides the expected behavior: some requests are served from the cache and some from Yahoo. Not sure this is exactly as hoped; I will have to check. (But it is now redoing 1/4 of the requests, which is exactly what I expected for my code.)
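
One way to see whether a given request was served from the cache, reusing a session like the ones above: requests_cache marks replayed responses with a from_cache attribute, which a standard requests response hook can log (a sketch, not a yfinance API):

import requests_cache

session = requests_cache.CachedSession("yfinance.cache")

def log_cache_hit(response, *args, **kwargs):
    # from_cache is set by requests_cache on responses it replays
    print(f"{response.url} from_cache={getattr(response, 'from_cache', False)}")

session.hooks["response"].append(log_cache_hit)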

I am now using:

# imports and class as above, with two extra flags:
class CachedLimiterSession(CacheMixin, LimiterMixin, Session): pass

session = CachedLimiterSession(
    limiter=Limiter(RequestRate(2, Duration.SECOND*5)),  # max 2 requests per 5 seconds
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache(full_cache_path),  # alternative: SQLiteCache(use_memory=True)
    cache_control=True,  # honour Cache-Control response headers for expiration, if available
    expire_after=Duration.SECOND*60*60*24*7,  # the parameter is expire_after; entries expire after 7 days
)

i.e., adding the last two flags at the end. I hope this works as expected, but of course it is unclear whether the response headers carry correct expirations. I also set expire_after to ensure that after some time the base data is re-downloaded.

ValueRaider commented 1 year ago

It would be really cool if yfinance would give more control over its caching.

I don't like this, given that it can be solved more easily via direct interaction with requests_cache.

kschmid commented 1 year ago

Maybe I am not deeply enough into this stuff. I tried as much as I could glean from the requests_cache documentation. But this is not easy: without a deep analysis of the yfinance code and a significant understanding of requests_cache (which is also not easy), it is pretty unclear how to achieve any of the things above, e.g., making individual calls bypass the cache, or having specific cache expiries for specific kinds of calls (let alone identifying the result of a request, except by its speed). If there is an easy way to control this, pointers are welcome.

ValueRaider commented 1 year ago

making individual calls bypass the cache, or having specific cache expiries for specific kinds of calls.

You did not specify that you wanted this. I had a quick chat with ChatGPT: you can achieve URL-specific bypass, but not URL-specific expiry without separate cache sessions. But I sense this discussion is beyond the scope of the original issue and should perhaps continue in #1662.

kschmid commented 1 year ago

We use the same tools ;-) You can do URL-specific bypasses with caching, but this does not help: as a user of yfinance I don't know (and should not have to care) which URLs are used. This observation is what switched me to session mangling, but even with code reading (and ChatGPT ;-) ) I could not come up with a solution that works reasonably (hence the other issue). May I ask what you meant by "direct interaction with requests_cache"? Apparently I can no longer directly interact with it once yfinance takes the session and copies it around. (And I would not know what kind of interactions would help.)
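
For reference, a sketch of URL-pattern cache rules in requests_cache (assuming a recent version with urls_expire_after and the DO_NOT_CACHE sentinel; the Yahoo URL pattern below is illustrative, not verified against yfinance's actual endpoints):

from requests_cache import CachedSession, DO_NOT_CACHE

session = CachedSession(
    "yfinance.cache",
    expire_after=60 * 60 * 24 * 7,  # default: entries expire after 7 days
    urls_expire_after={
        "*.finance.yahoo.com/v8/finance/chart*": DO_NOT_CACHE,  # never cache price downloads (illustrative pattern)
    },
)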

ValueRaider commented 1 year ago

May I ask what you meant with "direct interaction with requests_cache"

I mean you configure the requests_cache session object before passing it to yfinance.

kschmid commented 1 year ago

Ah OK, thanks. Yes, this is what I am doing (and it serves some of my needs; perhaps my needs are somewhat specialised).