ranaroussi / yfinance

Download market data from Yahoo! Finance's API
https://aroussi.com/post/python-yahoo-finance
Apache License 2.0
13.25k stars 2.34k forks

Excessive HTML requests used, additional easily retrievable data not gotten #102

Closed GregoryMorse closed 4 years ago

GregoryMorse commented 5 years ago

The library may want to stop relying on pandas-parsed HTML tables for the financials and info: all of this data, plus a considerable amount of additional information, can be fetched with a single request.

Here is straightforward parsing code that achieves this:

import json
import re
import requests

StockName = 'MSFT'  # example ticker symbol (an assumption; substitute any symbol)
scrape_url = 'https://finance.yahoo.com/quote'
url = '%s/%s/%s' % (scrape_url, StockName, 'financials')
req = requests.get(url=url)
idx = req.text.find('root.App.main = ')
# Take the single line holding the assignment, then drop the prefix and the trailing ';'
j = json.loads(re.sub('root.App.main = ', '', req.text[idx:].split('\n')[0])[:-1])
j['context']
j['context']['dispatcher']['stores']['StreamDataStore']['quoteData'][StockName] #contains Ticker.info
j['context']['dispatcher']['stores']['QuoteSummaryStore'].keys()
# Output of the last expression:
# dict_keys(['cashflowStatementHistory', 'balanceSheetHistoryQuarterly', 'earnings', 'price', 'incomeStatementHistoryQuarterly', 'incomeStatementHistory', 'balanceSheetHistory', 'cashflowStatementHistoryQuarterly', 'quoteType', 'summaryDetail', 'symbol', 'pageViews'])

The remaining piece, parsing this into table data with the correct row ordering and string captions, can be found in this JavaScript file: https://s.yimg.com/uc/finance/dd-site/js/Quote.financials.938c6b86ad7b69ba7927.min.js
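The extraction step above can be sketched as a small helper that works on page text already in hand, so it can be tried without a network call. The name `extract_app_state` and the sample page are illustrative assumptions, not part of yfinance. The sketch also assumes the `root.App.main` assignment sits on a single line and ends with a semicolon, as the snippet above does.

```python
import json

MARKER = 'root.App.main = '

def extract_app_state(html):
    """Return the JSON object assigned to root.App.main in a page's text."""
    idx = html.find(MARKER)
    if idx == -1:
        # A truncated or blocked response will not contain the marker at all.
        raise ValueError('root.App.main marker not found; page may be truncated')
    # Assumption: the assignment occupies one line and ends with ';'.
    line = html[idx + len(MARKER):].split('\n')[0]
    return json.loads(line.rstrip().rstrip(';'))

# Hypothetical, heavily shortened page body, for illustration only.
sample = ('<script>\n'
          'root.App.main = {"context": {"dispatcher": {"stores": {}}}};\n'
          '}(this));\n'
          '</script>')
state = extract_app_state(sample)
print(sorted(state['context']['dispatcher']))
```

Raising an explicit error when the marker is missing makes a truncated response easier to tell apart from a JSON parsing problem.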

nono-london commented 5 years ago

Thanks for the post, very useful. One thing I have noticed is that when using a proxy, the JSON is sometimes truncated and does not contain the information part (j['context'] etc.). I guess Yahoo may be noticing robot activity, or it may be due to the poor quality of the proxy pool I used... Best

GregoryMorse commented 5 years ago

This is already fully implemented in #104

Your problem is probably not the proxy. Yahoo is sensitive to browser headers, since it wants to serve a site compatible with the requesting browser. So when using requests.get, try passing headers=my_headers with: my_headers = { 'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362' }

For yfinance, we should probably choose the most popular header for a very modern browser, e.g. Chrome.
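As a minimal sketch of the header suggestion, the following builds (but does not send) a request carrying a browser-like User-Agent. It uses the standard library's `urllib.request` only to stay dependency-free; with requests, the same dict would be passed as `headers=` to `requests.get`. The helper name `build_request` and the ticker `MSFT` are assumptions for illustration, and the User-Agent string is the one quoted in this thread.

```python
import urllib.request

# User-Agent string quoted from this thread; whether Yahoo still accepts it is not guaranteed.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
}

def build_request(symbol, page='financials'):
    """Build, without sending, a GET request for a Yahoo Finance quote page."""
    url = 'https://finance.yahoo.com/quote/%s/%s' % (symbol, page)
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = build_request('MSFT')
print(req.full_url)
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))
```

Keeping the headers in one module-level dict means every call site sends the same browser identity, which is the behaviour proposed for yfinance here.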

nono-london commented 5 years ago

Great help, thanks, looks like it solved the problem. Best

nono-london commented 5 years ago

Hi, I saw your code proposal for this library, but until it is accepted, I am not sure how I can use it. So far I use the above code with loops (which doesn't look optimal...). Any chance you could share it here as a standalone? All the best

GregoryMorse commented 5 years ago

I do not see anything sub-optimal about the loops. If you look at the JSON data structures returned, you will see that a lot of enumeration is required to adapt them. Here is the standalone code:

import json
import re

import pandas as pd
import requests

my_headers = { 'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362' }

#demjson.decode() and jsonnet.evaluate_snippet() can parse raw javascript string
def get_financial_translator(vendor):
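    r"""Return (statement keys, row layout). The disabled steps below show
    how the hardcoded layout was scraped from Yahoo's JavaScript bundles.
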
    #vendor = 'https://s.yimg.com/uc/finance/dd-site/js/vendor.d859a2b02e2b0845735f.min.js'
    req = requests.get(url=vendor)
    r = req.text
    res = re.search('t\.p\+\"\"\+\((.*?)\[e\]\|\|e\)\+\"\.\"\+(.*?)\[e\]\+\"\.min\.js\";', r)
    ks = json.loads(re.sub(r'([\{\s,])(\w+)(:)', r'\1"\2"\3', res[1]))
    vs = json.loads(re.sub(r'([\{\s,])(\w+)(:)', r'\1"\2"\3', res[2]))
    fncneUrl = vs[list(ks.keys())[list(ks.values()).index('Quote.financials')]]
    url = 'https://s.yimg.com/uc/finance/dd-site/js/Quote.financials.' + fncneUrl + '.min.js'
    req = requests.get(url=url)
    strs = req.text.split('e.exports=')
    objs = [json.loads(re.sub(':!0', ':true', re.sub(r'([\{\s,])(\w+)(:)', r'\1"\2"\3', re.sub('}(}\);|,\d+:function\(e,t\){)', '', strs[n])))) for n in range(6, 9)]
    ft = {n['item']:[[x['items'] if 'items' in x else '',x['title'],x['isDate'] if 'isDate' in x else False] for x in n['config']] for n in objs}
    print(ft)
    """
    #valid financial translator as of 2019/9/24
    ft = {'incomeStatement': [['endDate', 'REVENUE', True], ['totalRevenue', 'TOTAL_REVENUE', False], ['costOfRevenue', 'COST_OF_REVENUE', False], ['grossProfit', 'GROSS_PROFIT', False], ['', 'OPERATING_EXPENSES', False], ['researchDevelopment', 'RESEARCH_DEVELOPMENT', False], ['sellingGeneralAdministrative', 'SELLING_GEN_ADMIN', False], ['nonRecurring', 'NON_RECURRING', False], ['otherOperatingExpenses', 'OTHERS', False], ['totalOperatingExpenses', 'TOTAL_OPERATING_EX', False], ['operatingIncome', 'OPERATING_INCOME_LOSS', False], ['', 'INCOME_FROM_CONTINUING_OPS', False], ['totalOtherIncomeExpenseNet', 'TOTAL_OTHER_INCOME_EXPENSES_NET', False], ['ebit', 'EARNINGS_BEFORE_INTEREST_TAX', False], ['interestExpense', 'INTEREST_EXPENSE', False], ['incomeBeforeTax', 'INCOME_BEFORE_TAX', False], ['incomeTaxExpense', 'INCOME_TAX_EXPENSE', False], ['minorityInterest', 'MINORITY_INTEREST', False], ['netIncomeFromContinuingOps', 'NET_INCOME_FROM_CONTINUING_OPS', False], ['', 'NON_RECURRING_EVENTS', False], ['discontinuedOperations', 'DISCONTINUED_OPS', False], ['extraordinaryItems', 'EXTRAORDINARY_ITEMS', False], ['effectOfAccountingCharges', 'EFFECT_OF_ACCOUNTING_CHANGES', False], ['otherItems', 'OTHER_ITEMS', False], ['', 'NET_INCOME_TITLE', False], ['netIncome', 'NET_INCOME', False], ['preferredStock', 'PREFERRED_STOCK_OTHER_ADJ', False], ['netIncomeApplicableToCommonShares', 'NET_INCOME_APPLICABLE_TO_COMMON_SHARES', False]], 'balanceSheet': [['endDate', 'PERIOD_ENDING', True], ['', 'CURRENT_ASSETS', False], ['cash', 'CASH_AND_CASH_EQUIVALENTS', False], ['shortTermInvestments', 'SHORT_TERM_INVESTMENTS', False], ['netReceivables', 'NET_RECEIVABLES', False], ['inventory', 'INVENTORY', False], ['otherCurrentAssets', 'OTHER_CURRENT_ASSETS', False], ['totalCurrentAssets', 'TOTAL_CURRENT_ASSETS', False], ['longTermInvestments', 'LONG_TERM_INVESTMENTS', False], ['propertyPlantEquipment', 'PROPERTY_PLANT_AND_EQUIPMENT', False], ['goodWill', 'GOODWILL', False], ['intangibleAssets', 
'INTANGIBLE_ASSETS', False], ['accumulatedAmortization', 'ACCUMULATED_AMORTIZATION', False], ['otherAssets', 'OTHER_ASSETS', False], ['deferredLongTermAssetCharges', 'DEFERRED_LONG_TERM_ASSET_CHARGES', False], ['totalAssets', 'TOTAL_ASSETS', False], ['', 'CURRENT_LIABILITIES', False], ['accountsPayable', 'ACCOUNTS_PAYABLE', False], ['shortLongTermDebt', 'SHORT_CURRENT_LONG_TERM_DEBT', False], ['otherCurrentLiab', 'OTHER_CURRENT_LIABILITIES', False], ['totalCurrentLiabilities', 'TOTAL_CURRENT_LIABILITIES', False], ['longTermDebt', 'LONG_TERM_DEBT', False], ['otherLiab', 'OTHER_LIABILITIES', False], ['deferredLongTermLiab', 'DEFERRED_LONG_TERM_LIABILITY_CHARGES', False], ['minorityInterest', 'MINORITY_INTEREST', False], ['negativeGoodWill', 'NEGATIVE_GOODWILL', False], ['totalLiab', 'TOTAL_LIABILITIES', False], ['', 'STOCKHOLDERS_EQUITY', False], ['stockOptionWarrants', 'MISC_STOCKS_OPTIONS_WARRANTS', False], ['redeemablePreferredStock', 'REDEEMABLE_PREFERRED_STOCK', False], ['redeemablePreferredStock', 'PREFERRED_STOCK', False], ['commonStock', 'COMMON_STOCK', False], ['retainedEarnings', 'RETAINED_EARNINGS', False], ['treasuryStock', 'TREASURY_STOCK', False], ['capitalSurplus', 'CAPITAL_SURPLUS', False], ['otherStockholderEquity', 'OTHER_STOCKHOLDER_EQUITY', False], ['totalStockholderEquity', 'TOTAL_STOCKHOLDER_EQUITY', False], ['netTangibleAssets', 'NET_TANGIBLE_ASSETS', False]], 'cashflowStatement': [['endDate', 'PERIOD_ENDING', True], ['netIncome', 'NET_INCOME', False], ['', 'OPERATING_ACTIVITIES_CASHFLOWS_PROVIDED', False], ['depreciation', 'DEPRECIATION', False], ['changeToNetincome', 'ADJUSTMENT_TO_NET_INCOME', False], ['changeToAccountReceivables', 'CHANGES_IN_ACCOUNTS_RECEIVABLES', False], ['changeToLiabilities', 'CHANGES_IN_LIABILITIES', False], ['changeToInventory', 'CHANGES_IN_INVENTORIES', False], ['changeToOperatingActivities', 'CHANGES_IN_OTHER_OPERATING_ACT', False], ['totalCashFromOperatingActivities', 'TOTAL_CASH_FLOW_FROM_OP_ACT', False], ['', 
'INVESTING_ACTIVITIES_CASHFLOWS_PROVIDED', False], ['capitalExpenditures', 'CAPITAL_EX', False], ['investments', 'INVESTMENTS', False], ['otherCashflowsFromInvestingActivities', 'OTHER_CASHFLOWS_FROM_INVESTING_ACT', False], ['totalCashflowsFromInvestingActivities', 'TOTAL_CASH_FLOW_FROM_INVEST_ACT', False], ['', 'FINANCING_ACTIVITIES_CASHFLOWS_PROVIDED', False], ['dividendsPaid', 'DIVIDENDS_PAID', False], ['salePurchaseOfStock', 'SALE_PURCHASE_OF_STOCK', False], ['netBorrowings', 'NET_BORROWINGS', False], ['otherCashflowsFromFinancingActivities', 'OTHER_CASHFLOWS_FROM_FINANCING_ACT', False], ['totalCashFromFinancingActivities', 'TOTAL_CASH_FLOW_FROM_FIN_ACT', False], ['effectOfExchangeRate', 'EFFECT_OF_EXCHANGE_RATE_CHANGES', False], ['changeInCash', 'CHANGE_IN_CASH_AND_EQ', False]]}
    return ([('incomeStatementHistory','incomeStatementHistory','incomeStatement'),
        ('cashflowStatementHistory','cashflowStatements','cashflowStatement'),
        ('balanceSheetHistory','balanceSheetStatements','balanceSheet'),
        ('incomeStatementHistoryQuarterly','incomeStatementHistory','incomeStatement'),
        ('cashflowStatementHistoryQuarterly','cashflowStatements','cashflowStatement'),
        ('balanceSheetHistoryQuarterly','balanceSheetStatements','balanceSheet')],
        ft)

#url = '%s/%s/%s' % (scrape_url, StockName, 'sustainability')
#q['esgScores'].keys()
#https://jsonformatter.org/
def get_quarterly_financials(StockName, tz=None):
  scrape_url = 'https://finance.yahoo.com/quote'
  url = '%s/%s/%s' % (scrape_url, StockName, 'financials')
  req = requests.get(url=url, headers = my_headers)
  parseTag = 'root.App.main = '
  r = req.text
  idx = r.find(parseTag)
  j = json.loads(r[idx:].split('\n')[0][len(parseTag):][:-1])
  #dct = j['context']['dispatcher']['stores']['StreamDataStore']['quoteData'][StockName] #contains Ticker.info but less complete and with a UUID
  #inf = {x:(dct[x]['raw'] if type(dct[x]) is dict else dct[x]) for x in dct}
  q = j['context']['dispatcher']['stores']['QuoteSummaryStore']
  #{x:((q['price'][x]['raw'] if 'raw' in q['price'][x] else '') if type(q['price'][x]) is dict else q['price'][x]) for x in q['price']}
  # Raw string so the regex escapes are not treated as (invalid) string escapes
  (fncls, ft) = get_financial_translator(re.search(r'https://s\.yimg\.com/uc/finance/dd-site/js/vendor\..*?\.min\.js', req.text)[0])
  strings = j['context']['dispatcher']['stores']['LangStore']['baseLangs']['td-app-finance']
  dfs = []
  for (nm, sbnm, knm) in fncls:
    if nm not in q or sbnm not in q[nm]:
        df = pd.DataFrame()  # statement missing for this ticker
    else:
        stmts = q[nm][sbnm]
        # One row per layout entry: caption first, then one value per period.
        # '-' marks a field absent from a period, '' a field without a raw value.
        df = [['' if x[2] else strings[x[1]]] + [(s[x[0]]['raw'] if 'raw' in s[x[0]] else '') if x[0] in s else '-' for s in stmts] for x in ft[knm]]
        # The first row holds the period end dates and becomes the column header.
        df = pd.DataFrame(df[1:], None, df[0])
        df.set_index('', inplace=True)
        df.columns = pd.to_datetime(df.columns, unit="s")
        if tz is not None: df.columns = df.columns.tz_localize(tz)
        # Blank out the placeholders, then convert everything to float.
        df = df.where((df != '-') & (df != '')).astype(float)
    dfs.append(df)
  return dfs
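
To make the enumeration easier to follow, here is a small sketch of the row-building step on its own, in plain Python without pandas. The data is made up for illustration (it is not real Yahoo output), the layout is a four-entry excerpt of the translator table above, and `build_rows` is a hypothetical helper name.

```python
# Made-up stand-ins for the structures used above; values are not real data.
layout = [['endDate', 'REVENUE', True],
          ['totalRevenue', 'TOTAL_REVENUE', False],
          ['', 'OPERATING_EXPENSES', False],
          ['researchDevelopment', 'RESEARCH_DEVELOPMENT', False]]
strings = {'TOTAL_REVENUE': 'Total Revenue',
           'OPERATING_EXPENSES': 'Operating Expenses',
           'RESEARCH_DEVELOPMENT': 'Research Development'}
statements = [
    {'endDate': {'raw': 1561852800, 'fmt': '2019-06-30'},
     'totalRevenue': {'raw': 100, 'fmt': '100'},
     'researchDevelopment': {}},          # present but without a raw value
    {'endDate': {'raw': 1530316800, 'fmt': '2018-06-30'},
     'totalRevenue': {'raw': 90, 'fmt': '90'}},
]

def build_rows(statements, layout, strings):
    """Turn per-period statement dicts into caption-first rows."""
    rows = []
    for key, title, is_date in layout:
        # The date row has no caption; every other row uses the localized string.
        row = ['' if is_date else strings[title]]
        for st in statements:
            if key not in st:
                row.append('-')   # field absent for this period (or a section header)
            else:
                row.append(st[key]['raw'] if 'raw' in st[key] else '')
        rows.append(row)
    return rows

rows = build_rows(statements, layout, strings)
for row in rows:
    print(row)
```

The first row carries the period end dates as Unix timestamps, which is why the full function above uses it as the column header and converts it with `pd.to_datetime(..., unit="s")`.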