robertmartin8 / MachineLearningStocks

Using python and scikit-learn to make stock predictions
MIT License

Adaptation request #27

Closed DebugMeIfYouCan closed 5 years ago

DebugMeIfYouCan commented 5 years ago

Hello, I luckily ended up on your project as I'm looking at scraping data from Yahoo Finance for a list of quotes (not only the S&P 500). I was wondering if there is a way to get part of your script adapted to my needs. Specifically, I've got a list of quotes in a .txt file. I currently use the YahooFinancials Python API, but I realised that some key figures are missing, such as Cash, Debt and Levered Free Cash Flow. So far, I'm collecting the data with a custom Python script and then dumping it to a JSON file. Would you be able to help me? Thanks :)

robertmartin8 commented 5 years ago

Hi,

I took a look at the YahooFinance API, and it looks really good – I would suggest that you stick with that. Every data source, bar the expensive professional ones, will have missing data. You should have a preprocessing script that cleans the missing data, either by filling it with suitable estimates or by ignoring those rows/columns.

Best, Robert
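
A minimal sketch of the kind of preprocessing step described above, using only the standard library. It assumes the data is a dictionary of ticker -> {field: value or None}. The function name `clean_missing`, the `max_missing` threshold, the choice of the median as the "suitable estimate", and the sample figures are all illustrative assumptions, not something taken from this project.

```python
from statistics import median


def clean_missing(rows, fields, max_missing=1):
    """Drop sparse rows, then fill remaining gaps with the per-field median.

    rows: dict mapping ticker -> dict of field -> number or None.
    """
    # Ignore tickers that are missing more than `max_missing` of the fields.
    kept = {
        ticker: dict(values)
        for ticker, values in rows.items()
        if sum(values.get(f) is None for f in fields) <= max_missing
    }
    # Fill each remaining gap with the median of the values that are present.
    for f in fields:
        present = [v[f] for v in kept.values() if v.get(f) is not None]
        if not present:
            continue  # nothing to estimate from; leave the gaps as None
        fill = median(present)
        for values in kept.values():
            if values.get(f) is None:
                values[f] = fill
    return kept


# Made-up tickers and figures, purely for illustration.
sample = {
    'AAA': {'cash': 10.0, 'longTermDebt': 5.0},
    'BBB': {'cash': None, 'longTermDebt': 7.0},
    'CCC': {'cash': 30.0, 'longTermDebt': 9.0},
    'DDD': {'cash': None, 'longTermDebt': None},
}
cleaned = clean_missing(sample, ['cash', 'longTermDebt'])
```

Here 'DDD' is dropped because both of its fields are missing, and the gap in 'BBB' is filled from the other tickers. Whether a median is a sensible estimate depends on the field, so treat that choice as a placeholder.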

DebugMeIfYouCan commented 5 years ago

Hi Robert, thanks for your insight! :) Could you please confirm that this is the API you looked at: https://github.com/JECSand/yahoofinancials ? I know I'll end up with missing data no matter what I do in the "free" world, but I would like to be able to back up my first batch of data with some kind of fallback. My idea is to first scrape the Yahoo Finance website the way you do, as it has most of what I need, and then fall back to the YahooFinancials API for whatever is missing. I already have the second part (fetching data from YahooFinancials), but I still need to put in place the first part (scraping Yahoo the way you do) and then the fallback system.

The main issue I have with YahooFinancials is that key figures are missing, such as Cash, Debt or Levered Free Cash Flow. But those are things you seem to provide.

Would you be able to help or suggest what could be used from your project please? Happy to share my code if you want to have a quick look at it. Cheers :)
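
The fallback idea described above could be sketched roughly as follows, assuming both sources have already been turned into dictionaries of ticker -> {field: value}. The function `merge_with_fallback`, the variable names, the field name `leveredFreeCashFlow` and the figures are hypothetical; the scraping step itself is not shown.

```python
def merge_with_fallback(primary, fallback):
    """Prefer values from `primary`; use `fallback` only where primary has none."""
    merged = {}
    for ticker in set(primary) | set(fallback):
        first = primary.get(ticker) or {}
        second = fallback.get(ticker) or {}
        merged[ticker] = {}
        for field in set(first) | set(second):
            value = first.get(field)
            if value is None:
                # Missing or empty in the primary source: try the fallback.
                value = second.get(field)
            merged[ticker][field] = value
    return merged


# Made-up figures: `scraped` stands in for the scraper's output and
# `from_api` for the dictionary built from YahooFinancials.
scraped = {'MMM': {'cash': 1.0, 'leveredFreeCashFlow': None}}
from_api = {
    'MMM': {'cash': 2.0, 'leveredFreeCashFlow': 3.0, 'marketCap': 4.0},
    'VOD': {'cash': 5.0},
}
combined = merge_with_fallback(scraped, from_api)
```

In this example the scraped `cash` value for MMM wins, the missing `leveredFreeCashFlow` is taken from the fallback, and VOD, which the primary source does not cover at all, comes entirely from the fallback.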

DebugMeIfYouCan commented 5 years ago

Here is my current script:

import os
import json
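try:
    from yahoofinancials import YahooFinancials
except ImportError:
    #Install the missing package, then retry the import so the name is actually defined
    os.system('pip install yahoofinancials')
    from yahoofinancials import YahooFinancials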

def ask_keys():
    #print('!Enter all inputs in lower case form!')
    summary_keys = input('Enter the keys for summary, each separated by a comma and a space: ')
    stats_keys = input('Enter the keys for stats, each separated by a comma and a space: ')
    financial_keys = input('Enter the keys for financial, each separated by a comma and a space: ')
    summary_keys_ = summary_keys.split(',')
    stats_keys_ = stats_keys.split(',')
    financial_keys_ = financial_keys.split(',')

    summary_key_list = []
    stats_key_list = []
    financial_key_list = []

    for key in summary_keys_:
        summary_key_list.append(key.strip())
        #print(summary_key_list)
    for key in stats_keys_:
        stats_key_list.append(key.strip())
    for key in financial_keys_:
        financial_key_list.append(key.strip())

    print('-----------------------------------')
    return summary_key_list, stats_key_list, financial_key_list


def main():
    #To get the tickers list

    #All you need to do is edit these lists for the specific summary/stats/financial data you need:
    #add a key to a list to make it appear in that part of the output, or remove it to drop it.
    summary_keys_for_lookup = ['previousClose', 'dividendRate', 'dividendYield', 'marketCap', 'forwardPE']
    stat_keys_for_lookup = ['forwardEps', 'trailingEps', 'floatShares']
    financial_keys_for_lookup = ['cash', 'longTermDebt', 'totalCashFromOperatingActivities', 'capitalExpenditures']
    #-------------------------------------------------------------------------------------------------

    summary_key_list, stats_key_list, financial_key_list = summary_keys_for_lookup, stat_keys_for_lookup, financial_keys_for_lookup
    mytickers = []
    with open('quotes.txt') as file:
        lines = file.readlines()
        for line in lines:
            if line != '\n':
                mytickers.append(line.strip())
    print(mytickers)

    #myquotes = open("quotes.txt","r")
    #mytickers = myquotes.read().splitlines()
    #print(mytickers)

    yahoo_financials = YahooFinancials(mytickers)

    #To fetch the data from the API
    summary = yahoo_financials.get_summary_data()
    print('Getting summary data...')
    stats = yahoo_financials.get_key_statistics_data()
    print('Getting stats data...')
    financial = yahoo_financials.get_financial_stmts('annual', ['income', 'cash', 'balance'])
    print('Getting financial data...')
    #To build the data set
    '''
    data_output = {
        'summary': summary,
        'stats': stats,
        'financial': financial
    }
    '''
    data_dct = {}
    for key in summary:
        #print(key, '\t', summary[key])
        #data_dct[key] =
        summ = summary[key]
        data_dct[key] = {}
        for key_ in summary_key_list:
            try:
                for key__ in summ:
                    if key_.lower() == key__.lower():
                        data_dct[key][key_] = summ[key__]

            except KeyError:
                continue

    for key in stats:
        #print(key, '\t', stats[key])
        # data_dct[key] =
        stat = stats[key]
        #data_dct[key] = {}
        for key_ in stats_key_list:
            try:
                for key__ in stat:
                    if key_.lower() == key__.lower():
                        data_dct[key][key_] = stat[key__]
            except KeyError:
                continue

    #The statements are already keyed by ticker, so a single pass over them is enough
    for key in financial:
        for k_ in financial[key]:
            fin = financial[key][k_]
            for k__ in fin:
                for nk in k__:
                    financ = k__[nk]
                    for key_ in financial_key_list:
                        try:
                            data_dct[k_][key_] = financ[key_]
                        except KeyError:
                            continue

    #json_obj = json.dumps(data_dct, indent=7)
    #print(json_obj)

    #JSON file output
    with open("file.json", "w") as f:
        json.dump(data_dct, f, indent=2)

    #To print the data set
    #print(data_output)


if __name__ == '__main__':
    main()

robertmartin8 commented 5 years ago

Yeah, that's the API I was looking at. Was the data only missing for non-S&P 500 stocks?

DebugMeIfYouCan commented 5 years ago

@robertmartin8 I need the missing data for S&P 500 stocks and for other US stocks available on Yahoo. For example, I would like to be able to scrape the Levered Free Cash Flow for ticker MMM (in the S&P 500) as well as for VOD (not in the S&P 500). I just want whatever is on the statistics page of Yahoo Finance, as long as the ticker is available. Let me know if your script does this and whether you could help me integrate this feature into my existing script.

robertmartin8 commented 5 years ago

This project is more about processing the data and training a machine learning classifier, so unfortunately it does not include the script used to scrape the data.

Best of luck! Robert