pnaves commented 6 years ago

Hello,

I get the following error when I try to run stock_prediction.py. I have already tried on Linux CentOS 7 and Windows 10; my Python version is 3.6.5. I followed all the instructions step by step. The other files run fine.

    [root@customiseta MachineLearningStocks]# python3.6 stock_prediction.py
    Building dataset and predicting stocks...
    Traceback (most recent call last):
      File "stock_prediction.py", line 55, in <module>
        predict_stocks()
      File "stock_prediction.py", line 42, in predict_stocks
        y_pred = clf.predict(X_test)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 538, in predict
        proba = self.predict_proba(X)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
        X = self._validate_X_predict(X)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
        return self.estimators_[0]._validate_X_predict(X, check_input=True)
      File "/usr/lib64/python3.6/site-packages/sklearn/tree/tree.py", line 373, in _validate_X_predict
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
      File "/usr/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 462, in check_array
        context))
    ValueError: Found array with 0 sample(s) (shape=(0, 41)) while a minimum of 1 is required.

robertmartin8 commented 6 years ago

Hi, sorry for the late response.

The error seems to be saying that you are trying to predict without any input. You could confirm this by inserting a print(X), and if you see an empty array/df then the problem is probably that data wasn't downloaded properly.
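A minimal sketch of such a check, assuming a hypothetical helper named `require_samples` (it is not part of the project). It fails early with a clearer message than the one scikit-learn raises, and it works on anything that supports `len()`, such as a list of rows or a NumPy array.

```python
def require_samples(X, name="X"):
    """Raise a descriptive error if X has no rows (hypothetical helper)."""
    n_rows = len(X)
    if n_rows == 0:
        raise ValueError(
            f"{name} is empty: check that the input CSV was generated "
            "correctly before calling predict()"
        )
    return n_rows
```

Calling `require_samples(X_test, "X_test")` just before `clf.predict(X_test)` would point at the data rather than at the classifier.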

Just in case it's an sklearn error, can I check what version you're using? Try running the following in terminal:

pip show scikit-learn
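The same information can be read from inside Python. This is only a sketch: it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library (the interpreter in this thread is 3.6.5, which does not have it), and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("scikit-learn"))
```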

Robert

pnaves commented 6 years ago

Hello Robert,

I didn't run the first script, "download_historical_prices.py", because you said in the project description that there is an error in Yahoo Finance, so I just used the files included in the project: "sp500_index.csv" and "stock_prices.csv". I thought it would run fine, since the input data was present in the project. However, the code that assigns a value to the variable "X_test" always returns empty:

features = data.columns[6:]  # data has values
X_test = data[features].values  # X_test is always empty

I'm a beginner in Python and I can't understand the code well, because the variable data is a vector with a lot of values.

When I run "pip show scikit-learn" I get the following report:

    Name: scikit-learn
    Version: 0.19.1
    Summary: A set of python modules for machine learning and data mining
    Home-page: http://scikit-learn.org
    Author: Andreas Mueller
    Author-email: amueller@ais.uni-bonn.de
    License: new BSD
    Location: c:\program files\python\python36\lib\site-packages
    Requires:
    Required-by:

Regards, Pedro

robertmartin8 commented 6 years ago

Hi Pedro,

Can I check whether you downloaded the fundamental data? This is different to the stock price data, and is what the algorithm attempts to learn from. Please see this part of the readme for more:

https://github.com/robertmartin8/MachineLearningStocks#historical-stock-fundamentals

Best, Robert

pnaves commented 6 years ago

Hello Robert,

Note 1: The original script "download_historical_prices.py" can download the file "sp500_index.csv" but can't download "stock_prices.csv". However, I'm not worried about downloading the data automatically at this moment. I just want to be able to run the final script, stock_prediction.py. I manually downloaded the file ^GSPC.csv from Yahoo Finance and renamed it to sp500_index.csv. Now the error appears in backtesting.py:

    D:\python_projects\MachineLearningStocks>python backtesting.py
    C:\Program Files\Python\Python36\lib\site-packages\sklearn\ensemble\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    Traceback (most recent call last):
      File "backtesting.py", line 82, in <module>
        backtest()
      File "backtesting.py", line 42, in backtest
        clf.fit(X_train, y_train)
      File "C:\Program Files\Python\Python36\lib\site-packages\sklearn\ensemble\forest.py", line 247, in fit
        X = check_array(X, accept_sparse="csc", dtype=DTYPE)
      File "C:\Program Files\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 462, in check_array
        context))
    ValueError: Found array with 0 sample(s) (shape=(0, 41)) while a minimum of 1 is required.

Can you include CSV data files that really work in the project source, for test purposes? I'm wondering why your original .csv data files don't work.

Thanks, Pedro

pnaves commented 6 years ago

Hello Robert,

I found the error. The generated file forward_sample.csv has a Date column, and every row of this column has the value 0. This field is used as an index in the later scripts. Do you know why this file is being generated with zeros? This time I executed all the files, including download_historical_prices.py, but the file sp500_index.csv was empty after this step, so I downloaded it manually from Yahoo Finance and renamed it. After that, I just got an error message in the last file, "stock_prediction.py", but I think the error is in the previous files. It seems easy to follow the instructions and run the scripts, but the errors are very difficult to find, even using a debugging tool like PyCharm.

Tks, Pedro

robertmartin8 commented 6 years ago

As per my previous comment, it appears you haven't downloaded the fundamental data. As specified in the readme, you must download the fundamental data from this link. The fundamental data is different to sp500_index.csv and stock_prices.csv.

Robert

pnaves commented 6 years ago

I had already downloaded intraquarter. Are there other files to download? My problem is not related to intraquarter.

robertmartin8 commented 6 years ago

Were you able to run backtesting.py?
When you try to run current_data.py, what happens? In order to generate predictions, we need to download the most recent financial data, to which we will apply our trained model.

pnaves commented 6 years ago

1) YES

Classifier performance

Accuracy score: 0.80
Precision score: 0.81

Stock prediction performance report

Total Trades: 161
Average return for stock predictions: 39.1 %
Average market return in the same period: 9.1%
Compared to the index, our strategy earns 30.0 percentage points more

2) a) current_data.py goes to 100%. I got some encoding errors on Windows, but it kept processing until 100%. On Linux there are no encoding warnings and it also reaches 100%, but stock_prediction.py gives the same error on both Linux and Windows.
b) I just want to be able to run all the scripts to study the code, so I'm using the included data files sp500_index.csv and stock_prices.csv. I tried to run download_historical_prices.py, but sp500_index.csv was empty at the end of processing, although the percentage reached 100%. I'm not worried about the age of the resulting data at this moment, so if you can include some data files in the project that actually work, that would work for me.

Pedro

robertmartin8 commented 6 years ago

Hmmm the fact that the backtesting works means that the price data, index data and fundamental data are fine.

So the only issue is the latest data. After you run current_data.py, do you see a new directory called forward/? Can you please look inside and see whether the html files are there? You mention that you've got "encode warnings" – could you attach a sample?

current_data.py first downloads html into this directory, then parses that html into the dataset in forward_sample.csv, so if your forward sample is empty it means that either the download or the parsing has gone wrong. The issue is not with stock_prediction.py.
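A quick way to see which of the two stages failed is to count what each one produced. The sketch below is an illustration only: `check_forward_data` is a hypothetical helper, and it assumes the downloaded pages are saved with an `.html` extension inside `forward/`.

```python
import csv
from pathlib import Path


def check_forward_data(html_dir="forward", csv_path="forward_sample.csv"):
    """Return (number of HTML files, number of CSV data rows)."""
    html_dir, csv_path = Path(html_dir), Path(csv_path)
    # Stage 1: the download should leave one HTML file per ticker.
    n_html = len(list(html_dir.glob("*.html"))) if html_dir.is_dir() else 0
    # Stage 2: the parser should write one CSV row per ticker.
    n_rows = 0
    if csv_path.is_file():
        with csv_path.open(newline="") as f:
            rows = list(csv.reader(f))
        n_rows = max(len(rows) - 1, 0)  # subtract the header row
    return n_html, n_rows
```

If the first number is zero the download went wrong; if the first is non-zero and the second is zero, the parsing did. Both can be non-zero while individual columns are still empty, so this only rules out the coarse failures.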

ecbc1 commented 6 years ago

Looks like I got the same error when running stock_prediction.py.

I do see the forward/ directory created with contents. I'll attach one of the HTML files, but I suspect the HTML files are fine. I renamed it to a .txt extension so I could attach it. I don't remember getting any encoding errors (I'm running on Linux). aapl.txt

My forward_sample.csv exists. I'm wondering if the parsing by current_data.py is OK. I'll attach a few lines of my forward_sample.csv for your evaluation or I can just attach the csv file. thx!

    0,0,AAPL,0,0,0,0,20.21,20.21,20.21,16.44,1.46,4.22,9.4,4.49,14.6,21.98,26.6,12.22,45.37,255270000000.0,50.63,17.3,88190000000.0,78530000000.0,56120000000.0,11.04,32.1,70970000000.0,14.69,114600000000.0,99.7,1.31,23.74,73030000000.0,41440000000.0,1.26,208.96,187.64,25380000.0,4830000000.0,4570000000.0,0.07,61.33,42160000.0,1.61,0.83,
    0,0,ABBV,0,0,0,0,141950000000.0,177200000000.0,23.27,10.49,0.75,4.59,,5.73,13.68,20.84,36.57,11.0,489.67,30950000000.0,19.49,19.2,21530000000.0,12950000000.0,6420000000.0,4.03,3.6,3740000000.0,2.47,37750000000.0,,0.8,-2.23,11370000000.0,9330000000.0,1.59,95.05,99.69,6140000.0,1510000000.0,1510000000.0,0.09,71.54,32580000.0,4.5,2.34,

robertmartin8 commented 6 years ago

That's very odd. The html files are fine as you say, and there seems to be nothing wrong with forward_sample (so I think current_data is doing fine). Going back to what you said earlier:

features = data.columns[6:]  # data has values
X_test = data[features].values  # X_test is always empty

This seems very strange. If data has values, I can't understand why a slice of it is empty. Could you do me a favour and print a bit of each? So change those lines to

print(data)
features = data.columns[6:]
print(features)
X_test = data[features].values
print(X_test)

I'd love to get to the bottom of this. We've narrowed it down to probably stock_prediction.py.

ecbc1 commented 6 years ago

So I added the print statements and below is what I get:

    $ python3.6 stock_prediction.py
    Building dataset and predicting stocks...
    Empty DataFrame
    Columns: [Unix, Ticker, Price, stock_p_change, SP500, SP500_p_change, Market Cap, Enterprise Value, Trailing P/E, Forward P/E, PEG Ratio, Price/Sales, Price/Book, Enterprise Value/Revenue, Enterprise Value/EBITDA, Profit Margin, Operating Margin, Return on Assets, Return on Equity, Revenue, Revenue Per Share, Quarterly Revenue Growth, Gross Profit, EBITDA, Net Income Avi to Common, Diluted EPS, Quarterly Earnings Growth, Total Cash, Total Cash Per Share, Total Debt, Total Debt/Equity, Current Ratio, Book Value Per Share, Operating Cash Flow, Levered Free Cash Flow, Beta, 50-Day Moving Average, 200-Day Moving Average, Avg Vol (3 month), Shares Outstanding, Float, % Held by Insiders, % Held by Institutions, Shares Short, Short Ratio, Short % of Float, Shares Short (prior month)]
    Index: []

    [0 rows x 47 columns]
    Index(['Market Cap', 'Enterprise Value', 'Trailing P/E', 'Forward P/E', 'PEG Ratio', 'Price/Sales', 'Price/Book', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA', 'Profit Margin', 'Operating Margin', 'Return on Assets', 'Return on Equity', 'Revenue', 'Revenue Per Share', 'Quarterly Revenue Growth', 'Gross Profit', 'EBITDA', 'Net Income Avi to Common', 'Diluted EPS', 'Quarterly Earnings Growth', 'Total Cash', 'Total Cash Per Share', 'Total Debt', 'Total Debt/Equity', 'Current Ratio', 'Book Value Per Share', 'Operating Cash Flow', 'Levered Free Cash Flow', 'Beta', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month)'], dtype='object')
    []
    Traceback (most recent call last):
      File "stock_prediction.py", line 58, in <module>
        predict_stocks()
      File "stock_prediction.py", line 45, in predict_stocks
        y_pred = clf.predict(X_test)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 538, in predict
        proba = self.predict_proba(X)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
        X = self._validate_X_predict(X)
      File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
        return self.estimators_[0]._validate_X_predict(X, check_input=True)
      File "/usr/lib64/python3.6/site-packages/sklearn/tree/tree.py", line 373, in _validate_X_predict
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
      File "/usr/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 462, in check_array
        context))
    ValueError: Found array with 0 sample(s) (shape=(0, 41)) while a minimum of 1 is required.

robertmartin8 commented 6 years ago

Ah so actually the dataframe data is empty. So we have narrowed the problem down to

    data = pd.read_csv('forward_sample.csv', index_col='Date')
    data.dropna(axis=0, how='any', inplace=True)
    features = data.columns[6:]

You mention that your forward_sample.csv is fine (you can check this by opening in excel or something and just making sure it's got a lottt of numbers in it). If that's the case, then by elimination the problem must be in the data.dropna() (though I wouldn't have thought it).

So what I think might be happening is that your copy of forward has one column that is full of NaN values (you can check this by looking for an empty column in excel). Then, because we are dropping any rows with NaNs, we end up dropping the whole dataframe. To confirm this, can you please add the following print:

    data = pd.read_csv('forward_sample.csv', index_col='Date')
    print(data.isnull().sum())
    data.dropna(axis=0, how='any', inplace=True)
    features = data.columns[6:]

After this, you can replace your copy of forward_sample.csv with my test copy (sharing it on dropbox).

I'm hopeful that this'll resolve the issue!

ecbc1 commented 6 years ago

Your forward_sample.csv file worked! After I used your file, the proper output was given.

    $ python3.6 stock_prediction.py
    Building dataset and predicting stocks...
    Unix 0
    Ticker 0
    Price 0
    stock_p_change 0
    SP500 0
    SP500_p_change 0
    Market Cap 75
    Enterprise Value 75
    Trailing P/E 146
    Forward P/E 76
    PEG Ratio 81
    Price/Sales 76
    Price/Book 97
    Enterprise Value/Revenue 76
    Enterprise Value/EBITDA 105
    Profit Margin 75
    Operating Margin 75
    Return on Assets 90
    Return on Equity 94
    Revenue 76
    Revenue Per Share 76
    Quarterly Revenue Growth 78
    Gross Profit 101
    EBITDA 105
    Net Income Avi to Common 75
    Diluted EPS 75
    Quarterly Earnings Growth 263
    Total Cash 77
    Total Cash Per Share 77
    Total Debt 91
    Total Debt/Equity 132
    Current Ratio 110
    Book Value Per Share 75
    Operating Cash Flow 120
    Levered Free Cash Flow 144
    Beta 153
    50-Day Moving Average 74
    200-Day Moving Average 74
    Avg Vol (3 month) 75
    Shares Outstanding 75
    Float 75
    % Held by Insiders 79
    % Held by Institutions 79
    Shares Short 133
    Short Ratio 92
    Short % of Float 75
    Shares Short (prior month) 109
    dtype: int64
    17 stocks predicted to outperform the S&P500 by more than 10%:
    NOC FL PH SWK NFX DF LH SCHL DDS AIZ SFLY GME IR M AMP BBBY APD

ecbc1 commented 6 years ago

So I'm comparing your file (TOP) and the one that was generated on my side (BOTTOM). See attached file.

I noticed the last column Shares Short has no data on my side (BOTTOM). My guess is the lack of data in the last column caused the error.

Now the question becomes why that column was not populated. My guess is that the formatting changed in the HTML files and the parser in current_data.py couldn't read the data to populate forward_sample.csv.

What do you think? thx

[Attached image: fwd_sample_diff]

robertmartin8 commented 6 years ago

Yeah so the problem is that when you dropna(how='any'), it removes all of the rows.
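The effect is easy to reproduce on a toy frame. The data below is made up for illustration and only stands in for forward_sample.csv; the second `dropna` chain is one possible mitigation, not what the project does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for forward_sample.csv: the last column was never populated.
df = pd.DataFrame({
    "Ticker": ["AAPL", "ABBV", "NOC"],
    "Beta": [1.26, 1.59, np.nan],
    "Shares Short (prior month)": [np.nan, np.nan, np.nan],
})

# how='any' drops a row if ANY of its columns is NaN,
# so a single all-NaN column removes every row.
dropped = df.dropna(axis=0, how="any")

# Possible mitigation: remove all-NaN columns first, then drop incomplete rows.
rescued = df.dropna(axis=1, how="all").dropna(axis=0, how="any")

print(len(dropped), len(rescued))
```

Dropping a column also changes the number of features, which a model trained on the full feature set would not accept, so this shows the mechanism rather than a fix.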

You're right, it must be a change in the html files.. but if you look at yahoo finance the data is still there. So they must have changed the exact format or name.

ecbc1 commented 6 years ago

So I'm looking at the code and the Yahoo Finance page. I see a difference between the features and what's on the website. Do you think this is the issue? It changed from (prior month) to (prior month Jul 12, 2018).

Thx

[Attached image: yahoo_aapl]

ecbc1 commented 6 years ago

I got it to work. The fix is to remove the last ")", because the HTML changed. It should look like this in current_data.py:

'Shares Short (prior month']
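Why dropping the closing parenthesis helps can be shown with a small regex sketch. Everything here is an assumption for illustration: the HTML fragment is invented, and `extract_value` and its pattern are hypothetical, not the regex used in current_data.py. The point is only that a label matched as a prefix keeps working when the page appends a date to it.

```python
import re

# Hypothetical HTML fragment; the real Yahoo Finance markup is not shown in this thread.
html = "<td>Shares Short (prior month Jul 12, 2018)</td><td>42.16M</td>"


def extract_value(label, source):
    """Return the first number-like cell after `label`, or None (illustrative regex)."""
    pattern = re.escape(label) + r".*?<td>(-?\d+(?:\.\d+)?[KMB]?)</td>"
    match = re.search(pattern, source)
    return match.group(1) if match else None


# The old label no longer appears verbatim in the page, so nothing is found.
print(extract_value("Shares Short (prior month)", html))
# The shortened label is still a prefix of the new text, so the value is found.
print(extract_value("Shares Short (prior month", html))
```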

robertmartin8 commented 6 years ago

Ah, excellent! Thank you for finding the exact issue – I wouldn't really have had the time to look at the html to find the error. Would you like to submit the PR so your contribution can be noted? If not, let me know and I'll push the fix.

Out of interest, when you run pytest -v in the terminal do any of the tests fail (using the old forward_sample)? If not, I may add a test that looks like this:

    data = pd.read_csv('forward_sample.csv', index_col='Date')
    data.dropna(axis=0, how='any', inplace=True)
    assert not (data.isnull().sum() == len(data)).any()

Basically making sure that we don't have any empty columns.

ecbc1 commented 6 years ago

Hi, I forked the repo, created a pull request, and updated the code for current_data.py.

As for the pytest -v, I got a failure in test_datasets.py, so I edited tests/test_datasets.py and removed the parenthesis and it passed. I still got a failure on test_variables.py. I haven't figured out what's wrong here yet and will try to get back to it....

Section of test_datasets.py I modified:

positive_features = ['Market Cap', 'Price/Sales', 'Revenue', 'Revenue Per Share', 'Total Cash', 'Total Cash Per Share', 'Total Debt', '50-Day Moving Average', '200-Day Moving Average', 'Avg Vol (3 month)', 'Shares Outstanding', 'Float', '% Held by Insiders', '% Held by Institutions', 'Shares Short', 'Short Ratio', 'Short % of Float', 'Shares Short (prior month']

    $ pytest -v
    ============================================== test session starts ===============================================
    platform linux -- Python 3.6.5, pytest-3.4.1, py-1.6.0, pluggy-0.6.0 -- /usr/bin/python3.6
    cachedir: .pytest_cache
    rootdir: /opt/MLS, inifile:
    collected 9 items

    tests/test_datasets.py::test_forward_sample_dimensions PASSED [ 11%]
    tests/test_datasets.py::test_forward_sample_data PASSED [ 22%]
    tests/test_datasets.py::test_stock_prices_dataset PASSED [ 33%]
    tests/test_datasets.py::test_stock_prediction_dataset PASSED [ 44%]
    tests/test_utils.py::test_status_calc PASSED [ 55%]
    tests/test_utils.py::test_data_string_to_float PASSED [ 66%]
    tests/test_variables.py::test_statspath PASSED [ 77%]
    tests/test_variables.py::test_features_same FAILED [ 88%]
    tests/test_variables.py::test_outperformance PASSED [100%]

    ==================================================== FAILURES ====================================================
    _______________ test_features_same _______________

    def test_features_same():
        # There are only four differences (intentionally)
        assert set(parsing_keystats.features) - set(current_data.features) == {'Qtrly Revenue Growth', 'Qtrly Earnings Growth',
                                                                               'Shares Short (as of', 'Net Income Avl to Common'}
    E AssertionError: assert {'Net Income ...prior month)'} == {'Net Income A...Short (as of'}
    E Extra items in the left set:
    E 'Shares Short (prior month)'
    E Full diff:
    E {'Net Income Avl to Common',
    E 'Qtrly Earnings Growth',
    E 'Qtrly Revenue Growth',
    E - 'Shares Short (as of',...
    E
    E ...Full output truncated (5 lines hidden), use '-vv' to show

    tests/test_variables.py:17: AssertionError
    ======================================= 1 failed, 8 passed in 5.07 seconds =======================================
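The failing comparison is a set difference, and the reported extra item follows from renaming the label in only one of the two feature lists. The lists below are abbreviated, hypothetical stand-ins (the real ones in parsing_keystats.py and current_data.py are much longer); they only reproduce the shape of the failure.

```python
# Abbreviated stand-ins for the two feature lists (assumed for illustration).
parsing_features = ["Market Cap", "Shares Short (as of", "Shares Short (prior month)"]
current_features_old = ["Market Cap", "Shares Short", "Shares Short (prior month)"]
current_features_new = ["Market Cap", "Shares Short", "Shares Short (prior month"]

# The differences the test expects (shortened to match the toy lists).
expected = {"Shares Short (as of"}

diff_old = set(parsing_features) - set(current_features_old)
diff_new = set(parsing_features) - set(current_features_new)

print(diff_old == expected)  # comparison before the label was shortened
print(diff_new - expected)   # the extra item once only one list is edited
```

With these stand-ins, the label that was shortened in one list but not the other shows up as an unexpected difference, which matches the "Extra items in the left set" line in the pytest output.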

robertmartin8 commented 6 years ago

Ok great, once you've submitted the PR I'll have a quick look and merge it in. The test situation is a bit messy... I guess test_features_same isn't really such a useful test anyway so I might remove/rewrite it in future.

ecbc1 commented 6 years ago

Hi, I updated current_data.py and test_datasets.py. The change was to remove the parenthesis. Please merge after your review. (I'm not sure if I did the github pull right). thx.

The pytest failure in def test_features_same() still exists. I haven't had a chance to figure out what's wrong there but it may not be useful anyway.

robertmartin8 commented 6 years ago

Haven't received the PR yet, you may want to check that you've followed the steps shown in this guide. Don't worry about test_features_same(), I'll try to fix that later.

ecbc1 commented 6 years ago

OK, I think I created the PR this time, let me know if it still doesn't work, thx!

robertmartin8 commented 6 years ago

Alright, I've merged! Thank you for all the detective work in figuring out what went wrong! If you have any other comments or suggested improvements, I'd love to hear them: just raise a new issue (like if you think some parts of the readme were poorly explained).

robertmartin8 / MachineLearningStocks

I've got an error when I try to run stock_prediction.py #17
