Determine economic datasets

jeff1evesque commented 5 years ago

We need to determine additional required economic datasets.

jeff1evesque commented 5 years ago

We need to clear our RStudio workspace, remove corresponding custom packages, then run our app.R.

jeff1evesque commented 5 years ago

The following segment in our name_to_ticker.R generates an error:

> df = load_data_fin654(
+     paste0(cwd, '/data/data-breaches.csv'),
+     paste0(cwd, '/data/Privacy_Rights_Clearinghouse-Data-Breaches-Export.csv'),
+     paste0(cwd, '/python/dataframe.py')
+   )
> tickers = name_to_ticker(
+     df$company,
+     c(
+       paste0(cwd, '/data/amex.csv'),
+       paste0(cwd, '/data/nasdaq.csv'),
+       paste0(cwd, '/data/nyse.csv')
+     ),
+     c(paste0(cwd, '/python/dataframe.py'), paste0(cwd, '/python/name_to_ticker.py'))
+   )
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  KeyError: "['name' 'symbol'] not in index"

Detailed traceback: 
  File "<string>", line 14, in name_to_ticker
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Python36\lib\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))

Specifically, the following Python segment fails:

def name_to_ticker(series, ref, col_1, col_2):
    '''

    convert list of company names to list of tickers.

    @col_1, column converted to index in dict
    @col_2, column converted to value in dict

    '''

    references = ref[[col_1, col_2]].set_index(col_1).to_dict()
    return([x if x not in references else references[x] for x in series])
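
The KeyError above means that, at the point of selection, ref has no columns literally named 'name' and 'symbol'. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the exchange CSV headers differ in case or whitespace (for example 'Symbol' and 'Name'). A minimal sketch that normalizes the headers and fails early with a clearer message; select_columns is a hypothetical helper, not part of the project code:

```python
import pandas as pd


def select_columns(ref, col_1, col_2):
    """Return ref[[col_1, col_2]] after normalizing header case and whitespace.

    Hypothetical helper, not part of the project code.
    """
    # normalize headers such as ' Symbol' or 'Name' to 'symbol' / 'name'
    normalized = ref.rename(columns=lambda c: str(c).strip().lower())

    missing = [c for c in (col_1, col_2) if c not in normalized.columns]
    if missing:
        raise KeyError(
            'missing columns {}; available: {}'.format(
                missing, list(normalized.columns)
            )
        )
    return normalized[[col_1, col_2]]


# illustrative data only; the real exchange CSV headers are assumed, not verified
ref = pd.DataFrame({'Symbol': ['AAPL'], 'Name': ['Apple Inc.']})
selected = select_columns(ref, 'name', 'symbol')
```

If the headers already match, the helper simply returns the two requested columns unchanged.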

jeff1evesque commented 5 years ago

d240477: the adjusted code returns the company names, not the desired company ticker.

jeff1evesque commented 5 years ago

The following logic returns a list of NULL:

    references = ref[[col_1, col_2]].set_index(col_1).to_dict()
    return([references[x] if x in references else None for x in series])

This suggests that no element in the series exists in the constructed references. Therefore, more effort is needed to verify whether references is properly constructed.

jeff1evesque commented 5 years ago

Temporarily adjusting name_to_ticker.py:

def name_to_ticker(series, ref, col_1, col_2):
    '''

    convert list of company names to list of tickers.

    @col_1, column converted to index in dict
    @col_2, column converted to value in dict

    '''
    return(ref[[col_1, col_2]])

Returns the desired structure:

(screenshot: name-ticker)

This suggests to_dict() does not convert the above dataframe to the flat name-to-ticker dict we expected, which prevents the if condition from succeeding in the list comprehension [references[x] if x in references else None for x in series].
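
A likely explanation, based on documented pandas behaviour: DataFrame.to_dict() defaults to orient='dict', which returns a nested dict keyed by column name. The membership test x in references therefore compares each company name against column names such as 'symbol'. A minimal sketch with illustrative data (the rows below are made up, not taken from the project CSV files):

```python
import pandas as pd

# illustrative reference data, not taken from the project CSV files
ref = pd.DataFrame({
    'name': ['apple inc.', 'keycorp'],
    'symbol': ['aapl', 'key'],
})

# DataFrame.to_dict() defaults to orient='dict': {column -> {index -> value}}
nested = ref[['name', 'symbol']].set_index('name').to_dict()
# nested == {'symbol': {'apple inc.': 'aapl', 'keycorp': 'key'}}

# selecting the single column first yields a Series, whose to_dict() is flat
flat = ref.set_index('name')['symbol'].to_dict()
# flat == {'apple inc.': 'aapl', 'keycorp': 'key'}

# names missing from the reference map to None
series = ['apple inc.', 'wordpress']
tickers = [flat.get(x) for x in series]
```

Indexing the nested result with the value column, as in references[col_2], would give the same flat mapping.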

jeff1evesque commented 5 years ago

Replacing the earlier return with either of the following implementations:

    return(ref[[col_1, col_2]].loc[series])
    return(ref[[col_1, col_2]].loc[series][[col_2]])

Generates the following error traceback:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  KeyError: 'None of [[\'cathay pacific airways\', \'chinese resume leak\', \'blur\', \'blank media games\', \'wordpress\', \'google+\', \'quora\', \'marriott hotels\', \'nmbs\', \'facebook\', \'panerabread\', \'aadhaar\', \'dixons carphone\', \'myheritage\', \'saks and lord & taylor\', \'careem\', \'texas voter records\', \'british airways\', \'t-mobile\', \'myfitnesspal\', \'health south east\', \'nametests\', \'ticketmaster\', \'firebase\', \'aadhaar\', \'grindr\', \'orbitz\', \'mbm company\', \'localblox\', \'twitter\', \'viewfines\', \'ticketfly\', \'amazon\', \'amazon\', \'urban massage\', \'dell\', \'high tail hall\', \'sky brasil\', \'vision direct\', \'healthcare.gov\', \'cms\', \'facebook\', \'newegg\', \'disqus\', \'rootsweb\', \'yahoo\', \'uber\', \'wonga\', \'snapchat\', \'spambot\', \'cex\', \'al.type\', \'cellebrite\', \'waterly\', \'swedish transport agency\', \'hong kong registration & electoral office\', \'river city media\', \'dafont\', \'bell\', \'zomato\',
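
The error is consistent with how .loc works: it selects by index label, and ref still carries its default integer index, so none of the company names are found. A minimal sketch of two label-safe alternatives, using made-up rows; it assumes the reference columns are named 'name' and 'symbol' as in the earlier traceback:

```python
import pandas as pd

# illustrative data only
ref = pd.DataFrame({
    'name': ['facebook, inc.', 'apple inc.'],
    'symbol': ['fb', 'aapl'],
})
series = pd.Series(['facebook, inc.', 'wordpress', 'apple inc.'])

# .loc selects by index label, so the names must become the index first;
# duplicates are dropped because reindex requires a unique index
lookup = ref.drop_duplicates('name').set_index('name')['symbol']

# reindex tolerates labels that are absent and fills them with NaN
matched = lookup.reindex(series.values)

# alternatively, keep only the reference rows whose name occurs in the series
subset = ref[ref['name'].isin(series)]
```

The reindex form keeps one result per input name, while the isin form returns only the matching reference rows.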

jeff1evesque commented 5 years ago

f80d986: we were able to ensure name_to_ticker returns corresponding ticker names:

(screenshot: tickers)

However, the subsequent logic, which stores the matching tickers back into the original dataframe, generated an error about mismatched row counts:

> df$ticker = tickers
Error in `$<-.data.frame`(`*tmp*`, ticker, value = c(`525` = "aapl", `879` = "celg",  : 
  replacement has 58 rows, data has 2760

jeff1evesque commented 5 years ago

Changing dataframe.py to the following:

    def set_column(self, column, ref, new_key):
        '''

        For each value in 'column' that exists in the provided reference, append the
        corresponding value into a new column, 'new_key', on the current dataframe.

        '''

        return(print([x['name'] for i,x in ref.iterrows()]))
        vals = [x['symbol'] if x['name'] in self.df[column] else None for i,x in ref.iterrows()]
        #self.df[new_key] = vals

produces the following company names:

['apple inc.', 'celgene corporation', 'celgene corporation', 'copart, inc.', 'docusign, inc.', 'facebook, inc.', 'intuit inc.', 'marriott international', 'multi-color corporation', 'nvidia corporation', 'performant financial corporation', 'sabre corporation', 'the madison square garden company', 'aecom', 'american express company', 'broadridge financial solutions, inc.', 'citigroup inc.', 'citigroup inc.', 'citigroup inc.', 'citigroup inc.', 'citigroup inc.', 'citigroup inc.', 'delta air lines, inc.', 'discover financial services', 'dollar general corporation', 'first data corporation', 'first republic bank', 'first republic bank', 'first republic bank', 'first republic bank', 'first republic bank', 'first republic bank', 'genesco inc.', 'global payments inc.', 'kb home', 'kbr, inc.', 'keycorp', 'keycorp', 'keycorp', 'morgan stanley', 'morgan stanley', 'morgan stanley', 'morgan stanley', 'morgan stanley', 'morgan stanley', 'morgan stanley', 'occidental petroleum corporation', 'perkinelmer, inc.', 'qvc, inc.', 'rite aid corporation', 'rollins, inc.', 'stanley black & decker, inc.', 'stanley black & decker, inc.', 'stanley black & decker, inc.', 'suntrust banks, inc.', 'suntrust banks, inc.', 'the madison square garden company', 'weyerhaeuser company']

However, changing dataframe.py:

    def set_column(self, column, ref, new_key):
        '''

        For each value in 'column' that exists in the provided reference, append the
        corresponding value into a new column, 'new_key', on the current dataframe.

        '''

        return(print(self.df[column]))
        vals = [x['symbol'] if x['name'] in self.df[column] else None for i,x in ref.iterrows()]
        #self.df[new_key] = vals

Produces the following output:

17100        occidental petroleum corporation
12010                                 kb home
12210                                 keycorp
393                      rite aid corporation
428                            citigroup inc.
566                      weyerhaeuser company
568                              copart, inc.
625                                   keycorp
718                       celgene corporation
810                                 kbr, inc.
823      broadridge financial solutions, inc.
879                            citigroup inc.
1016                     rite aid corporation
1099                   first data corporation
1113                           docusign, inc.
1197                      first republic bank
1252                     rite aid corporation
1347             stanley black & decker, inc.
1375                            rollins, inc.
1386                     global payments inc.
1426                               apple inc.
1451                             genesco inc.
1526              discover financial services
1527              discover financial services
1535              discover financial services
1547                 american express company
1559              discover financial services
1588                 american express company
1647                                    aecom
1716                       nvidia corporation
1788                        sabre corporation
1792                           morgan stanley
1912                        perkinelmer, inc.
1957                  multi-color corporation
2048                                qvc, inc.
2050        the madison square garden company
2167                        sabre corporation
2184                        sabre corporation
2194                        sabre corporation
2219         performant financial corporation
2342               dollar general corporation
2359                              intuit inc.
2375                    delta air lines, inc.
2380                     suntrust banks, inc.
2406                           facebook, inc.
2454                           facebook, inc.
2457                   marriott international
Name: company, dtype: object
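
One detail worth noting about the commented-out comprehension: the in operator on a pandas Series tests the index labels, not the values. The check x['name'] in self.df[column] therefore compares company names against row labels such as 625 or 1426. A minimal sketch with made-up data:

```python
import pandas as pd

# illustrative data; the index mimics the non-contiguous row labels above
company = pd.Series(['keycorp', 'apple inc.'], index=[625, 1426], name='company')

# 'in' on a Series checks the index, so a company name is not found
name_in_series = 'keycorp' in company

# an index label, by contrast, is found
label_in_series = 625 in company

# testing against the underlying values gives the intended result
name_in_values = 'keycorp' in company.values

# vectorized membership test over the values
name_via_isin = company.isin(['keycorp']).any()
```

Here name_in_series is False while the last three checks are True.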

jeff1evesque commented 5 years ago

The following variant of dataframe.py suffices:

    def set_column(self, column, ref, new_key):
        '''

        For each value in 'column' that exists in the provided reference, append the
        corresponding value into a new column, 'new_key', on the current dataframe.

        '''

        ## only 'ref' contains stock symbols
        results = []
        for i, x in self.df.iterrows():
            if x[column] in ref['name'].values:
                results.append(ref.loc[ref['name'] == x[column], 'symbol'].iloc[0])

#        vals = [ref.loc[ref['name'] == x[column]] for i,x in self.df.iterrows() if x[column] in ref['name'].values]
        self.df[new_key] = results

Converting the above to a list comprehension is possible. However, it would be less readable than the verbose loop above. Therefore, the current implementation should suffice for this issue. Next, we'll need to determine whether to study the entire subset, or to further subset our reduced dataset. Furthermore, we'll need to obtain the corresponding time-series stock values. This will propel us to the exploratory and analysis phase.
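
As an alternative to both the loop and a list comprehension, the lookup could be sketched with Series.map. This is a hypothetical standalone variant, not the project's method: it takes the dataframe as an argument instead of using self.df, and it assumes the reference columns are named 'name' and 'symbol'. Rows without a match receive NaN, so the new column always has one value per row, which avoids the row-count mismatch reported earlier.

```python
import pandas as pd


def set_column(df, column, ref, new_key):
    """Add new_key to a copy of df by looking up df[column] in ref['name'].

    Hypothetical standalone variant; rows without a match receive NaN,
    so the new column always has one value per row of df.
    """
    # keep the first symbol per name, mirroring .iloc[0] in the loop version
    lookup = ref.drop_duplicates('name').set_index('name')['symbol']

    result = df.copy()
    result[new_key] = result[column].map(lookup)
    return result


# illustrative data only
df = pd.DataFrame(
    {'company': ['keycorp', 'wordpress', 'apple inc.']},
    index=[625, 700, 1426],
)
ref = pd.DataFrame({
    'name': ['apple inc.', 'keycorp', 'keycorp'],
    'symbol': ['aapl', 'key', 'key'],
})
out = set_column(df, 'company', ref, 'ticker')
```

Unmatched rows can then be dropped with out.dropna(subset=['ticker']) if only companies with a ticker are wanted.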

jeff1evesque commented 5 years ago

As stated earlier, we need to determine whether the dataset needs to be further reduced. Additionally, we need to determine how to pull the corresponding timeseries stock data:

(screenshot: tickers-1)

jeff1evesque / fin-654

Determine economic datasets #7