title: "rfimport: getSymbols Rebooted" author: Joshua Ulrich date: March 03, 2023 geometry: "left=2cm,right=2cm,top=2cm,bottom=2cm" output: pdf_document

Background and Motivation

The quantmod package has been a core part of the R/Finance ecosystem for over 15 years. It's awesome that the package is so popular, but that also comes with responsibility to maintain backward compatibility. Breaking changes may break code used for making business decision, research, blog posts, books, courses, answers on stackoverflow, and much more. We take this responsibility seriously, and do our best to keep functions backward compatible. Sometimes breaking changes are necessary (e.g. bug fixes, changes to external data sources, etc.), but we do our best to make them carefully, with plenty of warning and lead time for users to adjust their code.

There are things in quantmod that we want to change, but they would certainly break existing code. No matter how much we'd like to make those changes, we can't justify breaking a large portion of the code our community has written in the past 15+ years.

Introducing the "rfimport" package. It is a place to work on new implementations that improve on the pieces in getSymbols() that we would like to change. This code is extremely alpha. This is the time to provide feedback, suggestions, feature requests, etc. Know that we will break things, maybe without warning. You should consider the API unstable until the 1.0.0 release.

How the Old `getSymbols()` Works

By default getSymbols() creates objects for each Symbol in the environment it's called from, and it returns the value of the Symbol argument. It's generally good practice for functions to avoid changing anything in the user's environment (this is called having side-effects). It's better for functions to only return a value, like getSymbols(..., auto.assign = FALSE) does. getSymbols() does not support auto.assign = FALSE for more than one symbol.

getSymbols() also uses functionality that was formerly provided by the archived Defaults package. This functionality to allow users to set default values for getSymbols() source method arguments. This is also a side-effect because it makes getSymbols() depend on something other than its argument values.

getSymbols() specifies the data sources via its src argument. Then getSymbols() uses the src argument to determine which source method to use (e.g. getSymbols("SPY", src = "yahoo") will call getSymbols.yahoo("SPY") behind the scenes. This is essentially method dispatch, but done manually rather than using R's built-in S3 functionality.

What We've Learned

We should avoid the side-effect of creating objects in the calling environment.
We can use S3 method dispatch instead of creating getSymbols() source methods.
Stock ticker symbology is a pain and we need a better way to handle it.
We need a way to provide functionality like the Defaults package did, but without side-effects.

Automatically Creating Objects

getSymbols() creates an object for every value in the Symbols argument. This isn't an issue for a few symbols, but it clutters the environment when there are several hundred symbols. You can load all the symbols into a separate environment, but that's not a pattern most users are familiar with.

We wanted to remove the ability to load objects into the calling environment, and even created a warning about changing auto.assign = FALSE as the default for getSymbols() and recommending users replace their getSymbols() call with the loadSymbols() function that already existed. But we ultimately decided breaking the community's code wasn't worth it.

Automatically creating objects makes it cumbersome to put prices for all symbols into one object. This is a common use case and there are several steps. It should be possible with one or two function calls. Here's an example.

symbols <- c("SPY", "AAPL")
getSymbols(symbols)

# Put all the prices into one xts object,
prices <- do.call(merge, lapply(symbols, get))
# or
prices <- do.call(merge, mget(symbols))

# Extract only the Close prices
close_prices <- Cl(prices)

# Remove ".Close" suffix so close_prices[, "SPY"] works
colnames(close_prices) <- sub(".Close", "", colnames(close_prices))

head(prices[, symbols[1]])

Automatically creating objects also makes passing all the data to another function awkward. It causes users to do things like:

Call getSymbols() in any function that needs data, which may mean the same data is imported multiple times.
Pass the same symbols object to getSymbols() and the other function. Then the other function searches through environments to find the objects named with those symbols.
Users could put all the data in an environment and use that as an argument to the function, but I haven't seen many people use this pattern.

The lesson: a function that imports data should return the data like a normal R function, without using or creating side-effects.

Returning an object means we need a way to return multiple series in one object (e.g. a list). Returning one object also avoids the issue where the symbol is not a valid R object name (e.g. getSymbols("^DJI") creates an object named DJI). More on that later.

Source methods

Different getSymbols() source methods can (or may need to) have different arguments. Ideally the source methods wouldn't be exported because users shouldn't call them directly, like getSymbols.yahoo("SPY"). Not being exported makes it hard to find the help page for them, which means it's hard to know what arguments are available for various source methods.

The source methods are named like S3 methods even though getSymbols() isn't a generic function and the source methods aren't actual S3 methods. This has the potential to create odd behavior that would confuse users.

Ticker Symbology

There are two major issues with ticker symbols.

Exchanges and data providers sometimes use different ticker symbols for the same security.
Some ticker symbols are not valid R object names.

Another issue is when the ticker symbol is similar to the name of one of the price columns. This has come up several times with Lowe's (LOW). The Lo() and OHLC() functions think all of the columns with the ticker symbol in the column name are the low price for the period.

Same Security, Different Ticker

This isn't getSymbols()'s fault and it's out of our control, but it needs to be handled better. Exchange and data source symbology is awful. Identifiers for the same series are often different across exchanges and data providers. For example: the symbol for Berkshire Hathaway B-class shares is "BRK-B" for Yahoo Finance, "BRK/B" for the SIP (Securities Information Processor), "BRK B" for ICE, and probably "BRK.B" somewhere else.

Invalid R Object Names

getSymbols() tries to create objects with valid R names, but only does so for some symbols that aren't valid R object names. For example, BRK-B, BRK B, and BRK/B aren't valid R objects names because valid names start with a letter or a dot (period), and can only contain letters, numbers, a dot, or an underscore.

Here are some common examples of ticker symbology woes:

^DJI isn't a valid R object name because it starts with ^. So getSymbols() creates an object with the ^ removed. But then you can't use the code below to put all the prices into one object. Also notice that getSymbols() returns "^DJI" even though it creates an object with a different name.

symbols <- c("^DJI", "BRK-B")
getSymbols(symbols)
## [1] "^DJI" "BRK-B"

prices <- do.call(merge, mget(symbols))
## Error: value for '^DJI' not found

You have to remove the leading ^ manually. And you have to set fixed = TRUE in the call to sub() because ^ is a special character in regular expressions. Sigh.

prices <- do.call(merge, mget(sub("^", "", symbols, fixed = TRUE)))

Recall that BRK-B also isn't a valid R name because of the -. But it wasn't an issue in the code above because getSymbols() made an object named BRK-B, not an object with a valid R name. This is confusing for users because they can't easily access that object (i.e. head(BRK-B) is an error). This is a pervasive issue for several foreign exchanges with tickers that begin with numbers (e.g. 000001.SZ).

Another issue with symbols that aren't valid R object names is that many R functions will convert column names into valid R object names, including merge.xts(). So you can't use the input symbol to subset the resulting xts object. Here's an example:

# Extract the close prices and remove ".Close" suffix
close_prices <- Cl(prices)
colnames(close_prices) <- sub(".Close", "", colnames(close_prices))

# Extract the close price for "BRK-B"
close_prices[, "BRK-B"]
## Error in `[.xts`(close_prices, , "BRK-B") : subscript out of bounds
colnames(close_prices)
## [1] "DJI"   "BRK.B"

setSymbolLookup() exists to help with things like this, but it's another function users have to learn to use and my experience is that most users don't know about setSymbolLookup(). I just had to look at the source to figure out how to use it to make getSymbols() return a valid R object for "BRK-B".

setSymbolLookup(BRK.B = list(name = "BRK-B", src = "yahoo"))

If I have to look at the source code to figure out how to do this, users don't have a chance. You may think, "but you could document how to do this", but writing documentation isn't fun. And who reads the documentation anyway? ;-)

Defaults Functionality

The "Defaults" functionality in quantmod comes from the archived Defaults package. This functionality allows users to set new default argument values to any getSymbols() source function. This is helpful because it makes importing easier. But it means getSymbols() relies on something other than its parameter values, and it's good practice to avoid side-effects like this.

This gave users the ability to set preferences like return class, periodicity (e.g. hourly, daily, monthly), connection settings (e.g. credentials, API keys).

'rfimport' Design and Features

The design of 'rfimport' is influenced by the DBI package, which provides a set of generic 'database interface' functions. Users create connection objects by creating a 'driver' object for the specific database and passing that to dbConnect(). Then you pass that connection object to the other DBI functions. For example, to query an execute a statement for a PostgreSQL database:

library(RPostgreSQL)

driver <- PostgreSQL()
conn <- DBI::dbConnect(driver)
student_count <- DBI::dbGetQuery(conn, "select count(*) from students")

# Add a new student
new_student <- "Robert'); DROP TABLE students;--"
DBI::dbExecute(conn,
    "INSERT INTO students (name) VALUES ('?');",
    params = list(new_student))

To import data using 'rfimport':

library(rfimport)

# The sym_* functions are a combination of the
# driver, connection, and query in DBI
syms <- sym_yahoo("SPY")

# Import some data from Yahoo Finance
spy <- import_ohlc(syms)

Symbol Specification

The package introduces a new virtual S3 class "symbol_spec" as the basis for creating sub-classes that hold all necessary information to connect to a data source. This virtual class allows users to combine symbols from different data sources into a single vector. For example: import_ohlc_collection(c(sym_yahoo("SPY"), sym_tiingo("DIA"))) will import data for "SPY" from Yahoo Finance and data for "DIA" from Tiingo.

Each data source will have its own symbol_spec constructor. The constructor will have an argument for the vector of symbols and other arguments for all the other data source connection settings. It will return an object that inherits from the new virtual symbol_spec For example sym_yahoo() will return a c("yahoo", "symbol_spec") class vector.

The help page for the symbol spec constructors can also document the import methods that the data source supports. So help("sym_yahoo") would also contain information about import.yahoo() and import_collection.yahoo().

Ticker Symbology

The package would standardize how index tickers are specified. One possibility is to prefix the ticker with an 'i' or 'i_' (e.g. "iDJI" or "i_DJI").

It would also standardize how to specify share classes, warrants, preferred, etc. One possibility is to use an underscore to identify share classes, a lowercase 'w' for warrants, and a lowercase 'p' for preferred. For example, "BRK_B" for Berkshire Hathaway B shares, "FOOw" for warrants, "BARp" for preferred. We could also include a translation table and/or function.

It could also provide a way to map source symbols to user-defined values. It makes the most sense to do this is the sym_<source>() constructor. But how set and store the mapping? Some possibilities:

sym_yahoo(BRK.B = list(symbol = "BRK-B"))
sym_yahoo(c(BRK.B = "BRK-B", "DIA"))
sym_yahoo(c("BRK.B", "DIA"), sym_db = list(BRK.B = "BRK-B"))

Import Functions

Generics

The package will have generic functions import() and import_collection() to dispatch on symbol_spec sub-classes. import_collection() will return a list of xts objects for one or more symbols. import() only handles a single symbol and returns one xts object. Other generic import functions may be added in the future.

The generics will have a symbol_spec, dates, ..., and periodicity arguments, in that order. dates can be either an ISO 8601 date interval (e.g. dates = "2021-01-01/2021-12-31") or a two-element vector with the start and end dates (e.g. dates = c("2021-01-01", "2021-12-31")). The vector can be Date, POSIXct, or a character that is coercible to one of those two classes.

For example:

spy <- import(sym_yahoo("SPY"), dates = "2021/2022")
str(spy)

stocks <- sym_tiingo(c("AAPL", "NFLX")) |>
    import_collection(dates = "2021-03-01/2022-11-31")
str(stocks)

The package may also include generic import functions that return specific types of data. Particularly import_ohlc() for open, high, low, close, (adjusted, volume), and import_bbo() for best bid and offer.

Periodicity

The periodicity argument specifies the interval between data points (e.g. daily, monthly, 15-minute). The data source determines the possible periodicity values, so they're responsible for ensuring the requested periodicity value is available from the data source. 'rfimport' will provide a standard way to specify the periodicity values. Then the source methods can translate those values into the value source needs. For example, one data source may use "monthly" for monthly data and another may use "months". Users would set periodicity = "months" for either source and the source method would translate the value to "monthly", if necessary.

Source Methods

Each data source will have a S3 method for the relevant import generics, rather than a src argument like getSymbols(). Calling import(sym_yahoo("SPY")) will call the corresponding import.yahoo() method to import data from Yahoo Finance.

Returned Data

The built-in import() methods will automatically include dividends and splits (when available) when importing daily OHLCV data. They will be included as attributes on the returned OHLCV object. This will allow users to switch between adjusted and unadjusted prices without having to re-download the data.

They also will not include the series symbol in the OHLCV column names like getSymbols() currently does. It may make sense to include an attribute with the "source symbol" (e.g. "^DJI") and the "R symbol" (e.g. "DJI") on the returned xts object. Then that attribute can be used later if it makes sense as (part of) a column name.

Providing "Defaults" Functionality

Though we want to avoid side-effects, we probably want to provide a way to set credentials so they do not have to be provided for every import call.

We could provide this functionality in a pure way by creating an options object that holds a list of values. Users would create this object once and pass it to the relevant 'rfimport' function (either sym_<source>() or import()). The default options could be created by a function like sym_<source>_options(). This would be similar to the 'control' arguments to many optimization routines (e.g. DEoptim.control()).

Open Questions and Considerations

How should we specify the class of the returned object? Some possibilities:

Set it via a return_class argument in the symbol_spec constructor
- PRO: each source is likely to have a specific data structure, and it wouldn't require creating a generic import function for each return type.
- CON: allows the potential for one call to an import_*_collection() function to return a list of heterogeneous objects.
Set it via a return_class argument in the import method
- PRO: the method would return a list of objects that are all the same class.
- CON: the generic and/or the default import method would need a return_class argument.
Create a new generic import functions for each return class
- PRO: makes it clear what the import function returns.
- CON: namespace clutter, don't want generics for every class. Possibly provide generics for most widely used non-xts classes: data.frame, data.table, tibble, tsibble.
The symbol specification can store a function that controls what data is returned. Not sure I like that because it adds complexity and the user can call that function after the data is returned. For example: sym_yahoo("SPY", return_func = as.data.table).

How can we make it easier to manipulate results?

It should be easier to manipulate returned data for common use cases.

The most common use case is making a wide xts object with close prices from a list of xts objects. This currently requires several steps that are likely unfamiliar to most users. It should be possible with one or two function calls. We can consider Garrett See's qmao package for inspiration. For example, use 'price frames' to replace do.call(merge, list_of_objects_from_getSymbols)

There are lots of other common manipulations, like aggregating to a higher periodicity or applying a function to many symbols' data. The import functions will return something list-like, so users can use lapply() to apply any other function to each series.

Should we create new data types?

Should we create a new package with new classes for financial market data?

It would allow us to use functions like Cl(), OHLC(), etc. in packages that depend on each other. This is an issue in TTR. It can't use these functions because quantmod depends on TTR (so TTR can't depend on quantmod).

We could also create methods for these classes to do common tasks. For example: create an xts object of close prices from a list of OHLC xts objects, or calculate weighted midpoint from BBO data.

Data for each symbol would be stored separately, so column names wouldn't need to use the "Symbol.Data" pattern.

The import() methods could return the new class instead of a plain list. The new class also provides an opportunity to re-consider how the data is stored and how extractor functions (e.g. Cl()) work. We could also create some setter functions (e.g. Cl<-).

joshuaulrich / rfimport

readme

title: "rfimport: getSymbols Rebooted" author: Joshua Ulrich date: March 03, 2023 geometry: "left=2cm,right=2cm,top=2cm,bottom=2cm" output: pdf_document

Background and Motivation

How the Old `getSymbols()` Works

What We've Learned

Automatically Creating Objects

Source methods

Ticker Symbology

Same Security, Different Ticker

Invalid R Object Names

Defaults Functionality

'rfimport' Design and Features

Symbol Specification

Ticker Symbology

Import Functions

Generics

Periodicity

Source Methods

Returned Data

Providing "Defaults" Functionality

Open Questions and Considerations

How should we specify the class of the returned object? Some possibilities:

How can we make it easier to manipulate results?

Should we create new data types?

joshuaulrich / rfimport

readme

title: "rfimport: getSymbols Rebooted" author: Joshua Ulrich date: March 03, 2023 geometry: "left=2cm,right=2cm,top=2cm,bottom=2cm" output: pdf_document

Background and Motivation

How the Old getSymbols() Works

What We've Learned

Automatically Creating Objects

Source methods

Ticker Symbology

Same Security, Different Ticker

Invalid R Object Names

Defaults Functionality

'rfimport' Design and Features

Symbol Specification

Ticker Symbology

Import Functions

Generics

Periodicity

Source Methods

Returned Data

Providing "Defaults" Functionality

Open Questions and Considerations

How should we specify the class of the returned object? Some possibilities:

How can we make it easier to manipulate results?

Should we create new data types?

How the Old `getSymbols()` Works