joshuaulrich / rfimport

getSymbols() reboot

Determine how to handle duplicate symbols #3

Open joshuaulrich opened 3 years ago

joshuaulrich commented 3 years ago

A symbol spec vector could have the same symbol for multiple sources. And the symbols may be for different underlying series (e.g. "FOO" could be a stock or a FRED series).

What should we do in this case?

My intuition is that it shouldn't be allowed. We could throw an error in c.symbol_spec(). The error message should tell the user to remap one of the symbols.
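A minimal sketch of what such a check could look like. This is not the rfimport implementation: `check_duplicate_symbols()` is a hypothetical helper, and each spec is modeled as a plain list with a `source` and a `symbols` element purely for illustration.

```r
# Hypothetical helper; the real check would live in c.symbol_spec().
# Each spec is modeled here as list(source = <string>, symbols = <character>).
check_duplicate_symbols <- function(specs) {
  per_spec <- lapply(specs, function(s) s$symbols)
  symbols  <- unlist(per_spec, use.names = FALSE)
  sources  <- rep(vapply(specs, function(s) s$source, character(1)),
                  times = lengths(per_spec))

  # Symbols that appear more than once across all specs
  dup <- unique(symbols[duplicated(symbols)])
  if (length(dup) > 0) {
    detail <- vapply(dup, function(d) {
      sprintf("'%s' (sources: %s)", d,
              paste(sources[symbols == d], collapse = ", "))
    }, character(1))
    stop("duplicate symbols found: ", paste(detail, collapse = "; "),
         ". Remap one of them to a unique name.", call. = FALSE)
  }
  invisible(specs)
}

specs <- list(
  list(source = "yahoo", symbols = c("SPY", "FOO")),
  list(source = "fred",  symbols = "FOO")
)

# Capture the error message instead of aborting the script
result <- tryCatch(check_duplicate_symbols(specs),
                   error = function(e) conditionMessage(e))
```

The message names both the colliding symbol and the sources involved, so the user knows which entry to remap.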

Thoughts?

ethanbsmith commented 3 years ago

i have a very common use case where i get symbols for multiple periods, e.g. daily, weekly, and monthly. i only go out to the data source for the daily data and then re-sample to longer periods on top of that data. i can certainly handle this in higher-level code, but it strikes me that this may be a common use case and may be worth handling as well.

my current code base creates an env for each period, so I reference OHLC sets by price.cache[[period]][[Symbol]]

one solution that follows a similar pattern is to return a list of lists, where the first-level entries would be the SymbolSpec. you could then pull out the exact data by:

x <- fetch_symbols(sym.spec)
x[[sym.spec]][[Symbol]]

a non-recursive call to unlist would undo a lot of complexity when it's not needed. this could even be handled with a simplify parameter to the fetch call.
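A rough sketch of the list-of-lists shape and the simplify idea, using base R only. `simplify_result()` is a hypothetical name, and plain numeric vectors stand in for the OHLC objects a real fetch would return.

```r
# Stand-in data: first level keyed by spec, second level keyed by symbol.
nested <- list(
  yahoo = list(SPY = c(1, 2, 3), FOO = c(4, 5, 6)),
  fred  = list(FOO = c(7, 8, 9))
)

# Hypothetical helper showing what a `simplify` argument could do.
simplify_result <- function(x, simplify = TRUE) {
  if (!simplify) return(x)
  # Only one spec: drop the outer level so symbols keep their bare names
  if (length(x) == 1L) return(x[[1L]])
  # Several specs: flatten one level; names become "<spec>.<symbol>"
  unlist(x, recursive = FALSE)
}

flat   <- simplify_result(nested)
single <- simplify_result(nested["yahoo"])

names(flat)    # "yahoo.SPY" "yahoo.FOO" "fred.FOO"
names(single)  # "SPY" "FOO"
```

One trade-off with this shape: the flattened names differ depending on how many specs were requested, so code that needs stable keys would call with `simplify = FALSE`.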

just thinking out loud

joshuaulrich commented 3 years ago

I just had the thought that we could support symbol mapping by using the name of the element of the symbols argument as the ticker symbol. For example:

tickers <-
    c(sym_yahoo(symbols = c(SPY = "SPY", FOOyahoo = "FOO")),
      sym_fred(symbols = c(FOO = "FOO"))
)

There's a "FOO" from yahoo and fred, but we map the yahoo "FOO" to "FOOyahoo".
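To illustrate the naming rule only (not the actual `sym_yahoo()` / `sym_fred()` internals), here is a small base R sketch. `output_names()` is a hypothetical helper: it uses the element name when one is given and falls back to the ticker otherwise.

```r
# Hypothetical helper: the name of each element becomes the output symbol;
# unnamed elements fall back to the ticker itself.
output_names <- function(symbols) {
  out <- unname(symbols)
  nms <- names(symbols)
  if (!is.null(nms)) {
    named <- nzchar(nms)
    out[named] <- nms[named]
  }
  out
}

yahoo <- c(SPY = "SPY", FOOyahoo = "FOO")
fred  <- c(FOO = "FOO")

all_names <- c(output_names(yahoo), output_names(fred))
all_names                 # "SPY" "FOOyahoo" "FOO"
anyDuplicated(all_names)  # 0, so there is no collision after remapping
```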

ethanbsmith commented 1 year ago

been stewing on this in the background. it's fundamentally a URN problem. i think the most robust model would be symbol@source/subsource

e.g.: TM@yahoo/nyse, 7203@yahoo/jpx, T10Y3M@fred, GOOGL@tiingo/nasdaq

this is very similar to your idea, it just expands on it w/ a fully structured URN. at a very high level, with defaults for source and partial matching on source + subsource, i think it should handle a lot of scenarios
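A sketch of how the symbol@source/subsource form could be split into components, with a default for the source. `parse_symbol_id()` and the choice of "yahoo" as the default are assumptions for illustration, and partial matching is left out.

```r
# Hypothetical parser for "symbol@source/subsource".
# Source falls back to `default_source`; subsource is NA when absent.
parse_symbol_id <- function(x, default_source = "yahoo") {
  has_at   <- grepl("@", x, fixed = TRUE)
  symbol   <- sub("@.*$", "", x)
  location <- ifelse(has_at, sub("^[^@]*@", "", x), default_source)

  has_sub   <- grepl("/", location, fixed = TRUE)
  src       <- sub("/.*$", "", location)
  subsource <- ifelse(has_sub, sub("^[^/]*/", "", location), NA_character_)

  data.frame(symbol = symbol, source = src, subsource = subsource,
             stringsAsFactors = FALSE)
}

ids <- c("TM@yahoo/nyse", "7203@yahoo/jpx", "T10Y3M@fred", "GOOGL@tiingo/nasdaq")
parsed <- parse_symbol_id(ids)
parsed
```

Returning a data.frame keeps one row per requested symbol, which makes it easy to check uniqueness on the (symbol, source, subsource) triple.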

joshuaulrich commented 1 year ago

Very interesting idea! So we could create a sym_urn() function that takes strings like "yahoo:TM@nyse" to fetch NYSE symbol "TM" from Yahoo Finance. Then we would know that we need to call sym_yahoo("TM") to dispatch to the correct import_ohlc() method.

So your examples would translate to:

I used the foo:bar syntax because that's how URNs are supposed to be structured (namespace-id:namespace-string).

The URN syntax makes it easy to specify a vendor's sub-source (e.g. Tiingo has EOD data and IEX data). I'm not sure how that would work for the current syntax. Maybe sym_tiingo() and sym_tiingo_iex()?
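As a sketch of how the symbol@source/subsource examples might map onto the source:symbol@subsource form, assuming the "yahoo:TM@nyse" pattern applies to all of them. `to_urn()` is a hypothetical helper and not a proposed `sym_urn()` implementation; the grouping at the end only shows which symbols would go to which source-specific constructor.

```r
# Hypothetical converter: "symbol@source/subsource" -> "source:symbol@subsource"
to_urn <- function(x) {
  has_sub <- grepl("/", x, fixed = TRUE)
  ifelse(has_sub,
         sub("^([^@]+)@([^/]+)/(.+)$", "\\2:\\1@\\3", x),
         sub("^([^@]+)@(.+)$", "\\2:\\1", x))
}

ids  <- c("TM@yahoo/nyse", "7203@yahoo/jpx", "T10Y3M@fred", "GOOGL@tiingo/nasdaq")
urns <- to_urn(ids)
urns

# Split each URN into its namespace (the source) and the bare symbol,
# then group symbols by source for dispatch.
src <- sub(":.*$", "", urns)
sym <- sub("@.*$", "", sub("^[^:]*:", "", urns))
by_source <- split(sym, src)
by_source$yahoo  # "TM" "7203"
```

Each group could then be handed to the matching constructor, e.g. the yahoo group to `sym_yahoo()`.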

ethanbsmith commented 1 year ago

i guess i was sort of thinking that the namespace was quantmod, but also was thinking more conceptually than formal grammar. i'd have to re-read the spec before i had a strong view on any of this

i think a lot of schemes could work. definitely like the idea of a ParseSymbolUrn() function that would expose the components as a list or data.frame. that certainly could be built to delegate to source-specific implementations as u describe. a lot of flexibility there. i'd prolly lean to the simplest form that meets all currently known use cases.

on the tiingo example i'd lean to something like googl@tiingo.iex or tiingo:googl@iex. i think the sym_xxx functionality would need to be extended to support sub-sources, otherwise they don't add much value and we shouldn't introduce them

Current State: For the most part, Symbol@src is unique. i went through a number of sources and found only 1 exception to this so far (IBrokers). in fact, most of the underlying APIs only accept a symbol and sometimes date ranges. The sources have already mucked around with the symbols to make them unique

ethanbsmith commented 8 months ago

some use cases/scenarios that i think would be useful while considering design decisions:

  1. fetch data for multiple symbols from multiple sources in a single call.
     a. ideally date ranges and other parameters could be vectorized to allow diff values for each symbol
  2. a deterministic way to reference items in the result set, regardless of whether they were requested from a single or multiple sources.
     a. e.g. results shape/keys should not differ if there are symbol collisions.
     b. a simplify type parameter could even be the default to make interactive use less cumbersome.
  3. support parallel operations over items in the result.
     a. either directly usable w/ apply, mclapply, foreach etc. (preferable)
     b. or w/ a wrapper that uses something like foreach, which allows the consumer to specify a sequential or parallel backend.
  4. combine (c()) result data from previous calls to diff sources and diff date ranges.
     a. should do something like rbind() symbol within a source.
  5. subsetting
     a. by symbol(s)
     b. by source(s)
     c. by date range
     d. combinations of above
     e. by columns across sources and date ranges (really just a shorthand for common lapply scenarios)
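One possible shape that would cover scenarios 2, 3, and 5 is a flat list keyed by "source:symbol". This is only a sketch under that assumption: the key format is hypothetical, and named numeric vectors stand in for real time series.

```r
# Stand-in results: values named by date, one entry per source/symbol pair.
# Keys are always "source:symbol", whether or not symbols collide.
results <- list(
  "yahoo:SPY" = c("2024-01-02" = 1,   "2024-01-03" = 2),
  "yahoo:FOO" = c("2024-01-02" = 10,  "2024-01-03" = 20),
  "fred:FOO"  = c("2024-01-02" = 100, "2024-01-03" = 200)
)

key_source <- sub(":.*$", "", names(results))
key_symbol <- sub("^[^:]*:", "", names(results))

# Subset by symbol or by source
foo_only   <- results[key_symbol == "FOO"]
yahoo_only <- results[key_source == "yahoo"]

# Subset every series by date (here: a single day)
one_day <- lapply(results, function(x) x[names(x) == "2024-01-03"])

# A flat list works directly with the apply family
last_values <- vapply(results, function(x) unname(x[length(x)]), numeric(1))
```

Because the result is an ordinary list, the `lapply()` call could be swapped for `parallel::mclapply()` or a `foreach` loop without changing the data structure.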

joshuaulrich commented 8 months ago

some use cases/scenarios that i think would be useful while considering design decisions:

Moved to #18 to keep this issue focused on handling duplicate symbols.