joshuaulrich / rfimport

getSymbols() reboot

Use cases and design decisions #18

Open joshuaulrich opened 8 months ago

joshuaulrich commented 8 months ago

From @ethanbsmith:

Some use cases/scenarios that I think would be useful while considering design decisions:

  1. fetch data for multiple symbols from multiple sources in a single call.
    1. ideally date ranges and other parameters could be vectorized to allow different values for each symbol
  2. deterministic way to reference items in the result set, regardless of whether they were requested from a single source or multiple sources.
    1. e.g. results shape/keys should not differ if there are symbol collisions.
    2. a simplify type parameter could even be the default to make interactive use less cumbersome.
  3. support parallel operations over items in the result.
    1. either directly usable w/ lapply, mclapply, foreach etc. (preferable)
    2. or w/ a wrapper that uses something like foreach, which allows the consumer to specify a sequential or parallel backend.
  4. combine c() result data from previous calls to diff sources and diff date ranges.
    1. should do something like rbind() symbol within a source.
  5. subsetting
    1. by symbol(s)
    2. by source(s)
    3. by date range
    4. combinations of above
    5. by columns across sources and date ranges (really just a shorthand for common lapply scenarios)
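
A minimal sketch of use case 1.i (per-symbol date ranges) in base R. Everything here is hypothetical and not part of rfimport: `fetch_one()` is a stand-in that only records what was requested, and `fetch_many()` is an illustrative wrapper. The point is that `Map()` recycles `from` and `to` across symbols, so a single call can carry a different range for each symbol.

```r
# Hypothetical stand-in for a real data request; it only records what was asked.
fetch_one <- function(symbol, from, to) {
  data.frame(symbol = symbol, from = as.Date(from), to = as.Date(to))
}

# Hypothetical wrapper: 'from' and 'to' are recycled across 'symbols',
# so each symbol can have its own date range.
fetch_many <- function(symbols, from, to) {
  out <- Map(fetch_one, symbols, from, to)
  names(out) <- symbols
  out
}

res <- fetch_many(c("GE", "SPY"),
                  from = c("2019-01-01", "2014-01-01"),
                  to   = "2023-12-31")

# The result is a plain named list, so lapply() works on it directly.
spans <- lapply(res, function(d) as.numeric(d$to - d$from))
```

Because the result is an ordinary named list, the same call pattern should also work with `parallel::mclapply()` in place of `lapply()`, which is the shape use case 3.i asks for.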

joshuaulrich commented 8 months ago

Thanks for the thoughts!

  1. This works now. Can you give some examples of parameters you would want to vary by symbol in (i.)? It's not clear to me why you'd want to import a set of symbols into one object if they didn't have the same dates.

  2. This "works" now. The output list is in the same order as the symbols passed to the import method. Duplicate symbols are an issue though (see below). Can you expand on what you mean by (ii.)?

require(rfimport)
s <- c(sym_yahoo("SPY"), sym_tiingo("SPY"))
x <- import_ohlc_collection(s)
sapply(x, attr, "src")
##      SPY      SPY 
## "tiingo" "tiingo" 

### reverse the symbol source order...
s <- c(sym_tiingo("SPY"), sym_yahoo("SPY"))
x <- import_ohlc_collection(s)
sapply(x, attr, "src")
##      SPY      SPY 
## "tiingo" "tiingo"
  3. This probably deserves discussion in its own issue.

  4. Not sure what you're proposing here. Maybe an example would help?

  5. I started an ohlc_collection class for things like this, and that's the class import_ohlc_collection() returns. There's a subset method for it, but it doesn't do anything other than regular list subsetting (by symbol). We could add arguments for "source", "dates", etc.
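
As a rough illustration of what a "source" argument could look like, here is a sketch on a plain list rather than the real ohlc_collection class. It assumes only that each element carries a "src" attribute, as the `sapply(x, attr, "src")` call above implies. The names `make_item()` and `subset_collection()` are hypothetical, and the toy data is made up.

```r
# Toy collection: a plain list whose elements carry a "src" attribute.
make_item <- function(values, src) structure(values, src = src)

x <- list(SPY = make_item(1:3, "yahoo"),
          SPY = make_item(4:6, "tiingo"),
          IBM = make_item(7:9, "yahoo"))

# Hypothetical helper: keep elements matching the given symbol(s) and/or source(s).
# A NULL argument means "do not filter on this dimension".
subset_collection <- function(x, symbol = NULL, src = NULL) {
  keep <- rep(TRUE, length(x))
  if (!is.null(symbol)) {
    keep <- keep & names(x) %in% symbol
  }
  if (!is.null(src)) {
    keep <- keep & vapply(x, attr, character(1), "src") %in% src
  }
  x[keep]
}

spy_tiingo <- subset_collection(x, symbol = "SPY", src = "tiingo")
all_yahoo  <- subset_collection(x, src = "yahoo")
```

Filtering on the attribute instead of the list names means the call behaves the same whether or not two elements share a symbol.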

ethanbsmith commented 8 months ago

not suggesting these are issues. was writing these down as notes for myself and figured it would be better to put them somewhere safe

  1. i have a paged cache that only goes out to the source for deltas. e.g. if i already have 5 years of GE data in cache and run a study on GE;SPY for 10 years, i only want to fetch the prior 5 years of GE and the full 10yrs of SPY. right now i have to loop through symbols and call getSymbols for each symbol individually. so i would like vectorized date ranges
  2. in the example above, i see 2 items named SPY in x. how do i get the tiingo version of spy from x? just want to make sure that for production code, this can be done consistently, regardless of whether or not there are symbol collisions. i.e. if we go down the path of items being indexed under symbol@src, would prefer there is a way to do that whether or not there are collisions. doesn't need to be a fixed single solution and optionality is probably good here
  3. wording probably ventured too far into solutions. was really just meant as a requirement/use case
  4. see caching use case in 1 above. just want to merge date ranges for the same symbol (rbind)
  5. cool. will take a look
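
The cache-delta case in item 1 can be sketched in base R as follows. This is an illustration under stated assumptions, not how any existing cache works: the cache is assumed to hold one contiguous date range per symbol, and `missing_ranges()` and `merge_cached()` are hypothetical names.

```r
# Return the date ranges still to be fetched so that [from, to] is covered,
# given one contiguous cached range (NA means nothing is cached).
missing_ranges <- function(from, to, cached_from = NA, cached_to = NA) {
  from <- as.Date(from)
  to <- as.Date(to)
  cached_from <- as.Date(cached_from)
  cached_to <- as.Date(cached_to)

  # No cache, or cache entirely outside the request: fetch everything.
  if (is.na(cached_from) || is.na(cached_to) ||
      cached_from > to || cached_to < from) {
    return(data.frame(from = from, to = to))
  }

  gap_from <- as.Date(character(0))
  gap_to <- as.Date(character(0))
  if (from < cached_from) {          # gap before the cached range
    gap_from <- c(gap_from, from)
    gap_to <- c(gap_to, cached_from - 1)
  }
  if (to > cached_to) {              # gap after the cached range
    gap_from <- c(gap_from, cached_to + 1)
    gap_to <- c(gap_to, to)
  }
  data.frame(from = gap_from, to = gap_to)
}

# Merge newly fetched rows into cached rows (item 4): rbind, drop repeated
# dates, and sort. Assumes both data frames have a 'date' column.
merge_cached <- function(old, new) {
  all_rows <- rbind(old, new)
  all_rows <- all_rows[!duplicated(all_rows$date), , drop = FALSE]
  all_rows[order(all_rows$date), , drop = FALSE]
}

# GE has 2019 onward cached; SPY has nothing cached.
ge_gaps  <- missing_ranges("2014-01-01", "2023-12-31", "2019-01-01", "2023-12-31")
spy_gaps <- missing_ranges("2014-01-01", "2023-12-31")
```

Here `ge_gaps` holds only the earlier, uncached stretch for GE, while `spy_gaps` holds the full requested range, which is why per-symbol date ranges in a single call would remove the manual loop.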

joshuaulrich commented 8 months ago

> not suggesting these are issues

I didn't think you were saying these were problems. I view them as feature requests and/or enhancements. I just wanted them in their own place so they were their own topic and easier to find later.

  1. We should look at how caching and parallelism are implemented in yfR by @msperlin. It appears to cache by session, not persistently between sessions.
  2. I don't have a good solution for this. My initial thought is that we don't allow duplicate symbols in 'sym_spec' objects. So you'd need to make one call per duplicate symbol source, or use the 'sym_urn' method of specifying symbols. I'm open to other ideas though.
  3. Example input and output would be helpful here. Then we could use it for a unit test.
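
One possible shape for the symbol@src keys mentioned earlier in the thread, sketched on a plain list. This is only an illustration: `key_by_source()` is a hypothetical helper, the toy data is made up, and the sketch assumes each element carries a correct "src" attribute.

```r
# Toy collection with a symbol collision (SPY from two sources).
x <- list(SPY = structure(1:3, src = "yahoo"),
          SPY = structure(4:6, src = "tiingo"),
          IBM = structure(7:9, src = "yahoo"))

# Hypothetical helper: rename elements to "symbol@src".
# always = TRUE renames every element, so keys have the same form whether or
# not there are collisions; always = FALSE renames only colliding symbols.
key_by_source <- function(x, always = TRUE) {
  src <- vapply(x, attr, character(1), "src")
  sym <- names(x)
  collides <- sym %in% sym[duplicated(sym)]
  names(x) <- ifelse(always | collides, paste0(sym, "@", src), sym)
  x
}

keyed <- key_by_source(x)
spy_tiingo <- keyed[["SPY@tiingo"]]
```

With `always = TRUE`, production code can use the same key form in every case; `always = FALSE` keeps bare symbols where they are unambiguous, which may be nicer for interactive use.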

ggrothendieck commented 3 months ago

I would be careful about imposing too much boilerplate on users with nested functions. Most of the time users will only be using one source and that should be easy to use without having further considerations.

joshuaulrich commented 3 months ago

@ggrothendieck thanks for the feedback! I assume you're referring to the import_*(sym_*()) code? How do you think we could get around that?

Off the top of my head, we could parse a single string that has the source at the beginning. For example:

import_ohlc("yahoo:AAPL,SPY,IBM")   # yahoo finance
import_ohlc("tiingo:AAPL,SPY,IBM")  # tiingo

But that's a bit of a pain if the symbols are in a vector.
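
A small sketch of how such a string could be split, and how a symbol vector could be turned into one. Both helpers are hypothetical and not part of rfimport. The sketch assumes exactly one ":" in the string, so symbols that themselves contain ":" would need a different rule.

```r
# Hypothetical parser for strings like "yahoo:AAPL,SPY,IBM".
parse_sym_string <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)  # assumption: exactly one ":" separator
  list(src = parts[1],
       symbols = strsplit(parts[2], ",", fixed = TRUE)[[1]])
}

# Going the other way when the symbols are already in a vector.
make_sym_string <- function(src, symbols) {
  paste0(src, ":", paste(symbols, collapse = ","))
}

p <- parse_sym_string("yahoo:AAPL,SPY,IBM")
s <- make_sym_string("tiingo", c("AAPL", "SPY", "IBM"))
```

The `make_sym_string()` step is the extra work a user with a symbol vector would have to do, which is the inconvenience noted above.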

Another idea is to have a global option that users can set to a single source, and we can add a `src = getOption("rfimport_default_sym_source")` argument to the import_* functions.
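
A sketch of the global-option idea using base R's `options()` and `getOption()`. The function `import_sketch()` is hypothetical and only returns labels instead of fetching data, and the fallback value "yahoo" is an assumption made for this example; the thread does not specify a default.

```r
# Hypothetical import function: 'src' defaults to the global option and
# falls back to "yahoo" (an assumed default) when the option is unset.
import_sketch <- function(symbols,
                          src = getOption("rfimport_default_sym_source", "yahoo")) {
  paste0(src, ":", symbols)  # stand-in for the real import
}

# Set the source once per session...
old_opts <- options(rfimport_default_sym_source = "tiingo")
from_option <- import_sketch(c("AAPL", "SPY"))

# ...or override it for a single call.
explicit <- import_sketch("IBM", src = "yahoo")

# Restore the previous option value.
options(old_opts)
```

Because the default is evaluated at call time, changing the option mid-session affects later calls without redefining the function, and the common single-source case needs no nested sym_*() call.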