yfR: Downloads and Organizes Financial Data from Yahoo Finance #523

Closed msperlin closed 2 years ago

msperlin commented 2 years ago

Date accepted: 2022-06-21

Submitting Author Name: Marcelo Perlin
Submitting Author Github Handle: @msperlin
Repository:
Version submitted: 0.0.1
Submission type: Standard
Editor: @melvidoni
Reviewers: @s3alfisc, @thisisnic

Due date for @s3alfisc: 2022-05-29
Due date for @thisisnic: 2022-06-13

Archive: TBD Version accepted: TBD Language: en

Package: yfR
Title: Downloads and Organizes Financial Data from Yahoo Finance
Version: 0.0.1
Authors@R: person("Marcelo", "Perlin", email = "", role = c("aut", "cre"))
Description: Facilitates download of financial data from Yahoo Finance <>, 
 a vast repository of stock price data across multiple financial exchanges. The package offers a local caching system
 and support for parallel computation.
Depends:
    R (>= 4.1)
Imports: stringr, curl, tidyr, 
    lubridate, furrr, purrr, future, tibble, zoo,
    cli, readr, rvest, dplyr, quantmod
License: MIT + file LICENSE
LazyData: true
RoxygenNote: 7.1.2
Suggests:
    testthat (>= 3.0.0),
VignetteBuilder: knitr
Config/testthat/edition: 3


Package yfR retrieves and organizes data from Yahoo Finance, a large repository for stock price data.

The target audience is students, researchers and industry practitioners in the fields of Finance and Economics.

Package yfR is the second, backwards-incompatible version of BatchGetSymbols, also developed by me. My plan is to first deprecate BatchGetSymbols, then remove it from CRAN and archive it on GitHub.

Moreover, there are other packages, such as quantmod, that download data from Yahoo Finance, but none with features similar to those of yfR and BatchGetSymbols.


Unfortunately, I was not able to run pkgcheck locally, as I was unable to install (or make) the ctags dependency on my Linux Mint 20.3 machine. Nonetheless, I read through and followed all guidelines available in the manual.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code).
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

Package License: MIT + file LICENSE

1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:
- code in R (100% in 8 files) and
- 1 authors
- 1 vignette
- no internal data file
- 14 imported packages
- 6 exported functions (median 16 lines of code)
- 34 non-exported functions in R (median 12 lines of code)

1a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 87.78

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
yf_get | 23
yf_get_single_ticker | 22

Static code analyses with lintr

lintr found the following 2 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 2

Package Versions

|package  |version   |
|:--------|:---------|
|pkgstats |          |
|pkgcheck |          |

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

mpadge commented 2 years ago

@jooolia The failing check is just because the README does not have a CI badge. @msperlin Could you please add an R CMD check badge to your readme? (We check for CI via badges rather than workflow results, because we do accept submissions from arbitrary code-hosting platforms, not just GitHub.) Thanks!

msperlin commented 2 years ago

Good morning.

Sure, I just added the R-CMD badge.

jooolia commented 2 years ago

@ropensci-review-bot check package

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

jooolia commented 2 years ago

Dear @msperlin, Thank you for your submission. The package has passed all of the automated package checks and the test coverage is good. Could you expand a bit more on how this package differs from quantmod and tidyquant? Thanks, Julia

msperlin commented 2 years ago

Good morning Julia,

The main goal of yfR is to help users download large amounts of data from Yahoo Finance (YF).

Packages quantmod and tidyquant also offer a function for downloading price data from YF, but only that. Besides importing data, yfR offers the following functionalities:

jooolia commented 2 years ago

Thank you @msperlin, I am discussing with the other editors and will get back to you. Thanks, Julia

jooolia commented 2 years ago

Thanks for your patience @msperlin. The fit seems to be good for us and I am now looking for a handling editor. Thanks, Julia

msperlin commented 2 years ago

Great, thanks @jooolia.

jooolia commented 2 years ago

@ropensci-review-bot assign @melvidoni as editor

melvidoni commented 2 years ago

@ropensci-review-bot seeking reviewers

msperlin commented 2 years ago

Thanks. The badge is added in dc712f4abac246604721ed7f2926f9794e4e7f99 and the news file already exists.

Athene-ai commented 2 years ago

Hi @melvidoni ! I would like to review this package

melvidoni commented 2 years ago

Hi @melvidoni ! I would like to review this package

Hello @Athene-ai, of course, this package still needs reviewers. I saw you wrote on several packages, so be mindful that asking in multiple places may not be ideal, as you may end up with more workload than intended. The review timeframe for this is 3 weeks, so if that's okay with you, I'll assign you to this package (and you'll have to complete this review first before accepting any others).

Athene-ai commented 2 years ago

@melvidoni I accept the invitation to review this package within three weeks

melvidoni commented 2 years ago

@ropensci-review-bot assign @Athene-ai as reviewer

Athene-ai commented 2 years ago

@melvidoni thanks for adding me as reviewer and I filled the volunteer form for being an rOpenSci Reviewer :-)

Athene-ai commented 2 years ago

@melvidoni do we have a slack channel?

melvidoni commented 2 years ago

@ropensci-review-bot assign @s3alfisc as reviewer

melvidoni commented 2 years ago

@melvidoni do we have a slack channel?

Hello @Athene-ai. Please, be mindful that responses are not immediate, especially over the weekend; kindly do not hasten people, and wait for responses/actions. There is much going on "behind the scenes" that you may not be aware of.

That said, you'll get an invitation to the Slack later in the process.

Athene-ai commented 2 years ago

@melvidoni do we have a slack channel?

Hello @Athene-ai. Please, be mindful that responses are not immediate, especially over the weekend; kindly do not hasten people, and wait for responses/actions. There is much going on "behind the scenes" that you may not be aware of.

That said, you'll get an invitation to the Slack later in the process.

Thanks for the information 😊

mpadge commented 2 years ago

@Athene-ai Could you please paste a completed review here? Rather than adding more comments to this issue, you may leave that template there for now, and update it with an actual review when you've got that far. It's best to complete the template offline, edit the issue to delete all current content, and then simply paste the completed review back in place of the above comment. Thanks.

melvidoni commented 2 years ago

@ropensci-review-bot remove @Athene-ai from reviewers

melvidoni commented 2 years ago

@msperlin we apologise for the issues caused with the prior reviewer. They have now been removed from the list of reviewers, and I will proceed to search for another reviewer. Please understand that although we try to give everyone an opportunity, sometimes it is not possible to foresee how they will take it.

I will strive to get a new reviewer, but the person will be given 3 weeks from the acceptance date, hence some delays are bound to happen.

Edit: wrong punctuation, apologies.

msperlin commented 2 years ago

Good morning @melvidoni.

No problem at all. I can wait.


melvidoni commented 2 years ago

@ropensci-review-bot assign @thisisnic as reviewer

s3alfisc commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide


The package includes all the following forms of documentation:

You can find more comments on documentation below.


Estimated hours spent reviewing: 8

Additional Comments

I think that yfR is a very promising package with useful features, and I believe that it will be widely used. I very much enjoyed using it! To improve the package, I mostly suggest to invest more time into refining the documentation.


Installation, Local CMD Check & pkgcheck



Additional Functionality


  # check for NA
  if (any(is.na(tickers))) {
    my_msg <- paste0(
      "Found NA value in ticker vector.",
      "You need to remove it before running BatchGetSymbols."
    )
    stop(my_msg)
  }

  if (class(first_date) != "Date") {
    stop("ERROR: cant change class of first_date to 'Date'")
  }

In general, I really like the dreamerr package for function input type checks. checkmate seems to be very popular, too.

With dreamerr, you could e.g. replace

  # check threshold
  if ((thresh_bad_data < 0) | (thresh_bad_data > 1)) {
    stop("Input thresh_bad_data should be a proportion between 0 and 1")
  }

with

dreamerr::check_arg(thresh_bad_data, "scalar numeric GT{0} LT{1}")
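
For comparison, here is a dependency-free sketch of the same check in base R. The helper name `check_proportion()` is hypothetical and not part of yfR. One detail worth noting: the `if` condition above accepts the endpoints 0 and 1, while `GT{0} LT{1}` reads as strict inequalities, so this sketch keeps the inclusive bounds.

```r
# Hypothetical helper (not part of yfR): validates a single proportion in [0, 1].
check_proportion <- function(x, arg_name = "thresh_bad_data") {
  ok <- is.numeric(x) && length(x) == 1L && !is.na(x) && x >= 0 && x <= 1
  if (!ok) {
    stop(
      sprintf("Input %s should be a single number between 0 and 1", arg_name),
      call. = FALSE
    )
  }
  invisible(x)
}

check_proportion(0.75)  # passes silently

# An invalid value produces an error; here the message is captured for display.
res <- tryCatch(check_proportion(1.5), error = function(e) conditionMessage(e))
print(res)
```

Collecting the type, length, and range tests in one helper means every exported function reports invalid input with the same wording.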

I can’t really follow this error message:

  if (!flag) {
      "\nIt seems you are using a non-default cache folder at {cache_folder}. ",
      "Be aware that if any stock event -- split or dividend -- happens ",
      "in between cache files, the resulting aggregate cache data will not ",
      "correspond to reality as some part of the price data will not be ",
      "adjusted to the event. For safety and reproducibility, my suggestion ",
      "is to use cache system only for the current session with tempdir(), ",
      "which is the default option."

msperlin commented 2 years ago

Thanks @s3alfisc for the review! Appreciate it. Good ideas there.

I'll reply to all your comments in the next couple of days.

melvidoni commented 2 years ago

msperlin commented 2 years ago

Dear @s3alfisc , please find my replies below:

I think that yfR is a very promising package with useful features, and I believe that it will be widely used. I very much enjoyed using it! To improve the package, I mostly suggest to invest more time into refining the documentation.

Thanks, appreciate the feedback and the detailed review. Given your feedback and ideas, I've made many changes in the code and documentation.


Statement of need: I would like to see a more refined statement of need at the beginning of the readme: what is yfR’s main innovation? E.g. start with something like “yfR is an API to yahoo finance. It speeds up the data downloading process by parallel computing and local caching.” Then explain what type of data yahoo finance includes.

Also thanks. I changed the readme.rmd file so that the reader can quickly grasp how to use the package.

I would move the discussion of data quality / limitations of yahoo finance and comparison to BatchGetSymbols to separate articles - I don’t think they are required in the readme. If you want to keep the reference to quantmod, maybe include a dedicated ‘Acknowledgements’ section at the end of the readme? Occasionally, you use jargon: e.g., not all users might know what a ticker is. I would move all examples from the readme to the ‘get started’ vignette. Alternatively, I would keep only one example in the readme.

I reorganized the topics in the readme.rmd and moved some as vignettes.

In the ‘get started’ vignette, I would hide the message output generated e.g. by yf_get() and explain in words what the function does: e.g. it checks the cache, downloads data if the cache is empty, else finishes etc.

I would rather keep the yfR messages in the vignettes, as they mimic the actual call to the function. I also improved the text in the main vignette ("get started").

The vignette states that multiple ‘collections’ are organized in the package. It would be great to include a full list of collections to the docs, e.g. as a separate article? The yf_get_available_collections() helps here, but what do the individual collections stand for? E.g. does IBOV stand for the Bovespa-Index?

Great idea. I added argument print_description for yf_get_available_collections() for printing a text description of available collections:


I would like to see some documentation on how the caching works: e.g., where are files saved? For how long are they saved? Is the cache ever cleaned, e.g. are cached files lost by re-starting the R session?

I added a section at the help file of yf_get(), explaining how the cache system works.
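
To make the reviewer's questions concrete, here is a minimal sketch of a session-level file cache. It is not yfR's implementation: `cached_fetch()` and `fake_download()` are hypothetical names, and the fake downloader only stands in for a real request. Because the cache folder lives under R's temporary directory, the files disappear when the R session ends.

```r
# Minimal sketch of a session-level file cache (not yfR's actual code).
# `fetch_fun` stands in for whatever function really downloads the data.
cached_fetch <- function(key, fetch_fun, cache_folder) {
  dir.create(cache_folder, showWarnings = FALSE, recursive = TRUE)
  cache_file <- file.path(cache_folder, paste0(key, ".rds"))

  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # cache hit: no download needed
  }

  out <- fetch_fun(key)          # cache miss: fetch, then store for later calls
  saveRDS(out, cache_file)
  out
}

# Fake downloader that counts how often it is called.
n_calls <- 0
fake_download <- function(key) {
  n_calls <<- n_calls + 1
  data.frame(ticker = key, price = 100)
}

# A fresh folder inside the per-session temporary directory.
demo_folder <- tempfile("cache-demo-")

first  <- cached_fetch("AAA", fake_download, demo_folder)
second <- cached_fetch("AAA", fake_download, demo_folder)  # read from the .rds file
n_calls  # the downloader ran only once
```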


In the docs for yf_convert_to_wide, it would be good to print the initial long dataframe.
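
As an illustration of what printing the long data frame first would show, here is a toy example. The column names `ref_date`, `ticker` and `price_adjusted` and all values are assumptions made up for the sketch, and base R's `reshape()` stands in for `yf_convert_to_wide()`, whose exact output may differ.

```r
# Toy long-format price data: one row per ticker and date (made-up values).
long <- data.frame(
  ref_date = as.Date(c("2022-01-03", "2022-01-04", "2022-01-03", "2022-01-04")),
  ticker = c("AAA", "AAA", "BBB", "BBB"),
  price_adjusted = c(10, 11, 20, 19)
)
print(long)

# Wide format: one row per date, one price column per ticker.
wide <- reshape(long, idvar = "ref_date", timevar = "ticker", direction = "wide")
print(wide)
```

Showing both tables side by side makes it clear that the wide format holds the same values, rearranged so that each ticker gets its own column.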


The documentation of yf_get() does not really, as a stand-alone, explain what the function does: download ticker data from yahoo finance, caching, parallelism etc. I would delete the reference to getSymbols. Note that as yf_get_default_cache_folder() is not exported, users will run into an error when trying yfr::yf_get_default_cache_folder().

Documentation was improved.

Also, mention that the ticker function argument is vectorized


You could improve the documentation for parallelism: I myself have never used furrr, so your hint to furrr::plan() is not too helpful. How about a dedicated article with a small example that illustrates how to run get_plan() in parallel? Also, I only learned from browsing the code that by default, half of all available cores are used.

I think that going into parallelism and furrr::plan() would be off topic. However, I added a link to furrr in argument do_parallel, so that the user can learn more about it, if desired.
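
For readers in the same position as the reviewer, a small sketch of how a parallel backend is usually registered may help. Note that `plan()` is exported by the future package, which furrr builds on. The helper names `default_workers()` and `run_with_workers()` are hypothetical, and the "half of all cores" rule simply mirrors the reviewer's description of the default rather than yfR's actual code.

```r
# Hypothetical helper: use half of the available cores, but at least one.
default_workers <- function(n_cores = parallel::detectCores()) {
  if (is.na(n_cores)) {
    return(1L)  # core detection can fail on some systems
  }
  max(1L, as.integer(floor(n_cores / 2)))
}

# Hypothetical wrapper: run `work_fun` under a multisession plan and
# restore the previous plan afterwards. Requires the future package.
run_with_workers <- function(work_fun, n_workers = default_workers()) {
  old_plan <- future::plan(future::multisession, workers = n_workers)
  on.exit(future::plan(old_plan), add = TRUE)
  work_fun()
}

default_workers(8)
```

Under these assumptions, a download call with `do_parallel = TRUE` would be passed as `work_fun`; restoring the previous plan on exit keeps the user's session configuration untouched.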

What is the difference between a collection and an index?

A collection is just a bunch of tickers put together. An index can be a collection, but not all collections are indices.

Consider adding documentation of the data returned via yf_get(). Not being a financial economist, I for example have no idea what the price_adjusted column stands for. Beyond, what is the unit of measurement of the price variables? I suppose it is US Dollars? Further, what is the relationship between daily data and monthly data? Also, potentially add a note that when markets are closed, no data row will be created.

Done. New documentation is available at readme.rmd and also in help for yf_get().


Examples could be more 'verbose', i.e. add documentation. Also, examples could be more 'exhaustive' - they are quite minimal at the moment. The example for yf_convert_to_wide currently calls internal data - could you not simply attach the data set or load it?

I revised all examples, especially for the main function. I've made a few changes, but they look alright to me. Users can always check the vignettes for more details.

Installation, Local CMD Check & pkgcheck

Installation and CMD check pass without problems. I tried to run pkgcheck, but failed to get it to run. I suggest to run the pkgcheck action on github actions, at least for the time of the review.

I also failed to use pkgcheck on Linux (Ubuntu/Mint). I can't install its dependencies, despite spending some time trying hard.


Code Coverage is currently only at around 80% - I would love to see this up at 95%, if not 100 :)

I tried my best to cover as much as possible, reaching 82.99%. One big miss is the parallel computing part, which is not active in the current version (I removed it due to YF limits on API calls). A fix is in progress, but it depends on quantmod being on CRAN. I'll add the parallel tests once it is fixed.

The rest is just input error checking which, to me, feels fine to leave uncovered (covering it would just be a gimmick). So, I'll not reach 100%, but will be close.


All examples work very nicely. Overall, it was a lot of fun using the package! In general, the console output is very helpful and very pretty!

Great, thanks!

I am not sure if I would have default function arguments for first_date() and last_date(). If you want to keep it, I would change it from 15 days to one month.


yf_convert_to_wide() is super helpful - great idea to directly include it in the package!

Thanks. I know some people use the data that way, even though I don't like it.

Could the API be more permissive, e.g. accept dates with format dd-mm-yyyy?

I feel that ISO format is fine. This is the standard in R and users should probably adapt to it.
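
Users who hold dates as day-month-year strings can convert them before calling the package. A short sketch in base R follows; the helper name `as_iso_date()` is hypothetical.

```r
# Hypothetical helper: parse day-month-year strings into Date objects.
as_iso_date <- function(x, format = "%d-%m-%Y") {
  out <- as.Date(x, format = format)
  if (anyNA(out)) {
    stop("Could not parse all dates with format ", format, call. = FALSE)
  }
  out
}

first_date <- as_iso_date("21-06-2022")
format(first_date)  # "2022-06-21", i.e. ISO 8601
```

The resulting `Date` object prints in ISO format and can be passed on wherever a date argument such as `first_date` is expected.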

When trying the “SP500” collection example, I ran into several ‘error in download’ errors. Still, the function finished eventually with ‘binding price data’. What exactly is going on here? Did the function eventually manage to fetch all tickers? If no, could there be a final message, e.g. ‘300/500 tickers successfully fetched. To fetch all others, do this …’.

Good idea. I implemented the message. The user will now be aware of the relative percentage of tickers in the output data, when comparing to the requested vector of tickers. Whenever that is lower than 50%, a message tells the user to wait for 15 minutes before running it again.
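
The logic described above can be sketched as follows. The function name `report_fetch_share()` and the exact wording of the message are assumptions for illustration, not the package's actual output.

```r
# Hypothetical sketch: report the share of requested tickers that were fetched
# and suggest waiting when the share falls below a threshold.
report_fetch_share <- function(requested, fetched, threshold = 0.5) {
  found <- requested %in% fetched
  share <- mean(found)
  msg <- sprintf(
    "Got %d out of %d requested tickers (%.0f%%).",
    sum(found), length(requested), 100 * share
  )
  if (share < threshold) {
    msg <- paste(msg, "Consider waiting 15 minutes before running it again.")
  }
  message(msg)
  invisible(share)
}

share <- report_fetch_share(c("AAA", "BBB", "CCC", "DDD"), fetched = "AAA")
share  # 0.25
```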

[Images: console output for the "good" and "bad" cases]

I have seen that there is already a PR opened to alert users when they have reached the yahoo finance limit. This would indeed be a great feature!

We are working on this issue, already with a viable solution that should become official soon. Nonetheless, the package works fine in a single session in all my tests.

Additional Functionality

It would be great to add further collections, e.g. NASDAQ, DAX, SP30, FAANG etc

Yes! Definitely. The idea is to have something for everyone.

The equivalent Python package, yfinance, offers a range of additional functionality, e.g. data on dividends, stock splits, and institutional investors. Do you plan to incorporate any of these into the package in the future?

No. My proposal focuses on importing and organizing stock data only.

Currently, the cached files are saved in the rds file format via readr::write_rds(). There might be faster and/or more memory-friendly alternatives available. Have you considered adding a function argument that would allow users to store files e.g. in the parquet file format?

I believe that .rds files work fine for yfR (I never saw a performance issue). But I'll keep that in mind. Also, this is very easy to change in the future.

Have you considered integrating an autoplot function to plot stock prices? autoplot would e.g. generate plots similar to those created in the readme / vignette.

No, but I'll also keep it in mind.

Would it be possible to give an estimate of consumed memory of all cached files prior to a download? I would also consider to export yf_get_default_cache_folder() so that users are aware of the function and can easily check where yfR creates the cache.

Probably, but I feel that file size is not really an issue. The cache files are really small.

Nonetheless, I added a "Diagnostics" text at the end of the execution of yf_get. It includes the current size of cache files (see previous figure with output "Diagnostics").

Also, function yf_get_default_cache_folder() is now exported and available to users.


Do you need to export the magrittr pipe when using it internally?

This was implemented so yfR is compatible with R >= 4.0.0 (personally I prefer the new pipe).

I was not aware that exporting it is unnecessary (I simply used usethis::use_pipe() when creating the package). I also feel that no harm is done in allowing the user access to the pipe when loading yfR (I'm not aware of any conflicts).

I took a brief glance at the error messages, and most of them are clear and easy to understand. Maybe you could rephrase

Thanks, I fixed that.

In general, I really like the dreamerr package for function input type checks. checkmate seems to be very popular, too.

Thanks for the suggestion. I was not aware of this package. I'll have a look but, for the time being, I'll stay with the current code.

I can’t really follow this error message: "\nIt seems you are using a non-default cache folder at {cache_folder}. ",

I tried my best, but the explanation is more technical than what I can put in a message. What the user should know is that, for stocks, there is no guarantee that cache files can be merged without problems. This happens because external events, such as dividends, can alter the adjusted prices recursively. So, you can get a different adjusted price for the same ticker/day if the query is made on different days.

I changed the text so that the explanation is more clear.
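
A small worked example may make the issue easier to follow. It rests on a simplifying assumption that is not taken from the package: prices before the ex-dividend day are multiplied by 1 - dividend / previous close. All numbers are made up.

```r
close <- c(100, 102, 100, 101)  # raw closing prices on days 1 to 4
ex_day <- 3                     # a dividend of 2 goes ex on day 3
dividend <- 2

# Simplified back-adjustment (an assumption for this sketch): scale every
# price before the ex-dividend day by 1 - dividend / previous close.
adjust_for_dividend <- function(close, ex_day, dividend) {
  adj_factor <- 1 - dividend / close[ex_day - 1]
  adjusted <- close
  before <- seq_len(ex_day - 1)
  adjusted[before] <- close[before] * adj_factor
  adjusted
}

query_day2 <- close[1:2]  # queried on day 2: no dividend known yet
query_day4 <- adjust_for_dividend(close, ex_day, dividend)  # queried on day 4

# Merging the old cache (days 1-2) with a new download (days 3-4)
# mixes unadjusted and adjusted prices.
merged <- c(query_day2, query_day4[3:4])
rbind(merged = merged, fresh = query_day4)
```

In the merged series the price falls from 102 to 100 on day 3, which looks like a loss, whereas the freshly downloaded series shows 100 on both days because the dividend has been accounted for. The adjusted price for day 1 also differs between the two queries (100 versus roughly 98.04).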

The collections are created via hard coded (wikipedia) URLs. This is likely prone to errors - what if e.g. the URLs change? I understand the attractiveness of this ‘dynamic’ lookup, as e.g. the composition of stock indices might change over time. Maybe you could add a second look-up link (in case the main URL breaks), or you could add a ‘fallback’ data.frame containing the names of all firms included in an index at a fixed date to fall back to? See also this link on potential error handling of URLs via tryCatch.

The fallback dataframe is a great idea and I implemented it. I don't like the first option of a "backup" URL, as it requires more web scraping code, which can be very unstable and hard to maintain.

I also implemented argument force_fallback in yf_get_index_comp, which allows the user to read the offline files directly.
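
The fallback idea can be sketched with `tryCatch()` as below. This is an illustration under assumptions, not the code of `yf_get_index_comp`: the helper `get_composition()`, its `fetch_fun` argument and the toy data are hypothetical, and only the argument name `force_fallback` is borrowed from the reply above.

```r
# Hypothetical sketch: try the online lookup first and fall back to a stored
# data frame if it fails or if the user asks for the offline data directly.
get_composition <- function(fetch_fun, fallback, force_fallback = FALSE) {
  if (force_fallback) {
    return(fallback)
  }
  tryCatch(
    fetch_fun(),
    error = function(e) {
      warning(
        "Online lookup failed (", conditionMessage(e), "); using fallback data.",
        call. = FALSE
      )
      fallback
    }
  )
}

fallback_df <- data.frame(ticker = c("AAA", "BBB"))
broken_fetch <- function() stop("URL not reachable")

# The failing lookup is replaced by the stored composition.
comp <- suppressWarnings(get_composition(broken_fetch, fallback_df))
print(comp)
```

Emitting a warning rather than failing silently lets users know that the composition may be out of date.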

My last comment (repeating something I mentioned above): the equivalent python package is called yfinance. Maybe a better / SEO optimized name for the package would be yfinanceR?

I really like the name yfR. It's short and easy to remember. But thanks for the suggestion.

msperlin commented 2 years ago

All changes are in the main branch.

thisisnic commented 2 years ago

I am currently working on my review of this package, and hope to finish it in the next few days if nothing unexpected comes up! I had an issue when I was running the examples in the vignette though, and so to deliver partial feedback which might be useful in the meantime, I've opened this issue relating to it on the project repo: msperlin/yfR#11

melvidoni commented 2 years ago

I am currently working on my review of this package, and hope to finish it in the next few days if nothing unexpected comes up! I had an issue when I was running the examples in the vignette though, and so to deliver partial feedback which might be useful in the meantime, I've opened this issue relating to it on the project repo: msperlin/yfR#11

Hello Nicola, that's great, thank you!