sckott / cchecksapi

CRAN checks API (DEFUNCT)
https://github.com/r-hub/cchecksbadges
MIT License

maybe grab copy of cran and scrape from it locally #9

Closed: sckott closed this issue 6 years ago

sckott commented 7 years ago

Instead of scraping via web requests, we could grab a copy of CRAN and scrape it locally, though that may be too much complexity.

https://cran.r-project.org/mirror-howto.html

fmichonneau commented 7 years ago

All the data for the check results are stored in rds files available from CRAN. There are two main files: one with just the summary of the checks (https://cran.r-project.org/web/checks/check_results.rds), and another with the details of all the check results (https://cran.r-project.org/web/checks/check_details.rds). I have (unexported) functions in foghorn to download and read them: https://github.com/fmichonneau/foghorn/blob/master/R/cran_files.R

So if you can have a cron job that downloads these files regularly, and a way to make the data available to the API, then there is no need to rely on scraping the website. One of the reasons I opted for plumber (https://github.com/fmichonneau/cranstatus/) was to be able to serve this data directly.
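
For reference, a minimal sketch of fetching and reading those two files (plain base R, not foghorn's internal code; the URLs are the ones linked above):

```r
# Sketch only: download and read the CRAN check-result rds files.
read_cran_rds <- function(url) {
  tmp <- tempfile(fileext = ".rds")
  utils::download.file(url, tmp, quiet = TRUE)
  readRDS(tmp)
}

checks  <- read_cran_rds("https://cran.r-project.org/web/checks/check_results.rds")
details <- read_cran_rds("https://cran.r-project.org/web/checks/check_details.rds")
str(checks)  # one row per package/flavor check summary
```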

sckott commented 7 years ago

Thanks for letting me know, @fmichonneau. I didn't know about the rds files. Though I guess that means I would have to include R in the stack, which there isn't right now. Hmmm. The scraping isn't ideal, but I only do it once a day. How often are the rds files updated? I had assumed they only ran checks once a day.

fmichonneau commented 7 years ago

My understanding is that it's updated each time a package check completes, so probably many times a day. And yeah, you'd have to include R (unless there is a way to extract the data from the rds files outside of R?).
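
One way to keep R out of the serving path would be to have the cron-run R script dump the data to JSON, which any stack can read. A sketch, assuming the jsonlite package and illustrative file paths:

```r
# Sketch: cron-run R script that converts the rds data to JSON so a
# non-R stack (e.g. the existing API) can consume it. Paths are illustrative.
library(jsonlite)

checks <- readRDS("check_results.rds")
write_json(checks, "check_results.json", dataframe = "rows", auto_unbox = TRUE)
```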

sckott commented 6 years ago

With the new async http requests and parallel html parsing, I think it's fast enough now.

Package scraping is down to about 4 minutes, and maintainer scraping to about 10.
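
(For illustration only, since the API itself isn't written in R: fetching several per-package check pages concurrently looks roughly like this with the crul package. The package names are arbitrary examples.)

```r
# Illustration: fetch CRAN check pages concurrently, then parse the bodies.
library(crul)

pkgs <- c("dplyr", "ggplot2", "jsonlite")
urls <- sprintf("https://cran.r-project.org/web/checks/check_results_%s.html", pkgs)
res <- Async$new(urls = urls)$get()   # requests go out concurrently
pages <- vapply(res, function(x) x$parse("UTF-8"), character(1))
```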