usa-npn / rnpn

R client for the National Phenology Network database API
https://rdrr.io/cran/rnpn/

R stuck in reading npn intensity and status data #29

Closed tdlan80 closed 1 year ago

tdlan80 commented 1 year ago

I downloaded status and intensity data for the 2009-2021 period from the NPN web portal, and I am having trouble reading it into R. I first tried:

library(readr)

eastUS <- read_csv(file = "webPortalData/status_intensity_observation_data.csv")

The script runs and then gets stuck for hours. R does not become unresponsive, but it makes no progress even after the progress bar shows eta: 0s.

I also tried several other reading functions:

library(vroom)

eastUS <- vroom(file = "webPortalData/status_intensity_observation_data.csv")

Still, no success in reading it.

The progress bar is stuck at:

indexing status_intensity_observation_data.csv [==================================================================---] 208.91MB/s, eta: 1s

vroom is better at reading large datasets than read_csv. These functions work fine for other CSV files; it is only the portal download that gives me this issue. I am trying to check whether I get the same data from the web portal download and from the rnpn package pull, which is why I need this odd way of getting the status and intensity data.

Do you recommend a better package for reading the web portal data into R? I am not sure whether this is due to the large file size/memory or to a problem parsing the CSV (some odd field separation issue).
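
One way to tell a size problem from a parsing problem is to look at the file before reading all of it. The sketch below is only an illustration: inspect_csv is a hypothetical helper (not part of rnpn, readr or vroom), and the small demo file stands in for the portal download. It reports the file size and counts the comma-separated fields on the first lines; if every line has the same count, odd field separation is less likely to be the cause, at least in the part of the file that was checked.

```r
# Hypothetical helper: inspect a CSV cheaply before attempting a full read.
inspect_csv <- function(path, n_lines = 1000) {
  size_mb <- file.size(path) / 1024^2

  # Read only the first n_lines lines (header included), not the whole file.
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  head_lines <- readLines(con, n = n_lines, warn = FALSE)

  # Count fields per line, treating double quotes like read.csv does.
  tc <- textConnection(head_lines)
  on.exit(close(tc), add = TRUE)
  fields <- count.fields(tc, sep = ",", quote = "\"", comment.char = "")

  list(
    size_mb = size_mb,
    lines_checked = length(head_lines),
    field_counts = table(fields, useNA = "ifany")
  )
}

# Demo on a tiny stand-in file (an assumption; replace with the real path).
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,species,status",
    "1,\"Pinus, strobus\",1",
    "2,Acer rubrum,0"),
  demo_path
)
demo_info <- inspect_csv(demo_path)
print(demo_info$field_counts)
```

A single value in field_counts means the checked lines are consistent; several values, or NA entries, point to quoting or separator problems worth looking at before blaming memory.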

alyssarosemartin commented 1 year ago

Hi Thilina -

I tried a test with a few years of Pinus strobus data (38K records of Status and Intensity) and it read in fine, with the base R read.csv function, so I think it's to do with the massive file/memory?

test <- read.csv("PinusStrobusData.csv")

tdlan80 commented 1 year ago

Strangely, base R read.csv worked, although it took time. I also ran data.table::fread: at first it failed due to lack of virtual memory, but it ran in under a minute once I freed memory with gc(). I originally didn't run read.csv since that function is not verbose at all. In future, I might have to chunk the CSV file before reading it.
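
The chunking idea mentioned above could look roughly like this in base R. This is a minimal sketch, not a tested recipe for the portal file: read_csv_in_chunks and its arguments are hypothetical names, and the demo file is a stand-in. It keeps one connection open, reads a fixed number of rows at a time, and passes each chunk to a callback so that only one chunk has to sit in memory.

```r
# Hypothetical helper: read a CSV in fixed-size chunks and apply `callback` to each.
read_csv_in_chunks <- function(path, chunk_rows = 100000, callback) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)

  # Read the header line once so every chunk gets the same column names.
  header_line <- readLines(con, n = 1, warn = FALSE)
  col_names <- scan(text = header_line, what = "", sep = ",",
                    quote = "\"", quiet = TRUE)

  results <- list()
  repeat {
    # read.csv continues from the current position of the open connection.
    # It raises an error once no lines remain; that is treated as "done" here.
    # Note: this also hides genuine parse errors, so it is only a sketch.
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_rows,
               col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    results[[length(results) + 1]] <- callback(chunk)
    if (nrow(chunk) < chunk_rows) break  # last, partial chunk
  }
  results
}

# Demo on a tiny stand-in file (an assumption; replace with the real path).
demo_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, status = c(1, 0, 1, 1, 0)),
          demo_csv, row.names = FALSE)
chunk_sizes <- read_csv_in_chunks(demo_csv, chunk_rows = 2, callback = nrow)
print(unlist(chunk_sizes))
```

Column types are guessed per chunk here, so they may differ between chunks; passing colClasses to read.csv would avoid that. readr::read_csv_chunked and the select, nrows and skip arguments of data.table::fread are alternatives that may be faster on a file of this size.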