ropensci / visdat

Preliminary Exploratory Visualisation of Data
https://docs.ropensci.org/visdat/

vis_miss failing to read 60MB data set #114

Closed raviswanath closed 5 years ago

raviswanath commented 5 years ago

Hey,

I tried running vis_miss on a 61 MB data set. The first attempt failed with the error below:

Code: vis_miss(x = MPN_data)

Error: Error in vis_miss(MPN_data): Data exceeds recommended size for visualisation, please consider downsampling your data, or set argument 'warn_large_data' to FALSE.

I set the warn_large_data argument to FALSE and ran the code again, which gave the error below:

Code: vis_miss(x = MPN_data, warn_large_data = FALSE)

Error: Error in mutate_impl(.data, dots): Evaluation error: argument "x" is missing, with no default.

I read through the documentation and it mentions a default size limit of 900,000 (an integer). Is that the size of the file in KB, MB, or something else?

njtierney commented 5 years ago

Hi There!

Could you try running your example with the reprex package, unless you would rather not share an image of your data?

I've put together a reprex here with an 80 MB dataset on the current dev version of visdat, and unfortunately I cannot replicate your issue.

library(visdat)

vis_miss(airquality,
         warn_large_data = FALSE)


# install nycflights13 once if it is not already available
install.packages("nycflights13")

library(nycflights13)
pryr::object_size(flights)
#> 40.7 MB

vis_miss(flights, 
         warn_large_data = FALSE)


# double the size
flights_2 <- dplyr::bind_rows(flights, flights)

pryr::object_size(flights_2)
#> 81.1 MB

vis_miss(flights_2, warn_large_data = FALSE)

Created on 2019-03-25 by the reprex package (v0.2.1)

I read through the documentation and it mentions a default size limit of 900,000 (an integer). Is that the size of the file in KB, MB, or something else?

The value of 900,000 is the number of values (cells) in the data frame, that is, rows times columns. So 100 rows and 100 columns gives 10,000, and 1,000 rows and 100 columns gives 100,000. I should make that more explicit in the documentation.
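As a rough sketch of that arithmetic in base R: `n_cells` below is a hypothetical helper written for illustration, not a visdat function, and the 900,000 threshold is simply the default value mentioned above.

```r
# Hypothetical helper (not part of visdat): count the cells in a data frame.
# as.numeric() avoids integer overflow for very large data frames.
n_cells <- function(df) {
  as.numeric(nrow(df)) * ncol(df)
}

# airquality ships with R: 153 rows x 6 columns
n_cells(airquality)
#> [1] 918

# Compare against the 900,000 default discussed above
n_cells(airquality) > 900000
#> [1] FALSE
```

The same check on your own data frame shows how far over the limit it is before you call vis_miss.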

At the moment, data size is a difficult issue to solve. visdat draws a rectangle for every cell in the data, which becomes computationally expensive, so what counts as a "large" amount of data differs from machine to machine. In the future there will be some computational improvements where large rectangles are drawn and lines are added only for the missing values (see #59). This should hopefully resolve some of these large data issues.
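In the meantime, the downsampling suggested by the original error message could look something like the sketch below. It uses only base R; `downsample_rows` and the simulated `big` data frame are hypothetical names made up for this example, and the 900,000-cell budget is just the default discussed above.

```r
# Hypothetical helper (not part of visdat): randomly keep only as many rows
# as fit within a cell budget, preserving the original row order.
downsample_rows <- function(df, max_cells = 900000) {
  max_rows <- floor(max_cells / ncol(df))
  if (nrow(df) <= max_rows) {
    return(df)
  }
  keep <- sort(sample(nrow(df), max_rows))
  df[keep, , drop = FALSE]
}

set.seed(2019)  # make the random sample reproducible

# Simulated stand-in for a large data set: 1,000,000 rows x 3 columns
big <- data.frame(
  a = rnorm(1e6),
  b = sample(c(NA, 1:9), 1e6, replace = TRUE),
  c = sample(c(NA, letters), 1e6, replace = TRUE)
)

small <- downsample_rows(big)
nrow(small)
#> [1] 300000

# Plot the sample if visdat is installed
if (requireNamespace("visdat", quietly = TRUE)) {
  visdat::vis_miss(small)
}
```

A random sample keeps the overall proportion of missing values roughly intact, but it can hide patterns that depend on runs of adjacent rows, so treat the plot as an approximation of the full data.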