ropensci / visdat

Preliminary Exploratory Visualisation of Data
https://docs.ropensci.org/visdat/

nothing is returned when warn_large_data = FALSE #54

Closed verajosemanuel closed 6 years ago

verajosemanuel commented 7 years ago

My df has 30,000 rows, so I set warn_large_data to FALSE:

visdat::vis_miss(df, warn_large_data = FALSE)

And all I get is this empty grid:

[Image: screenshot of the empty plot grid]

njtierney commented 7 years ago

Hi @verajosemanuel !

Thanks for posting this - have you tried downsampling your data?

Perhaps some code such as

library(dplyr)
library(visdat)
# plot a random 10% sample of the rows of your data frame (here called df)
df %>% sample_frac(0.1) %>% vis_dat()

could work?
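If a fixed fraction is awkward, another option is to cap the number of rows instead. The following is only a minimal sketch: `downsample_rows()` is a hypothetical helper, not part of visdat or dplyr, and the toy data frame merely stands in for the 30,000-row `df` from the report.

```r
# Hypothetical helper (not part of visdat): keep at most `max_rows` rows,
# sampled at random without replacement, preserving the original row order.
downsample_rows <- function(df, max_rows = 10000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  if (nrow(df) <= max_rows) return(df)
  keep <- sort(sample.int(nrow(df), max_rows))
  df[keep, , drop = FALSE]
}

# Toy stand-in for the 30,000-row data frame in the report
set.seed(1)
df <- data.frame(
  x = c(NA, rnorm(29999)),
  y = sample(c(NA, letters), 30000, replace = TRUE)
)

small <- downsample_rows(df, max_rows = 5000, seed = 1)
dim(small)

# Then plot the smaller data, for example:
# visdat::vis_miss(small)
```

Sorting the sampled indices keeps the rows in their original order, so any pattern of missingness along the rows should still be roughly visible in the plot.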

I think the error message needs to make clearer that whether data of this size can be visualised depends largely on the computing environment. For example, my machine can run the code below and produce the graphics, but someone with a less powerful laptop or PC may not be able to.

library(visdat)

# fake large data: 100,000 rows by 10 columns

fake_large <- tibble::as_tibble(matrix(1:1e6, nrow = 1e5))

vis_dat(fake_large)
#> Error in vis_dat(fake_large): Data exceeds recommended size for visualisation, please consider
#>          downsampling your data, or set argument 'warn_large_data' to FALSE.
vis_dat(fake_large, warn_large_data = FALSE)


library(nycflights13)

vis_dat(flights)
#> Error in vis_dat(flights): Data exceeds recommended size for visualisation, please consider
#>          downsampling your data, or set argument 'warn_large_data' to FALSE.
vis_dat(flights, warn_large_data = FALSE)

This is a difficult problem to debug because it usually depends on the computing system, which is why we implemented this error message, but we could probably be clearer that setting warn_large_data = FALSE does not necessarily mean your visualisation will work.
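To illustrate why turning the check off guarantees nothing, here is a rough sketch of how a size guard of this kind could look. It is not visdat's actual code: `check_data_size()` and the `max_cells` cut-off are hypothetical, and the real threshold may differ.

```r
# Hypothetical cut-off; the threshold visdat actually uses may differ.
max_cells <- 1e5

# Hypothetical check, loosely mirroring the behaviour shown above:
# stop when the data has more cells than the cut-off, unless the
# caller has switched the check off.
check_data_size <- function(df, max_cells, warn_large_data = TRUE) {
  n_cells <- nrow(df) * ncol(df)
  if (warn_large_data && n_cells > max_cells) {
    stop("Data exceeds recommended size for visualisation, please consider ",
         "downsampling your data, or set argument 'warn_large_data' to FALSE.",
         call. = FALSE)
  }
  invisible(n_cells)
}

big <- as.data.frame(matrix(1:1e6, nrow = 1e5))  # 100,000 rows x 10 columns

# With the check on, the call stops before any plotting would happen
res <- try(check_data_size(big, max_cells), silent = TRUE)
inherits(res, "try-error")

# With the check off, the call goes through and returns the cell count
n <- check_data_size(big, max_cells, warn_large_data = FALSE)
```

A guard like this only decides whether plotting is attempted. Skipping it leaves the same number of cells for the graphics stack to draw, so the plot can still fail or come out empty.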

My current understanding is that this limitation is down to the capabilities of:

  1. an individual's machine, and
  2. ggplot2

Future implementations of visdat will incorporate plotly libraries, which might be more capable of handling larger datasets.

Let me know if you have any questions! :)

verajosemanuel commented 7 years ago

Downsampling works. In my case computational capacity is not a problem: I have plenty of RAM (64) and all the processors I need. To check whether there's a limitation, I tried the package "narnia" (yes, that's what it's called) to get a glimpse of the NA values with:

narnia::gg_miss_var(df)

Worked flawlessly. visdat is a great package and I use it often.

thanks a lot.

njtierney commented 7 years ago

OK that is interesting to note - @karawoo, do you know if this sort of problem could be down to grid or ggplot? I feel like perhaps the most likely answer is the way that I have coded visdat.

@verajosemanuel I'm glad to know that naniar::gg_miss_var(df) has been useful for you - note that it is now called naniar (the name was changed a few times but is now settled).

verajosemanuel commented 7 years ago

Yeah, I know, but somehow it failed the first time I tried to install it. I'll give it another try. Something came to mind: why have two packages with similar features? Why not merge visdat and naniar?

regards

njtierney commented 7 years ago

Interesting!

If you have an installation problem on naniar please file an issue :)

Good question.

visdat is designed to solve a narrow scope of problem - visualising whole dataframes as a preliminary visualisation. By reducing it to this particular issue it makes the package simpler to maintain, as the package only deals with these kinds of visualisations.

naniar is designed to deal with missing data in R and is much larger in scope than exploratory visualisation: it provides a framework to explore and analyse missing data.

njtierney commented 6 years ago

Let me know if you have any further questions @verajosemanuel :) Just tidying up issues now, but feel free to let me know if you want to reopen it.

peeter-t2 commented 5 years ago

This may be a possible route for an update. I had trouble with a large data frame too (200,000 rows, 10 columns, and I didn't want to take a random sample) and was able to solve it by reusing the code from the vis_dat and vis_miss functions and replacing geom_raster with geom_tile. I'm not sure why, but with geom_tile the plot no longer failed to render. Might be worth looking into.
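For anyone wanting to try the same idea, a minimal sketch of a missingness plot drawn with geom_tile might look like the following. It does not reuse visdat's internal code; the toy data, the `miss_long` reshaping and the plot layout are all assumptions made for illustration.

```r
# Toy data with some missing values (stands in for a large data frame)
set.seed(42)
df <- data.frame(
  a = sample(c(NA, 1:5), 200, replace = TRUE),
  b = sample(c(NA, "x", "y"), 200, replace = TRUE),
  c = rnorm(200)
)

# Reshape to one row per cell: row index, column name, missing flag.
# is.na() on a data frame returns a logical matrix; as.vector() unrolls it
# column by column, which matches the order of the two rep() calls.
miss_long <- data.frame(
  row      = rep(seq_len(nrow(df)), times = ncol(df)),
  variable = rep(names(df), each = nrow(df)),
  missing  = as.vector(is.na(df))
)

# Draw one tile per cell; geom_tile is used here in place of geom_raster
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(miss_long, aes(x = variable, y = row, fill = missing)) +
    geom_tile() +
    scale_y_reverse() +
    labs(x = NULL, y = "Row", fill = "Missing")
  print(p)
}
```

Whether this avoids the empty-grid problem on a given machine is untested here; it only shows the substitution being described.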