rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables
https://rstudio.github.io/pointblank/

`ggplot2::ggsave()` error with `pointblank::scan_data()` #515

Open Thiyaghessan opened 4 months ago

Thiyaghessan commented 4 months ago


Description

Error triggered when executing `scan_data()` on a data.table object with 34 rows and 268 columns:

Error in `ggplot2::ggsave()`:
! Dimensions exceed 50 inches (`height` and `width` are specified in inches not pixels).
ℹ If you're sure you want a plot that big, use `limitsize = FALSE`.
Run `rlang::last_trace()` to see where the error occurred.

`rlang::last_trace()` output:

Backtrace:
     ▆
  1. ├─pointblank::scan_data(charities_pc_dat[[1]])
  2. │ └─pointblank:::build_table_scan_page(...)
  3. │   └─sections %>% ...
  4. └─base::lapply(...)
  5.   └─pointblank (local) FUN(X[[i]], ...)
  6.     └─pointblank:::probe_interactions_assemble(data = data, lang = lang)
  7.       ├─base::suppressWarnings(probe_interactions(data = data))
  8.       │ └─base::withCallingHandlers(...)
  9.       └─pointblank:::probe_interactions(data = data)
 10.         └─ggplot2::ggsave(...)

Reproducible example

URL <- ""https://nccsdata.s3.amazonaws.com/harmonized/core/CORE-2009-501C3-CHARITIES-PC-HRMN.csv"

data <- data.table::fread( URL )

pointblank::scan_data( dat )

Expected result

scan_data() should have returned the HTML output.

Session info

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

yjunechoe commented 4 months ago

You get that error because the data has 34 rows and 268 columns, and some of the plots automatically generated by scan_data() map those 268 columns to the axes or facets. This results in an extremely large and dense plot, which causes the error you see in ggsave().

If you only want the sections of scan_data() that do not produce plots, I think scan_data(data, sections = "OVS") would do it. If you do want plots for correlations, missing values, etc., I'd look into whether you can collapse or pivot-longer some of the columns that you have.
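A minimal sketch of that workaround, reusing `data` from the reproducible example above (the `sections` letters select report sections; the interactions section is the one implicated in the backtrace):

```r
# Render only the overview ("O"), variables ("V"), and sample ("S")
# sections, skipping the plot-heavy sections that map all 268 columns
# onto the axes or facets.
pointblank::scan_data(data, sections = "OVS")
```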

rich-iannone commented 4 months ago

I think it might be good here to introduce a limit (maybe 10?) on the number of columns used in these parts of the scan data report. That would at least make the function work without failing under the default options. On top of this, it would be useful to have a `columns` argument of some sort that lets the user choose which columns are used in these reporting parts.

The eventual goal, I think, is to have these report sections become a bit more scalable with larger amounts of data (perhaps using gt to arrange things, so you'd get scrolling and not these very tiny subplots).
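In the meantime, a plain-subsetting sketch achieves something like the proposed limit manually (this is not a pointblank API, and the choice of the first ten columns is arbitrary):

```r
# Interim approach: scan a manageable subset of columns at a time until
# something like a `columns` argument exists. Taking the first ten
# columns here is purely for illustration.
pointblank::scan_data(data[, 1:10])
```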

SpikyClip commented 2 months ago

Anyone find any workaround for this, or a way to supply the limitsize arg?

I think it would be really useful for a tool like scan_data to be able to handle more columns, as those are exactly the scenarios where it's useful to have a script break down which columns contain useful or sparse information, so you can then subset the dataset.

yjunechoe commented 2 months ago

@SpikyClip Thanks for this perspective - it's helpful to know that scan_data() is useful for determining the importance of variables prior to subsetting.

As Rich mentioned above, scan_data() will need some rework to accommodate larger data frames, because currently some sections of the report (like the matrix plot) do not scale well with many columns. For now, scan_data(data, sections = "OVS") is a workaround that renders only the sections of the report that easily handle many columns.

We could patch in a workaround that lets users supply the limitsize argument, but the fundamental challenge runs deeper: toggling limitsize off would make the error go away, but your report could end up as a huge self-contained HTML file weighing hundreds of megabytes.
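For anyone who needs the toggle today, here is an unsupported sketch that uses base R's trace() to inject `limitsize <- FALSE` at the top of ggplot2::ggsave() (use at your own risk, and note the file-size caveat above):

```r
# Unsupported hack: force limitsize = FALSE inside ggplot2::ggsave() for
# the duration of the scan. The resulting self-contained HTML report can
# be enormous.
trace(
  "ggsave",
  tracer = quote(limitsize <- FALSE),
  where = asNamespace("ggplot2")
)
pointblank::scan_data(data)
untrace("ggsave", where = asNamespace("ggplot2"))
```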

Happy to hear any suggestions on this!

SpikyClip commented 2 months ago

> We could patch in a workaround that lets users supply the limitsize argument, but the fundamental challenge runs deeper: toggling limitsize off would make the error go away, but your report could end up as a huge self-contained HTML file weighing hundreds of megabytes.

I see, that makes sense. What if, past a certain (arbitrary) number of columns, it rendered the plot data as a filterable data table rather than a plot? That would still let the user view the key information when necessary (e.g. which columns have the most NA values, which columns correlate most strongly), and it would hopefully avoid the ggsave error. I don't think(?) the underlying tables would take up too much space, but that may require testing.
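A rough sketch of that idea in plain base R plus gt (nothing pointblank-specific; per-column NA counts stand in for whatever summary the report would show):

```r
# Summarise per-column missingness as a table instead of a plot.
na_tbl <- data.frame(
  column    = names(data),
  n_missing = vapply(data, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)
na_tbl <- na_tbl[order(-na_tbl$n_missing), ]

gt::gt(na_tbl)           # static table, as suggested above
# DT::datatable(na_tbl)  # or an interactive, filterable table
```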