Thiyaghessan opened 9 months ago
You get that error because the data has 34 rows and 268 columns, and some of the plots automatically generated by `scan_data()` map those 268 columns to the axes or facets. This results in an incredibly large, dense plot, which causes the error you see in `ggsave()`.
If you only want sections from `scan_data()` that do not produce plots, I think `scan_data(data, sections = "OVS")` would do it. If you do want plots for correlations, missing variables, etc., I'd look into whether you can collapse or pivot-longer some of the columns that you have.
I think it might be good here to introduce a limit (maybe 10?) on the number of columns used in these parts of the scan data report. This would at least make the function work without failing on the default options. On top of this, it would be useful to have a `columns` arg of some sort that allows the user to choose what's being used in these reporting parts.
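As a rough illustration of the proposed cap, a user-side pre-filter could look like this. `cap_columns()` is a hypothetical helper, not part of pointblank's API:

```r
# Hypothetical pre-filter mimicking the proposed column limit:
# keep at most `limit` columns before handing the data to scan_data()
cap_columns <- function(df, limit = 10) {
  df[, seq_len(min(ncol(df), limit)), drop = FALSE]
}

wide_df <- as.data.frame(matrix(rnorm(34 * 268), nrow = 34))
ncol(cap_columns(wide_df))  # 10
```

A built-in `columns` argument would presumably do the same selection internally, ideally with a warning when columns are dropped.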
The eventual goal, I think, is to have these report sections become a bit more scalable with larger amounts of data (perhaps using gt to arrange things, so you'd get scrolling and not these very tiny subplots).
Anyone find any workaround for this, or a way to supply the `limitsize` arg?
I think it would be really useful for a tool like `scan_data()` to be able to handle more columns, as those are the scenarios where it's useful to have a script break down which columns contain useful/sparse information, so you can then subset datasets.
@SpikyClip Thanks for this perspective - it's helpful to know that `scan_data()` is useful for determining the importance of variables prior to subsetting.
As Rich mentioned above, `scan_data()` will need some rework to accommodate larger data frames, because currently some sections of the report (like the matrix plot) do not scale well with many columns. For now, `scan_data(data, sections = "OVS")` is a workaround to render only the sections of the report that easily handle many columns.
We could patch in a workaround for letting users supply the `limitsize` argument, but the fundamental challenge seems to run deeper. (Toggling `limitsize` off would make the error go away, but your report could end up as a huge self-contained HTML file hundreds of megabytes in size.)
Happy to hear any suggestions on this!
> We could patch in a workaround for letting users supply the `limitsize` argument, but the fundamental challenge seems to run deeper. (Toggling `limitsize` off would make the error go away but your report could end up a huge self-contained HTML file in hundreds of megabytes.)
I see, that makes sense. What if, past a certain arbitrary number of columns, it renders the plot data as a filterable data table rather than a plot? Would still allow the user to view the key information if necessary (e.g. which columns have the most NA values, which columns correlate the most), and it'll hopefully catch the ggsave error. I don't think(?) the underlying tables would take up too much space but it may require testing.
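The table fallback suggested above could be prototyped in plain R. This is only a sketch of the idea, not pointblank code; `na_summary()` is a hypothetical helper that produces the per-column missingness data a filterable table would display:

```r
# Hypothetical fallback: past a column threshold, summarize per-column
# missingness as a plain table instead of drawing a plot
na_summary <- function(df) {
  data.frame(
    column = names(df),
    n_na   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    pct_na = vapply(df, function(x) 100 * mean(is.na(x)), numeric(1)),
    row.names = NULL
  )
}

df  <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 7:9)
tbl <- na_summary(df)
tbl[order(-tbl$n_na), ]  # columns with the most NA values first
```

A data frame like this could then be rendered as an interactive, filterable table (e.g. via gt), sidestepping `ggsave()` entirely for wide data.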
Prework
Description
Error triggered when executing `scan_data()` on a `data.table` object with 34 rows and 268 columns:
```
Error in `ggplot2::ggsave()`:
! Dimensions exceed 50 inches (`height` and `width` are specified in inches not pixels).
ℹ If you're sure you want a plot that big, use `limitsize = FALSE`.
Run `rlang::last_trace()` to see where the error occurred.
```
`rlang::last_trace()` output:
Reproducible example
Expected result
`scan_data()` should have returned the HTML output.
Session info
R version 4.2.0 (2022-04-22 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 22621)