ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Update "Supporting Additional Objects" vignette #575

Closed kylebutts closed 4 years ago

kylebutts commented 4 years ago

A few changes:

  1. Since sf geometry columns have class c("sfc_GEOMETRYTYPE", "sfc"), we can define an sfl for "sfc" that will be used when no sfl exists for the specific "sfc_GEOMETRYTYPE".

  2. I made the example a bit more relevant for sf objects, namely summarizing the CRS (projection) being used. I do think an sfl for sf objects should be included in skimr, but that could be because I spend a lot of time working with spatial objects ;).

  3. There is mention of skim_by_type.sfc_MULTIPOLYGON, which I don't think needs to be in here unless I misunderstand skimr's code. The "skim_with" factory should already use the base sfls for numerics, factors, characters, etc., and as you can see, skim_sf() works automatically without the skim_by_type function.
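
The fallback described in point 1 can be pictured as walking the class vector from most to least specific and using the first class that has summary functions registered. The following is a minimal base-R sketch of that idea only; it is not skimr's actual lookup code, and `registry`, `find_skimmers`, and `n_geometries` are hypothetical names introduced for illustration.

```r
# Hypothetical registry of summary functions keyed by class name.
# Only the general "sfc" class is registered, not "sfc_MULTIPOLYGON".
registry <- list(
  numeric = list(mean = mean),
  sfc     = list(n_geometries = length)
)

# Walk class(x) from most to least specific and return the first class
# that has an entry in the registry (NULL if none does).
find_skimmers <- function(x, registry) {
  for (cls in class(x)) {
    if (cls %in% names(registry)) {
      return(list(matched_class = cls, funs = registry[[cls]]))
    }
  }
  NULL
}

# Stand-in for a geometry column: an empty list carrying the same class
# vector that sf gives a MULTIPOLYGON column (no sf installation needed).
geom <- structure(list(), class = c("sfc_MULTIPOLYGON", "sfc"))

found <- find_skimmers(geom, registry)
found$matched_class
```

Because "sfc_MULTIPOLYGON" has no entry, the lookup falls through to "sfc". Registering an entry under the more specific name would take precedence in this sketch.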

codecov-io commented 4 years ago

Codecov Report

Merging #575 into develop will not change coverage. The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop     #575   +/-   ##
========================================
  Coverage    96.09%   96.09%           
========================================
  Files           13       13           
  Lines          563      563           
========================================
  Hits           541      541           
  Misses          22       22

Last update a3aed1f...1d24787.

elinw commented 4 years ago

This is great! Thanks.

elinw commented 4 years ago

There are really two separate issues. One is skimming a data frame or tibble that has some columns of non-standard types. The other is skimming a completely different kind of object without coercing it to a data frame and losing crucial information. When I wrote the vignette I went through the whole process multiple times and found that I needed this step.

Just to expand, since maybe the text is not clear on this: it is complex and confusing if you haven't thought it through.

To skim a column of a particular type in a data frame you use skim_with().

To skim an object of a class that is not a data frame, without coercing it to a data frame (actually a tibble), you need to create a skim_by_type.class() method for that kind of object. You would do this because a simple features object that is coerced to a data frame/tibble is going to lose crucial information.

It does not help that in simple features they have the same name.

kylebutts commented 4 years ago

I think I'm confused by the skim_by_type.class() function. Is the key purpose to: (1) grab information from the object and then (2) coerce to data frame/tibble and skim? Do you have any examples you worked through when writing the vignette? I imagine the problem would be if sf objects, for example, stored the projection not in the column, but in the entire object.

Also, the vignette says "Creating a function that is a method of the skim_by_type generic for the data type allows skimming of an entire data frame that contains some columns of that type", which I don't think matches your description, so I'll try to rewrite that once I figure out skim_by_type.

elinw commented 4 years ago

The whole vignette is one big example. But here's something interesting: with the new dplyr v1, groups(nc) returns geometry, and ungrouping does not seem to help.

elinw commented 4 years ago

One other question. Is there any situation in which the different geometry types would have different statistics? I know you can only have one sf column per sf object, but say I am a package developer and I want to support any kind of geometry that a user throws at me: would I possibly have different sfls for them?

elinw commented 4 years ago

Hi, I'm closing this in favor of https://github.com/ropensci/skimr/pull/603

A few notes. I think one difference between what we needed in this vignette then and what we need now is that sf is now handled as a data frame when it first starts being skimmed. Therefore we no longer need something for handling a non-data.frame object, and we are not coercing. What that means is that this vignette is really moving toward being "Using skimr in a Package". It's not quite there yet, but I think what I have is better.

I still don't know if there is the possibility of different functions for different geometries, but I'll ignore that for now and just use the general sfc.

Thanks @kylebutts, we would not have gotten this done without you!

elinw commented 4 years ago

@kylebutts If there were a basic set of functions for an sfc sfl, what would they be? Also, could you answer the question about whether there are potentially different functions for different sfc types?

kylebutts commented 4 years ago

I think the two main functions are sf::st_crs(), which gets the current projection of the sfc column, and sf::st_is_valid(), which flags each row of the sfc column as a valid geometry or not (so summing it tells you how many are valid).

I was thinking you may want sf::st_area() or sf::st_length() for polygons and lines respectively. These would depend on the sfc type, but they could be a bit time consuming and are probably not something you are curious about when you skim().
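
For illustration, the two general-purpose functions suggested above could be combined into a single sfl registered under "sfc". This is a hedged sketch, not the vignette's final code: the names `skim_sf`, `crs_epsg`, and `n_valid` are made up here, it assumes the skimr and sf packages are installed, and it uses the nc shapefile that ships with sf. The sf class is dropped with as.data.frame() as a conservative choice, so that the geometry is treated as an ordinary column.

```r
library(skimr)
library(sf)

# Example data shipped with sf: North Carolina counties.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Assumption: an sfl registered under "sfc" is picked up for any geometry
# column, since "sfc" is part of every such column's class vector.
skim_sf <- skim_with(
  sfc = sfl(
    # EPSG code of the column's CRS (may be NA for a custom projection).
    crs_epsg = ~ sf::st_crs(.)$epsg,
    # Number of rows whose geometry is valid.
    n_valid  = ~ sum(sf::st_is_valid(.))
  )
)

# Drop the sf class but keep the geometry column, then skim.
skimmed <- skim_sf(as.data.frame(nc))
skimmed
```

Type-specific summaries such as sf::st_area() are left out here for the reason given above: they can be slow and are rarely what a quick skim() is for.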