Closed IndrajeetPatil closed 4 years ago
Hi Indrajeet!
That column was intentionally removed, as it duplicated information that is provided in the summary above.
I'm adding some functions now for getting this attribute.
# How many rows are in the original data frame?
skim(iris) %>% data_rows()
Best wishes, Michael
Of course you can also always add it as a custom function if you want it in the columns.
I searched the bug reports and saw this. I think the 'n' column is still very useful when using skim() with group_by(). There are cases where one wants to summarize a response variable across the levels of a categorical variable. data_rows() is not helpful in this case because it only returns the data set's number of rows, not the number of rows in each level.
I agree it can be useful, but that's why skimr is so flexible in letting you add more statistics (or remove the ones you don't want). We will never have 100% agreement with all users about what should or shouldn't be in the default list. That inflexibility is what makes summary() frustrating, and it is why we made skimr easy to modify.
Thank you for your reply. I spent some time looking for a suitable function to define in skim_with() for this purpose until I realized that the intended input of the function is a vector (a numeric vector in the case of numeric variables), so length() worked. It would help a great deal if the documentation of skim_with() clarified the expected input type.
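To make the point above concrete, here is a minimal sketch in base R of why length() gives per-group counts. The name n_obs is hypothetical, and the tapply() call only mimics what a grouped skim presumably does when it hands each group's slice of a column to the function.

```r
# A skimmer is an ordinary function of one vector that returns one value.
# `n_obs` is a made-up name for illustration.
n_obs <- function(x) length(x)

# Called on a whole column, it returns the total number of rows.
total_n <- n_obs(iris$Sepal.Length)
total_n

# Called on each group's slice of the column, it returns per-group counts.
per_group <- tapply(iris$Sepal.Length, iris$Species, n_obs)
per_group
```

Any function with this shape (one vector in, one value out) should be usable inside sfl() in the same way.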
Hi!
You're looking for base_skimmer behavior: https://github.com/ropensci/skimr/blob/93aa2dcb0b47559fc8fb0d3575c71609d3ca763b/R/skim_with.R#L64
I'll provide another example in the docs.
Wouldn't the n() or count() function work?
It doesn't work with skim_with(), though. I can call it separately and then join it with the result of skim(), but it's more convenient to get both in one call.
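The separate-call-and-join workaround mentioned above could look roughly like the following. This is only a sketch: it assumes dplyr is available, and the names group_n, skimmed, and result are made up for illustration. The skim result is converted to a plain tibble first so that the joined table behaves like an ordinary data frame.

```r
library(dplyr)
library(skimr)

# Per-group row counts, computed outside of skim().
group_n <- count(iris, Species, name = "n")

# Grouped skim, converted to a plain tibble before joining.
skimmed <- iris %>%
  group_by(Species) %>%
  skim() %>%
  as_tibble()

# One row per variable and group, with the group size attached as `n`.
result <- left_join(skimmed, group_n, by = "Species")
result
```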
Can you share the code you used to create the new skimmer?
my_skim <- skim_with(base = sfl(n = length))
Hi @vinhtantran!
I'm having trouble reproducing the issue. Can you provide a reprex?
> my_skim <- skim_with(base = sfl(n = length))
> my_skim(iris)
── Data Summary ────────────────────────
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────
# A tibble: 1 x 5
skim_variable n ordered n_unique top_counts
* <chr> <int> <lgl> <int> <chr>
1 Species 150 FALSE 3 set: 50, ver: 50, vir: 50
── Variable type: numeric ──────────────────────────────────────────────────────
# A tibble: 4 x 10
skim_variable n mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length 150 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 150 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 150 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
As I commented above, the sample size is useful when skim() is used on grouped data. The following code is how I would use the my_skim() defined above.
> iris %>% group_by(Species) %>% my_skim()
-- Data Summary ------------------------
Values
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 4
________________________
Group variables Species
-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 12 x 10
skim_variable Species n mean sd p0 p25 p50 p75 p100
* <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sepal.Length setosa 50 5.01 0.352 4.3 4.8 5 5.2 5.8
2 Sepal.Length versicolor 50 5.94 0.516 4.9 5.6 5.9 6.3 7
3 Sepal.Length virginica 50 6.59 0.636 4.9 6.22 6.5 6.9 7.9
4 Sepal.Width setosa 50 3.43 0.379 2.3 3.2 3.4 3.68 4.4
5 Sepal.Width versicolor 50 2.77 0.314 2 2.52 2.8 3 3.4
6 Sepal.Width virginica 50 2.97 0.322 2.2 2.8 3 3.18 3.8
7 Petal.Length setosa 50 1.46 0.174 1 1.4 1.5 1.58 1.9
8 Petal.Length versicolor 50 4.26 0.470 3 4 4.35 4.6 5.1
9 Petal.Length virginica 50 5.55 0.552 4.5 5.1 5.55 5.88 6.9
10 Petal.Width setosa 50 0.246 0.105 0.1 0.2 0.2 0.3 0.6
11 Petal.Width versicolor 50 1.33 0.198 1 1.2 1.3 1.5 1.8
12 Petal.Width virginica 50 2.03 0.275 1.4 1.8 2 2.3 2.5
Or, if the response of interest is Sepal.Length, the following code gives an idea of what Sepal.Length looks like in each Species.
> iris %>% group_by(Species) %>% my_skim(Sepal.Length)
-- Data Summary ------------------------
Values
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables Species
-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 3 x 10
skim_variable Species n mean sd p0 p25 p50 p75 p100
* <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sepal.Length setosa 50 5.01 0.352 4.3 4.8 5 5.2 5.8
2 Sepal.Length versicolor 50 5.94 0.516 4.9 5.6 5.9 6.3 7
3 Sepal.Length virginica 50 6.59 0.636 4.9 6.22 6.5 6.9 7.9
No, I was wrong, but right now
my_skim <- skim_with(base = sfl(n = length))
iris %>% group_by(Species) %>% my_skim()
seems to work. But that applies to all variables, not just numeric ones. If you want it only for numeric variables, you'd change my_skim.
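A numeric-only variant might look like the sketch below. It passes the function under the numeric argument of skim_with() instead of base. The name my_skim_numeric is hypothetical, and the expectation that the underlying column is called numeric.n is an assumption based on the column name used elsewhere in this thread.

```r
library(dplyr)
library(skimr)

# Add `n` to numeric columns only, instead of to every column type via `base`.
my_skim_numeric <- skim_with(numeric = sfl(n = length))

grouped_summary <- iris %>%
  group_by(Species) %>%
  my_skim_numeric()

grouped_summary
```

With this version, non-numeric columns such as factors keep their default statistics and do not get the extra count.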
I polled some skimr users on Twitter, and the results suggest that we shouldn't switch the defaults.
https://twitter.com/michaelquinn32/status/1230317868811612161?s=20
Otherwise, the previously listed solutions should give you a my_skim() function that gets the results you want. Thanks for all of the feedback!
I am updating my package (https://github.com/IndrajeetPatil/groupedstats/issues/16) to work with skimr v2, but I realized that one of the key columns (for my purposes) has been removed: the number of data points (n). Is there any particular reason why skimr::skim doesn't contain numeric.n?
n column present)
numeric.n column absent)