pveber / morse

Companion R package for MOSAIC website
7 stars 5 forks source link

Accomodate new value for default.stringAsFactors() #286

Closed pveber closed 3 years ago

pveber commented 3 years ago

@virgile-baudrot @sandrinecharles

I love R, I just do. Compare:

$R
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
> default.stringsAsFactors()
[1] TRUE

and:

$R
R version 4.0.2 (2020-06-22) -- "Taking Off Again"
> default.stringsAsFactors()
[1] FALSE

This change revealed a bug in survData. While the documentation says:

            • ‘replicate’: a vector of class ‘integer’ or ‘factor’ for
              replicate identification. A given replicate value should
              identify the same group of individuals followed in time

it outputs a replicate column of class character when the corresponding column in the input dataframe is of this class. A conversion should be performed here, because I'm not sure how much the rest of the code relies on this field to be integer (that's a situation where static typing would have helped!). As of now I only got failure in mosaic unit tests but this is likely to affect the whole application.

The bug came out because the replicate column is often encoded with string values, and read.table used to encode those as a factor by default. This is not the case anymore, and this make the type check (and conversion) even more important. Correcting this and making the corresponding release would help a lot with mosaic. The needed change is I think very trivial but since I've not been handling the releases for a long time, I think it's better to let you handle it. Would it be ok?

sandrinecharles commented 3 years ago

What about R version 4.0.3, the current one? Anyway, I think Virgile can reasonably take care about this, right?

virgile-baudrot commented 3 years ago

The change appears with version 4.0.0 of R: see https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/

For the morse side, it seems to don't have any problem. The data we store have replicate even as numeric

>   data("cadmium1") ; class(cadmium1$replicate)
[1] "numeric"
>   data("cadmium2") ; class(cadmium2$replicate)
[1] "numeric"
>   data("chlordan") ; class(chlordan$replicate)
[1] "numeric"
>   data("copper") ; class(copper$replicate)
[1] "numeric"
>   data("dichromate") ; class(dichromate$replicate)
[1] "numeric"
>   data("zinc") ; class(zinc$replicate)
[1] "numeric"
>   data("propiconazole") ; class(propiconazole$replicate)
[1] "factor"
>   data("propiconazole_pulse_exposure") ; class(propiconazole_pulse_exposure$replicate)
[1] "factor"

And, it's the same after survData. Since we have to use survData, maybe we can ensure that all survDataobject have replicate of class character?

Now:

>   class(survData(cadmium1)$replicate)
[1] "numeric"
>   class(survData(propiconazole)$replicate)
[1] "factor"
virgile-baudrot commented 3 years ago

As of now I only got failure in mosaic unit tests

@pveber Tu pourrais me mettre le lien des tests unitaires que tu fais sur mosaic ?

pveber commented 3 years ago

@virgile-baudrot Thanks for having a look at it!

Here's a simple example that shows the problem:

> data(cadmium1)
> cadmium1$replicate <- as.character(cadmium1$replicate)
> class(cadmium1$replicate)
[1] "character"
> class(survData(cadmium1)$replicate)
[1] "character"

In my original post, I mentionned that the same issue can be obtained when reading the data with read.table, which I do in my unit test.

virgile-baudrot commented 3 years ago

ok. But what should be the class of "replicate"? Because, any class should work since it's used to isolate/filter a group

virgile-baudrot commented 3 years ago

Ok. I think I get your point. Since replicate is only a label, the type can be numeric or character... even logical if two class but it's weird! And we say integer or factor.

R is crazy 😄 :

> a = c("1", "2", 3)
> typeof(a)
[1] "character"
> b = c(1, 2, 3)
> typeof(b)
[1] "double"
> a == b
[1] TRUE TRUE TRUE
> class(a) == class(b)
[1] FALSE
> class(a[1]) == class(b[1])
[1] FALSE
> class(c("a", "b"))
[1] "character"
> typeof(c("a", "b"))
[1] "character"
> class(as.factor(c("a", "b")))
[1] "factor"
> typeof(as.factor(c("a", "b")))
[1] "integer"
> typeof(as.factor(c("a", "b")))
[1] "integer"
> as.factor(c("a", "b")) + 1
[1] NA NA
Warning message:
In Ops.factor(as.factor(c("a", "b")), 1) : ‘+’ not meaningful for factors
> c("a", "b") + 1
Error in c("a", "b") + 1 : 
  argument non numérique pour un opérateur binaire
sandrinecharles commented 3 years ago

I would say that column replicate should be filled in by any type of input left to the choice the user himself. That is letters, numbers, of combination of both...

pveber commented 3 years ago

And I thought I had seen it all :) I think first and foremost, whatever the function does, the doc should describe it correctly. Now should the doc be changed, or the behavior of the function? My point is that since the doc says replicate is an int or a factor, that fact might be used in other places of the code in the library. Making sure that allowing character values won't hurt somewhere else (including in mosaic) is a lot of (tedious) work. It'd be simpler I think to allow character values for replicate in the argument of survData but to convert them to factors in the value it produces. So that we have freedom for users and reliability for downstream code.

virgile-baudrot commented 3 years ago

OK. I'm doing that!

virgile-baudrot commented 3 years ago

should solve the issue

@pveber I let you close the issue if it's ok for you

zapashcanon commented 3 years ago

I confirm the tests are OK now for Mosaic (I'm the one who reported the issue to @pveber).

EDIT: do you plan to make a new release on CRAN ?

virgile-baudrot commented 3 years ago

EDIT: do you plan to make a new release on CRAN ?

Yes, I do

pveber commented 3 years ago

Very good, thank you all for taking care of this!