ropensci / git2rdata

An R package for storing and retrieving data.frames in git repositories.
GNU General Public License v3.0

more liberal approach when writing factors #13

Closed ThierryO closed 5 years ago

ThierryO commented 5 years ago

The current policy is that any change in factor levels invalidates the current metadata, which forces us to overwrite the metadata when storing the new version of the dataframe. In some cases we can safely update the metadata while keeping the diffs minimal.

We could store both the index numbers and their labels in the metadata, including their order. factor(levels = c("a", "c")) is now stored as

 class: factor
     - "a"
     - "c"

Writing a new version with factor(levels = c("a", "b", "c")) currently changes the metadata to the example below. In combination with optimize = TRUE, the index of 'c' changes from 2 to 3, resulting in a potentially large diff in the data.

 class: factor
     - "a"
     - "b"
     - "c"
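
The shift can be reproduced in base R. This is a minimal sketch, assuming that optimize = TRUE stores the integer codes of the factor, as the description above implies; the example vectors are made up for illustration.

```r
# Illustrative data, not taken from the issue.
old <- factor(c("a", "c", "c"), levels = c("a", "c"))
new <- factor(c("a", "c", "c"), levels = c("a", "b", "c"))

# The integer code of a value is the position of its level.
old_codes <- as.integer(old)  # "c" is level 2
new_codes <- as.integer(new)  # "c" is now level 3

# Every row holding "c" changes, although the values themselves did not.
changed <- old_codes != new_codes
```

Only the rows holding "c" differ, but in a large dataframe that can be most of the file.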

The alternative would be to store the metadata as

class: factor
    1. "a"
    2. "c"
order: 1, 2

Adding the factor level will update the metadata to the example below, leaving the index number for level "c" unchanged.

class: factor
    1. "a"
    2. "c"
    3. "b"
order: 1, 3, 2

In case a factor level is dropped, we drop it from the order and remove the level from the metadata.
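
The bookkeeping described above could look roughly like this in base R. It is only a sketch: update_index() is a hypothetical helper, not a git2rdata function, and it assumes the metadata is available as a named integer vector (names are the labels, values are the stored indices).

```r
# Hypothetical helper, not part of git2rdata.
# old:        named integer vector, names = labels, values = stored indices
# new_levels: levels of the new factor, in the desired order
update_index <- function(old, new_levels) {
  index <- old[new_levels]          # reuse the index of known labels, NA if new
  names(index) <- new_levels
  added <- is.na(index)
  start <- if (length(old) > 0) max(old) else 0L
  index[added] <- start + seq_len(sum(added))  # new labels get fresh indices
  index                             # dropped labels are simply not carried over
}

old <- c(a = 1L, c = 2L)
with_b <- update_index(old, c("a", "b", "c"))  # adding "b"
only_c <- update_index(old, "c")               # dropping "a"
```

Here with_b keeps index 2 for "c" and gives "b" the new index 3, which matches "order: 1, 3, 2". Note that in this sketch an index freed by a dropped level can be handed out again later if it was the highest one.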

Any thoughts on this, @stijnvanhoey and @florisvdh?

florisvdh commented 5 years ago

The level index that you refer to will thus be the number that is used in the optimized data frame format. Your main point is to keep the index for each specific level constant. Once a level is no longer defined for the factor, it can be removed from both levels and order.

Such a dedicated numbering system would indeed solve the caveats that I met before, and remove the need to work around them by reordering the factor levels in the dataframe itself.

An alternative could be to combine both sources of information as follows (index and order):

class: factor
    1. "a"
    3. "b"
    2. "c"

This is less self-explanatory (and therefore less user friendly), but it is more concise and still gives smaller metadata diffs. It should be explained in the documentation anyway. To be weighed further against coding effort and computing efficiency, I guess.

stijnvanhoey commented 5 years ago

As factors do have an order and if you want to keep the enum approach for optimize = TRUE, you will have to provide some order representation and bookkeeping.

Maybe, the following representation is a bit more intuitive to me:

class: factor
    - "a" : 1
    - "c" : 2
    - "b" : 3
order: 1, 3, 2

or, taking Floris's comment into account:

class: factor
    - "a" : 1
    - "b" : 3
    - "c" : 2

florisvdh commented 5 years ago

The level representation of @stijnvanhoey (x : 1) is more intuitive to me as well.

ThierryO commented 5 years ago

I'll go for a slightly different syntax. More verbose but much easier to read and write using the yaml package. Multiline level labels are also possible with this syntax.

labels:
- a
- b
- c
index:
- 1
- 2
- 3

florisvdh commented 5 years ago

Don't you mean levels rather than labels? Does the package cope with factor labels that are different from the factor levels?

Further, does the order of both labels and indexes reflect the order of the factor levels? The order will have to be maintained both in the label and index lists.

ThierryO commented 5 years ago

A factor level is defined by its index and its label. The first level in the example below has label 'd' and index 4.

And yes, the order of the metadata matters. It defines the order of the levels (and also the order of the variables).

An updated version of the levels might look like this

labels:
- d
- c
- a
index:
- 4
- 3
- 1
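
To illustrate how such metadata could be read back, here is a base R sketch using the labels d, c, a with indices 4, 3, 1 from the example above. The stored codes are made up, and this is not necessarily how the package does it internally.

```r
# Metadata from the example: labels and indices, both in level order.
labels <- c("d", "c", "a")
index  <- c(4L, 3L, 1L)

# Stored integer codes (illustrative values, not from the issue).
stored <- c(1L, 4L, 4L, 3L)

# Look up the label of each stored index; the level order follows the metadata.
x <- factor(labels[match(stored, index)], levels = labels)
```

The resulting factor has the values a, d, d, c and the levels d, c, a, so the stored index and the position of a level in the factor are independent of each other.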

ThierryO commented 5 years ago

@florisvdh, can you review these changes? write_vc() reuses factor indices with stable labels. relabel() can change the labels without changing the index.
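
The idea of changing labels while leaving the index alone can be sketched in base R as follows. relabel_meta() is a hypothetical stand-in written for this illustration; it is not the package's relabel() and makes no claim about its arguments.

```r
# Hypothetical helper, not the package's relabel().
# change: named character vector, names = old labels, values = new labels
relabel_meta <- function(labels, index, change) {
  hit <- match(names(change), labels)
  stopifnot(!anyNA(hit))                # every old label must exist
  labels[hit] <- unname(change)
  list(labels = labels, index = index)  # the indices are returned untouched
}

meta <- relabel_meta(c("d", "c", "a"), c(4L, 3L, 1L), c(a = "alpha"))
```

Because only the label list changes, the stored data file stays identical and the diff is limited to the metadata.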

florisvdh commented 5 years ago

I have success with relabel(). However, the root default is not yet functional.

I have no success with write_vc() though. Perhaps I did not choose the targeted use cases, but anyway a notebook is attached.

florisvdh commented 5 years ago

Works conveniently for factors. Congratulations!