ThierryO closed this issue 5 years ago
The level index you refer to will thus be the number used in the optimized data frame format. Your main point is to keep the index of each specific level constant. Once a level is no longer defined for the factor, it can be removed from both levels and order.
Such a dedicated numbering system would indeed solve the caveats I ran into before, and it removes the need to work around them by reordering the factor levels in the data frame itself.
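To make the concern concrete, here is a minimal base R sketch (not git2rdata code) of how position-based level codes shift when a level is inserted, and how a separate, stable index avoids that. The `stable_index` vector is a hypothetical illustration of the proposal, not the package's internal format.

```r
# Plain R factors number their levels by position, so inserting a level
# shifts the integer code of every level that comes after it.
old <- factor(c("a", "c", "c"), levels = c("a", "c"))
new <- factor(c("a", "c", "c"), levels = c("a", "b", "c"))

old_codes <- as.integer(old)  # "c" is stored as 2
new_codes <- as.integer(new)  # "c" is now stored as 3

# Hypothetical stable index: "c" keeps 2, the new level "b" gets the next
# free number, and the order of the names still gives the level order.
stable_index <- c(a = 1L, b = 3L, c = 2L)
stable_codes <- unname(stable_index[as.character(new)])
```

With the stable index the stored codes for the unchanged data are identical before and after adding "b", so only the metadata changes in the diff.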
An alternative could be to combine both sources of information as follows (index and order):
class: factor
levels:
1. "a"
3. "b"
2. "c"
This is less self-explanatory (and therefore less user-friendly), but it is more concise and still yields smaller metadata diffs. It should be explained in the documentation anyway. This needs to be weighed further against coding effort and computing efficiency, I guess.
As factors do have an order, and if you want to keep the enum approach for optimize = TRUE, you will have to provide some order representation and bookkeeping.
Maybe the following representation is a bit more intuitive, at least to me:
class: factor
levels:
- "a" : 1
- "c" : 2
- "b" : 3
order: 1, 3, 2
or, taking Floris's comment into account:
class: factor
levels:
- "a" : 1
- "b" : 3
- "c" : 2
The level representation of @stijnvanhoey (x : 1) is more intuitive to me as well.
I'll go for a slightly different syntax. More verbose but much easier to read and write using the yaml package. Multiline level labels are also possible with this syntax.
labels:
- a
- b
- c
index:
- 1
- 2
- 3
Don't you mean levels rather than labels? Does the package cope with factor labels that are different from the factor levels?
Further, does the order of both labels and indexes reflect the order of the factor levels? The order will have to be maintained both in the label and index lists.
A factor level is defined by its index and its label. The first level in the example below has label 'd' and index 4.
And yes, the order of the metadata matters. It defines the order of the levels (and also the order of the variables).
An updated version of the levels might look like this:
labels:
- d
- c
- a
index:
- 4
- 3
- 1
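As a rough illustration of how such metadata could be used to rebuild a factor from the stored integers, here is a base R sketch. The `meta` list and the `stored` vector are made-up examples that mirror the labels/index layout above; this is not how git2rdata necessarily implements it.

```r
# Metadata as in the example above: order of the entries = order of the levels.
meta <- list(labels = c("d", "c", "a"), index = c(4L, 3L, 1L))

# Hypothetical stored column: the integers written to the data file.
stored <- c(4L, 1L, 3L, 4L)

# factor() maps each stored index to its label and keeps the metadata order.
decoded <- factor(stored, levels = meta$index, labels = meta$labels)

levels(decoded)        # level order follows the metadata
as.character(decoded)  # values recovered through the index
```

The stored integers never have to match the position of a level; only the pairing of index and label matters.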
@florisvdh, can you review these changes? write_vc() reuses factor indices with stable labels. relabel() can change the labels without changing the index.
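The idea of relabelling without touching the index can be sketched on the metadata alone. The helper `relabel_meta()` below is hypothetical and written only for illustration; it is not the signature or implementation of git2rdata's relabel().

```r
# Hypothetical helper: rename labels in the metadata, leave the index as-is.
# `change` is a named character vector: names are old labels, values are new.
relabel_meta <- function(meta, change) {
  pos <- match(names(change), meta$labels)
  if (anyNA(pos)) {
    stop("unknown label(s): ", paste(names(change)[is.na(pos)], collapse = ", "))
  }
  meta$labels[pos] <- unname(change)
  meta
}

meta <- list(labels = c("d", "c", "a"), index = c(4L, 3L, 1L))
updated <- relabel_meta(meta, c(c = "gamma"))
```

Because the index is unchanged, the stored data stay byte-for-byte identical and only one metadata line differs.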
I have success with relabel(). However, the root default is not yet functional.
I have no success with write_vc() though. Perhaps I did not choose the targeted use cases, but anyway a notebook is attached.
factors.nb.html.zip
Works conveniently for factors. Congratulations! factors.nb.html2.zip
The current policy is that any change in factor levels invalidates the current metadata, which requires overwriting the metadata when storing the new version of the data frame. In some cases we can safely update the metadata while keeping the diffs minimal.
We could store both the index numbers and their labels in the metadata, including their order.
factor(levels = c("a", "c")) is now stored as

Writing a new version with factor(levels = c("a", "b", "c")) currently changes the metadata to the example below. In combination with optimize = TRUE, the index of 'c' changes from 2 to 3, resulting in a potentially large diff in the data.

The alternative would be to store the metadata as
Adding the factor level will update the metadata to the example below, leaving the index number for level "c" unchanged.
In case a factor level is dropped, we drop it from the order and remove the level from the metadata.

Any thoughts on this, @stijnvanhoey and @florisvdh?
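The add and drop rules described above could be sketched as follows in base R. `update_levels()` is a hypothetical helper, and the labels/index list layout is an assumption made for this illustration. Levels that survive keep their index, new levels get numbers after the highest index currently in use, and dropped levels disappear from the metadata.

```r
# Hypothetical helper: derive new metadata from old metadata and the new
# level order, keeping the index of every level that still exists.
update_levels <- function(meta, new_labels) {
  keep <- meta$labels %in% new_labels
  kept_index <- meta$index[keep]
  names(kept_index) <- meta$labels[keep]

  # Number the added levels after the highest index currently in use.
  added <- setdiff(new_labels, meta$labels)
  start <- if (length(meta$index) > 0) max(meta$index) else 0L
  added_index <- start + seq_along(added)
  names(added_index) <- added

  all_index <- c(kept_index, added_index)
  list(labels = new_labels, index = unname(all_index[new_labels]))
}

meta <- list(labels = c("a", "c"), index = c(1L, 2L))
with_b <- update_levels(meta, c("a", "b", "c"))  # "c" keeps index 2
without_c <- update_levels(with_b, c("a", "b"))  # "c" is removed
```

Note that this sketch only looks at indices still in use, so the number of a dropped level could be handed out again later. Whether that is acceptable is a design choice left open here.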