Closed SugarRayLua closed 3 months ago
Thanks for spotting this; I hadn't considered factors when I wrote this function!
I can recreate the bug as follows:
# Demo dataset with a factor column
df <- df_fct <- data.frame(
group = as.factor(c("x", "x", "y", "z")),
value = runif(4)
)
# tide() errors out
df2 <- tide::tide(df, FALSE) # change a 'group' value
# Error in Ops.factor(left, right) : level sets of factors are different
# In addition: Warning message:
# In edit.data.frame(df) : added factor levels in 'group'
# edit() works as intended
df3 <- edit(df) # change a 'group' value
# Warning message:
# In edit.data.frame(df) : added factor levels in 'group'
Reason: there's a step inside the tide() function that compares the original and edited dataframes with ==. If one of the dataframes has a factor column that has been edited to add a new factor level, then the comparison will error (you can't compare factors that have different level sets).
df_edited <- utils::edit(df) # change a 'group' value
changed <- df == df_edited
# Error in Ops.factor(left, right) : level sets of factors are different
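The same failure can be reproduced without opening the interactive editor. This is a minimal sketch, assuming the edit simply introduces a label ("a") that is not among the original levels; the object names are illustrative and not taken from the tide() source.

```r
# Simulate the edit: the "edited" copy gains a level the original lacks.
original <- factor(c("x", "x", "y", "z"))
edited   <- factor(c("a", "x", "y", "z"))

# Comparing factors whose level sets differ raises an error,
# which is captured here as a message instead of stopping the script.
result <- tryCatch(
  original == edited,
  error = function(e) conditionMessage(e)
)
print(result)

# Comparing the underlying labels as character strings works.
labels_equal <- as.character(original) == as.character(edited)
print(labels_equal)
```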
So, I should update that bit of code to handle factor comparisons correctly, perhaps by using setdiff() instead of ==.
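One possible shape for that comparison is sketched below. It is not the fix that went into tide(): instead of setdiff(), it converts factor columns to character before comparing, and it treats NA cells explicitly. The helper name find_changes() is hypothetical, and the sketch assumes both dataframes have the same dimensions and column order.

```r
# Hypothetical helper: returns a logical matrix marking changed cells.
# Assumes 'before' and 'after' have identical dimensions and column order.
find_changes <- function(before, after) {
  stopifnot(identical(dim(before), dim(after)))
  changed <- mapply(
    function(old, new) {
      # Compare factors by their labels, not by their level sets.
      if (is.factor(old)) old <- as.character(old)
      if (is.factor(new)) new <- as.character(new)
      differs <- old != new
      # A comparison involving NA yields NA. Treat NA -> NA as unchanged
      # and NA <-> value as changed.
      unknown <- is.na(differs)
      differs[unknown] <- (is.na(old) != is.na(new))[unknown]
      differs
    },
    before, after,
    SIMPLIFY = FALSE
  )
  out <- do.call(cbind, changed)
  dimnames(out) <- list(rownames(before), names(before))
  out
}

# Usage: one factor cell and one numeric cell are edited.
before <- data.frame(
  group = factor(c("x", "x", "y", "z")),
  value = c(1, 2, NA, 4)
)
after <- before
after$group <- factor(c("a", "x", "y", "z"))  # new level "a" in row 1
after$value[2] <- 20

changes <- find_changes(before, after)
print(which(changes, arr.ind = TRUE))
```

Converting to character sidesteps the level-set check entirely, at the cost of ignoring edits that only reorder or add unused levels.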
There will also be a change needed to return useful reproducible code back to the user. The output can no longer be like df[1, 1] <- "a", since "a" is an invalid factor level (it's not included as a level in the original column of the dataframe). So the returned code in this case would have to replace the old value with the new, but also add a level to the factor.
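As a rough illustration of what that returned code might need to look like (an assumption about the output format, not what tide() actually emits), the sketch below shows the one-line assignment failing quietly and a two-step version that extends the levels first.

```r
df <- data.frame(
  group = factor(c("x", "x", "y", "z")),
  value = c(1, 2, 3, 4)
)

# Assigning a label that is not an existing level gives NA plus a warning.
df_bad <- df
suppressWarnings(df_bad[1, 1] <- "a")
print(is.na(df_bad[1, 1]))

# Two-step version: add the level, then assign the value.
levels(df$group) <- c(levels(df$group), "a")
df[1, 1] <- "a"
print(df)
print(levels(df$group))
```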
I'll try and take a look at this if I find some time!
Thanks for the response and looking into it!
I recently submitted a question on the concept of how to programmatically track sporadic changes one makes to a dataframe:
As a novice R programmer/user/analyst, I found it hard to understand why R wouldn't already have such a function or package, besides yours, to do this. I can't imagine that other analysts don't get sent source data that hasn't been fully proofread for typos or other sporadic inconsistencies, and so need to clean up the data when there isn't a "systemic" error that established base R/tidyverse functions could fix.
Besides the factor example above, while tide() worked great for the small data frames I tested it on, it didn't do so well on the dataframe I'm currently cleaning up (approx. 230 x 230). tide() correctly recorded the first few edits I made, but then returned code saying that I had entered columns and columns of "NA", which I had not, and it did not record any of the further legitimate entries I made in the editor. Unfortunately, I can't share the actual dataset, but I'd be happy to do more testing of tide() in the future to help with its development.
For now, I'm manually logging each sporadic correction I make to the dataset in an R script (e.g. df["id1", "age"] = 30, df["id20", "weight"] = 70, etc.) :-(
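A slightly more compact way to keep such a log is a small table of corrections that gets applied in a loop. This is only a sketch with made-up data: the row names, column names and values are illustrative, and it assumes every corrected value is numeric.

```r
# Made-up data with row names used as IDs.
df <- data.frame(
  age = c(3, 25),
  weight = c(65, 700),
  row.names = c("id1", "id20")
)

# One row per correction: which cell to change and the new value.
corrections <- data.frame(
  row = c("id1", "id20"),
  column = c("age", "weight"),
  value = c(30, 70),
  stringsAsFactors = FALSE
)

# Apply each logged correction in order.
for (i in seq_len(nrow(corrections))) {
  df[corrections$row[i], corrections$column[i]] <- corrections$value[i]
}
print(df)
```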
Have a good rest of your week. :-)
Thanks again for your input. I've added what is hopefully a fix for the factor problem and added issue #9 for NA handling, which I may be able to look at another time. The update isn't well tested; let me know if it works as expected. I've added an example with factors to the README file.
I agree broadly with responses to your StackOverflow post: I would typically try and write a function to take my original dataset and reproducibly create the edits that need to be made. But I appreciate that this is easier when the edits are systematic. I used to work with transcribed ecological data where obvious human errors (usually mine!) needed correcting, but these were often sporadic and only a few lines of cleanup were needed.
Thanks a lot, @matt-dray!
I'll aim to test your update out over the next two weeks and give you feedback.
Perhaps as I get more experienced, I'll better understand how to write functions that could capture the edits I'm concerned about.
Have a good weekend 😊
tide() is a great function! I've been looking all over the web for how to programmatically track changes I make to individual values in a dataset, thanks!
I did notice, though, that tide() didn't seem to work if my dataset contained factors. When I was done editing such a dataset, I got this error:
Error in Ops.factor(left, right) : level sets of factors are different
However, when I edited the same dataset with edit(df), I didn't get that error.
Fyi :-)