matt-dray / tide

:ocean::pencil: R package: edit a data.frame in a spreadsheet-like editor, get code to reproduce it
https://www.rostrum.blog/2022/04/27/tide/

Great function! Tide though doesn't seem to be able to handle changing factors like edit.frame does in base R #8

Closed: SugarRayLua closed this 3 months ago

SugarRayLua commented 3 months ago

tide() is a great function! I've been looking all over the web for how to programmatically track changes I make to individual values in a dataset, thanks!

I did notice, though, that tide() didn't seem to work when my dataset contained factors. When I finished editing such a dataset, I got this error:

Error in Ops.factor(left,right): level sets of factors are different

However, when I edit the same dataset with edit(df), I don't get that error.

Fyi :-)

matt-dray commented 3 months ago

Thanks for spotting this; I hadn't considered factors when I wrote this function!

I can recreate the bug as follows:

# Demo dataset with a factor column
df <- data.frame(
  group = as.factor(c("x", "x", "y", "z")),
  value = runif(4)
)

# tide() errors out
df2 <- tide::tide(df, FALSE)  # change a 'group' value in the editor
# Error in Ops.factor(left, right) : level sets of factors are different
# In addition: Warning message:
#   In edit.data.frame(df) : added factor levels in 'group'

# edit() works as intended
df3 <- edit(df)  # change a 'group' value in the editor
# Warning message:
#   In edit.data.frame(df) : added factor levels in 'group'

Reason: a step inside tide() compares the original and edited dataframes with ==. If a factor column has been edited so that it gains a new level, the comparison errors, because == can't compare factors whose level sets differ.

df_edited <- utils::edit(df)  # change a 'group' value
changed <- df == df_edited
# Error in Ops.factor(left, right) : level sets of factors are different

So, I should update that bit of code to handle factor comparisons correctly, perhaps by using setdiff() instead of ==.
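One way to sidestep the level-set check, sketched here as an assumption rather than the fix that actually went into tide, is to coerce factor columns to character before comparing. find_changed_cells() is a hypothetical helper, and the "edited" dataframe is built by hand to stand in for the result of utils::edit().

```r
# Hypothetical helper: cell-wise comparison that tolerates new factor levels
find_changed_cells <- function(original, edited) {
  # Convert factor columns to character so the comparison uses labels,
  # not level sets
  as_chr <- function(d) {
    d[] <- lapply(d, function(col) if (is.factor(col)) as.character(col) else col)
    d
  }
  changed <- as_chr(original) != as_chr(edited)
  changed[is.na(changed)] <- FALSE  # cells involving NA are ignored in this sketch
  which(changed, arr.ind = TRUE)
}

df <- data.frame(
  group = as.factor(c("x", "x", "y", "z")),
  value = c(0.1, 0.2, 0.3, 0.4)
)

# Stand-in for utils::edit(df): row 1 of 'group' changed to a new level "a"
df_edited <- df
df_edited$group <- factor(c("a", "x", "y", "z"))

cells <- find_changed_cells(df, df_edited)
cells  # matrix of row/column indices for the changed cells
```

Because cells involving NA are skipped, this sketch would miss an edit that replaces NA with a value (or the reverse); a fuller version would need to handle that case explicitly.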

There will also be a change needed to return useful reproducible code back to the user. The output can no longer be like df[1, 1] <- "a", since "a" is an invalid factor level (it's not included as a level in the original column of the dataframe). So the returned code in this case would have to replace the old value with the new, but also add a level to the factor.
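As a rough illustration of what such returned code might look like for the example above: this is plain base R and not necessarily the exact output tide produces.

```r
df <- data.frame(
  group = as.factor(c("x", "x", "y", "z")),
  value = c(0.1, 0.2, 0.3, 0.4)
)

# Add the new level first, then assign the new value
levels(df$group) <- c(levels(df$group), "a")
df[1, "group"] <- "a"

df$group
```

Without the levels() step, the assignment would produce NA with an "invalid factor level" warning instead of the new value.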

I'll try and take a look at this if I find some time!

SugarRayLua commented 3 months ago

Thanks for the response and looking into it!

I recently submitted a question on the concept of how to programmatically track sporadic changes one makes to a dataframe:

https://stackoverflow.com/questions/78722892/most-efficient-way-to-reproducibly-and-programmatically-change-sporadic-cell-val?noredirect=1#comment138796481_78722892

As a novice R programmer/user/analyst, I found it hard to understand why R wouldn't already have such a function or package besides yours to do this. I can't imagine that other analysts don't get sent source data that hasn't been fully proofread for typos or other sporadic inconsistencies, and that then needs cleaning up when there isn't a "systemic" error that established base R/tidyverse functions could fix.

Besides the factor example above, while tide() worked great for the small data frames I tested it on, it didn't do so well on the dataframe I'm currently cleaning up (approx. 230 x 230). tide() correctly recorded the first few edits I made, but then returned code saying that I had entered columns and columns of "NA", which I had not, and it did not record any of the further legitimate edits I made in the editor. Unfortunately, I can't share the actual database, but I'd be happy to do more testing of tide() in the future to help with its development.

For now, I'm manually logging each sporadic correction I make in the database in an R script (e.g. df["id1", "age"] = 30; df["id20", "weight"] = 70; etc. :-( )

Have a good rest of your week. :-)

matt-dray commented 3 months ago

Thanks again for your input. I've added what is hopefully a fix for the factor problem and added issue #9 for NA handling, which I may be able to look at another time. The update isn't well tested; let me know if it works as expected. I've added an example with factors to the README file.

I agree broadly with responses to your StackOverflow post: I would typically try and write a function to take my original dataset and reproducibly create the edits that need to be made. But I appreciate that this is easier when the edits are systematic. I used to work with transcribed ecological data where obvious human errors (usually mine!) needed correcting, but these were often sporadic and only a few lines of cleanup were needed.
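For sporadic fixes like the manually logged ones mentioned earlier in the thread, one option is to record the corrections in a small table and apply them in a loop. This is only a sketch with made-up data; the corrections table and apply_corrections() helper are hypothetical and not part of tide.

```r
# Toy data with row names used as IDs (made-up values)
df <- data.frame(
  age = c(3, 41, 57),
  weight = c(68, 700, 82),
  row.names = c("id1", "id2", "id20")
)

# One row per correction: which cell to change and the new value
corrections <- data.frame(
  id = c("id1", "id2"),
  column = c("age", "weight"),
  value = c(30, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each logged correction in turn
apply_corrections <- function(data, corrections) {
  for (i in seq_len(nrow(corrections))) {
    data[corrections$id[i], corrections$column[i]] <- corrections$value[i]
  }
  data
}

df_clean <- apply_corrections(df, corrections)
df_clean
```

The single value column limits this sketch to corrections of one type (numeric here); mixed types would need one table per type or a list column.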

SugarRayLua commented 3 months ago

Thanks a lot, @matt-dray!

I'll aim to test your update out over the next two weeks and give you feedback.

Perhaps as I get more experienced, I'll better understand how to write functions that capture the edits I'm concerned about.

Have a good weekend 😊