tmatta / lsasim

Simulate large scale assessment data
6 stars 5 forks source link

improve error when invalid item_parameters is passed to block_design #50

Closed pdbailey0 closed 6 months ago

pdbailey0 commented 6 months ago

I used block_design and passed it a item_parameters with column item that was the integers from 1:nrow(item_parameters) this caused a very difficult to read error.

Error in rep(0, (max(block_len) - length(block_items))) : 
  invalid 'times' argument

There are many possible fixes for this. for my purposes, it worked to just overwrite item to be 1:nrow(item_parameters), but doing that would cause issues if you expected to know the item column outside of block_design, so I made a fork that instead just gives a helpful error an exists.

Minimal working example:

param <- NAEPirtparams::parameters
item_par <- param[param$level %in% 8 & param$subject %in% "Mathematics" & param$year %in% 2015, ]
item_par$item <- sample(1:1e5, nrow(item_par))
item_par <- item_par[1:32,]
# this gives the error I share above about invalid `times` argument
block <- lsasim::block_design(n_blocks = 10, item_parameters = item_par)

My proposed version makes this error

devtools::install_github("pdbailey0/lsasim")
param <- NAEPirtparams::parameters
item_par <- param[param$level %in% 8 & param$subject %in% "Mathematics" & param$year %in% 2015, ]
item_par$item <- sample(1:1e5, nrow(item_par))
item_par <- item_par[1:32,]
block <- lsasim::block_design(n_blocks = 10, item_parameters = item_par)
# Error in lsasim::block_design(n_blocks = 10, item_parameters = item_par) : 
#   the item_parameters argument must have a column named item that contains the integers from one to the number of rows in item_parameters

I think fixed my bug with

item_par$item <- 1:32
# now it runs fine
block <- lsasim::block_design(n_blocks = 10, item_parameters = item_par)

If you like this, I'll propose a pull request for you to review.

wleoncio commented 6 months ago

Hi Paul,

This error message can indeed be a bit cryptic. What happens is that the second argument of the rep() function (i.e., the times argument) can't be negative, which is what happens in your example (see objects below):

Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items)))))
Error in rep(0, (max(block_len) - length(block_items))) : 
  invalid 'times' argument

Browse[1]> block_len
 [1] 0 0 0 0 0 0 0 0 0 0

Browse[1]> block_items
[1] "1"  "11" "21" "31"

Browse[1]> max(block_len)
[1] 0

Browse[1]> length(block_items)
[1] 4

Browse[1]> max(block_len) - length(block_items)
[1] -4

Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items)))))
Error in rep(0, (max(block_len) - length(block_items))) : 
  invalid 'times' argument

The fix on your post resolves this because it produces a block_len with a max value of 4, which results in a valid, non-negative argument for rep()

Browse[1]> block_len
 [1] 4 4 3 3 3 3 3 3 3 3

Browse[1]> block_items
[1] "1"  "11" "21" "31"

Browse[1]> max(block_len) - length(block_items)
[1] 0

Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items)))))
[1]  1 11 21 31

The solution on your fork (https://github.com/pdbailey0/lsasim/commit/1b6a487da228b525d1ad35d821c336355cb8f4f6) works alright, but the problem could also be resolved by changing the way item_block_df$item is created, which is as a sequence from 1 to n_items (i.e., 1:32 in your example). This is the bit of code reponsible for this:

https://github.com/tmatta/lsasim/blob/5b4664d001fad06bf9adb1655ee30093282a7cd6/R/block_design.R#L56-L57

So the following change to line 56 also fixes the problem:

item_block_df$item <- item_parameters[["item"]]

See full reproduction code on https://gist.github.com/wleoncio/25c7467a89380b108204975bd6d81774

From my tests, this doesn't seem to break anything in other places of the package, and I think it offers more flexibility to the block_design() function than forcing the user to use a particular sequence in their item numbering, but I'd like to hear your thoughts on the matter before we decide on a fix.

Thanks again for your valuable contribution!

mollyolaf commented 6 months ago

Thanks for keeping up on this, Waldir! I hope all is well.

Leslie

On Mon, Apr 22, 2024 at 4:37 AM Waldir Leoncio @.***> wrote:

Hi Paul,

This error message can indeed be a bit cryptic. What happens is that the second argument of the rep() function (i.e., the times argument) can't be negative, which is what happens in your example (see objects below):

Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items)))))Error in rep(0, (max(block_len) - length(block_items))) : invalid 'times' argument Browse[1]> block_len [1] 0 0 0 0 0 0 0 0 0 0 Browse[1]> block_items [1] "1" "11" "21" "31" Browse[1]> max(block_len) [1] 0 Browse[1]> length(block_items) [1] 4 Browse[1]> max(block_len) - length(block_items) [1] -4 Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items)))))Error in rep(0, (max(block_len) - length(block_items))) : invalid 'times' argument

The fix on your post resolves this because it produces a block_len with a max value of 4, which results in a valid, non-negative argument for rep()

Browse[1]> block_len [1] 4 4 3 3 3 3 3 3 3 3 Browse[1]> block_items [1] "1" "11" "21" "31" Browse[1]> max(block_len) - length(block_items) [1] 0 Browse[1]> as.numeric(c(block_items, rep(0, (max(block_len) - length(block_items))))) [1] 1 11 21 31

The solution on your fork @.** https://github.com/pdbailey0/lsasim/commit/1b6a487da228b525d1ad35d821c336355cb8f4f6) works alright, but the problem could also be resolved by changing the way item_block_df$item is created, which is as a sequence from 1 to n_items (i.e., 1:32 in your example)*. This is the bit of code reponsible for this:

https://github.com/tmatta/lsasim/blob/5b4664d001fad06bf9adb1655ee30093282a7cd6/R/block_design.R#L56-L57

So the following change to line 56 also fixes the problem:

item_block_df$item <- item_parameters[["item"]]

See full reproduction code on https://gist.github.com/wleoncio/25c7467a89380b108204975bd6d81774

From my tests, this doesn't seem to break anything in other places of the package, and I think it offers more flexibility to the block_design() function than forcing the user to use a particular sequence in their item numbering, but I'd like to hear your thoughts on the matter before we decide on a fix.

Thanks again for your valuable contribution!

— Reply to this email directly, view it on GitHub https://github.com/tmatta/lsasim/issues/50#issuecomment-2068816795, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACSQOEF52VS226TILP5XPVTY6TD4BAVCNFSM6AAAAABGP3QB4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRYHAYTMNZZGU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

pdbailey0 commented 6 months ago

@wleoncio, if you do that, there is no need to merge on 57, you can just cbind the two data sets. Are you sure they're always in the correct order?

wleoncio commented 6 months ago

Hi Paul,

You're right, if item_parameters and item_block_df are in the same order, we could just do a cbind() on L57. The computational penalty for using merge() should be negligible, though, and the latter works even if the datasets are ordered differently.

On the other hand, merge() would break if there are multiple items with the same number. Is that likely? We could check for that beforehand with something like identical(length(item_block_df$item), length(unique(item_block_df$item))) and fail/adapt if necessary.

Perhaps a stronger argument against the current implementation (including the released version) is that it requires the input data (item_parameters) to have an item column. With cbind() we could drop that requirement, but since the only thing that matters for cbind is that the arguments have the same number of rows I'm afraid this could easily introduce hard-to-notice errors. Up to you users to tell me how much trouble it is to have an "item" column on the input. :)

P.S.: Hi @mollyolaf! My pleasure, good to see the package alive and kicking! :)

pdbailey0 commented 6 months ago

@wleoncio this is part of the growing pains of writing a package worth using, others (like me) use it--and sometimes we'll use it incorrectly. Some checking of assumptions makes sense, and I'd be happy to make suggestions, but I don't want to suggest pulls if you want to fix this yourself.

wleoncio commented 6 months ago

I'm currently more inclined to release the fix currently implemented on issue-50 (using merge()). I've added a couple of validations to return an informative error in the following cases:

You can use this script to reproduce the fix and the errors. In any case, the code below should install lsasim version 2.1.4-1713963651 with the proposed fix.

remotes::install_github("tmatta/lsasim", "issue-50")

Anything else we should check?

wleoncio commented 6 months ago

Dear all,

lsasim 2.1.5 is on CRAN with the proposed fix. Closing this for now, but if the problem persists feel free to post here so we can reopen the issue. Thank you again for your invaluable contribution!

pdbailey0 commented 6 months ago

@wleoncio amazing response time. Thank you for your attention!