mjskay opened 5 years ago
Or maybe it is simpler to create a function for this (`sample_draws()` and/or `thin_draws()`).
Closing this by deferring to the addition of `sample_draws()` as the right approach. Now that I've implemented it, it seems clear that the add-a-function-to-the-pipeline approach is the right one here. That is much more in keeping with modularity/composability, and it prevents the introduction of other annoying API issues in the future.
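For anyone landing here later, here is a minimal sketch of what that pipeline could look like. `stanfit` is a placeholder for a fitted model object, and `sample_draws()` is assumed to take the number of draws to keep as its second argument:

```r
# Sketch only: tidy all draws first, then subsample them with sample_draws().
library(dplyr)
library(tidybayes)

tidy_draws(stanfit) %>%   # one row per draw, one column per variable
  sample_draws(100)       # keep a random subset of 100 draws
```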
Hi Matthew, it looks like I'm chasing down something similar to what you were considering here, but I'm a little confused by your notes about how best to build what I want.
What I was trying to do: I have a Stan model for which I want to (ideally) do something like `spread_draws(stanfit_object, ..., n = 100)` to make the process of spreading less time-consuming (since it's a big model).
It looks like you have already implemented `sample_draws()`, but it takes an already-tidy data frame of samples as its input. Judging from other investigation of the codebase and from your notes here, is it correct that if I want to do this, I should build my own pipeline that goes `tidy_draws()`, then samples however many draws I want, then `spread_draws()`? I had an initial hunch that tidying the whole model would be expensive, but reading the source for these functions makes it look like `spread_draws()` calls `tidy_draws()` anyway, so it isn't any more expensive than if there were an `n` argument in `spread_draws()`? Am I understanding this right?
Happy to clarify if I'm not making sense. Thanks!
Yeah, good question. I would reconsider adding an `n` argument to `spread_draws()` for exactly this kind of situation --- doing it later in the pipeline could be more costly for large models. If `tidy_draws()` accepted an `n` argument, it could do the subsetting before the tidying, and this might make `spread_draws()` faster.
This also goes along with #82, which would allow you to easily implement the equivalent of `tidy_draws()` however you see fit (including using a subset of the draws) and then use `spread_draws()` on the resulting data frame directly. In the meantime, a hackish solution would be to provide your own implementation of `tidy_draws.data.frame()` that just returns the data frame you pass it. That would be enough to "solve" #82 (without the additional checks for correctness one would want in a full solution). Does that make sense?
There's a (much) longer-term idea here of having the tidybayes functions do some kind of internal query optimization so users don't have to deal with these things, but I only have so much time, so exposing an `n` argument might be the better short-term solution :). That being the case, I'll reopen this.
I'm a little unclear on the original impetus for https://github.com/mjskay/tidybayes/issues/82, so let me see if I'm understanding. Since query optimization is (rightly) off the table, a next-best place to intervene in the pipeline to spread only a sample of draws (in the near term) would be to pass a trimmed-down MCMC matrix to `tidy_draws()`, but that won't work as currently written because `tidy_draws()` wants to check metadata about the original model that would clash with a trimmed-down MCMC matrix? So one could just write a home-use version of it that does less checking against the original model object?
Right, so if you wrote a really simple function like this:
tidy_draws.data.frame = function(model) model
That would allow you to pass arbitrary data frames directly into `spread_draws()`. However, it won't verify that the data frame is in the correct format (which a good implementation would actually do), and if it isn't, `spread_draws()` will almost certainly choke somewhere. But if you constructed your own tidy data frame of draws that had `.chain`, `.iteration`, and `.draw` columns, where each row is one draw (i.e. no draws are spread over multiple rows) and every other column is a parameter from the model, I think you should then be able to pass that tidy data frame directly into `spread_draws()`, as long as you've defined the above function. Note: I haven't tested this.
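As an illustration of the shape described above, here is a hedged sketch (with hypothetical parameter names, and assuming the one-line `tidy_draws.data.frame()` method from earlier is in scope, or that your version of tidybayes already provides one):

```r
# Sketch: a hand-built tidy data frame of draws with .chain, .iteration, and
# .draw columns plus one column per parameter, passed straight into spread_draws().
library(dplyr)
library(tidybayes)

draws_df = tibble(
  .chain     = rep(1:2, each = 3),   # 2 chains x 3 iterations
  .iteration = rep(1:3, times = 2),
  .draw      = 1:6,
  `b[1]`     = rnorm(6),             # parameter columns named the way Stan names them
  `b[2]`     = rnorm(6)
)

# One row per draw of b[i], with an `i` index column:
spread_draws(draws_df, b[i])
```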
The plan for #82 is to basically just write a better version of the above function that (at the very least) verifies the appropriate columns are present and of the right types. I might not do any other verification unless I can keep it fast.
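A rough sketch of what that checking might look like (this is not the actual tidybayes implementation, just an illustration of the idea):

```r
# Sketch: verify the index columns exist and are numeric before letting the
# data frame through as-is.
tidy_draws.data.frame = function(model) {
  required = c(".chain", ".iteration", ".draw")

  missing_cols = setdiff(required, names(model))
  if (length(missing_cols) > 0) {
    stop("draws data frame is missing columns: ", paste(missing_cols, collapse = ", "))
  }

  if (!all(vapply(model[required], is.numeric, logical(1)))) {
    stop("columns ", paste(required, collapse = ", "), " must be numeric")
  }

  model
}
```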
Update: I verified that the above definition will allow you to pass tidy data frames of draws directly into `spread_draws()` (again with the caveat that it can break if you don't make sure the input format is correct).
Update: I've added an implementation of `tidy_draws()` for data frames (so you don't have to do it yourself), and (probably more useful) I've added `n` and `seed` arguments to `spread_draws()` and `gather_draws()`. Let me know if that gives the speedup you need.
If that doesn't work, I'll probably need to update `tidy_draws()` to support `n` and `seed`, which will be a bit more work because it will need model-type-specific implementations to really be useful.
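A usage sketch of those new arguments, with `stanfit` standing in for a fitted model object (the argument names are as stated in this thread; later releases may rename `n`):

```r
# Sketch: subsample 100 draws reproducibly while spreading/gathering.
library(tidybayes)

spread_draws(stanfit, cutpoint[item], dispersion[item], n = 100, seed = 1234)
gather_draws(stanfit, cutpoint[item], dispersion[item], n = 100, seed = 1234)
```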
Neat! This definitely makes the interface easier. Unfortunately, though, I'm not seeing much speed-up, since I think a lot of the bottleneck is happening during `tidy_draws()`, which appears to still be tidying all of the MCMC draws. I wish I knew more about the underlying OOP going on here with the original `tidy_draws()` so I could be more helpful on the dev side. In the meantime I'll put Advanced R back on the to-do list.
Sounds good! :)
I'm curious how slow just calling `tidy_draws()` directly is for you compared to calling `spread_draws()` on (say) one parameter. That would help me know whether the slowdown is in `tidy_draws()` or `spread_draws()`. Depending on the model and the complexity of the spec in `spread_draws()`, it could actually be due to either function.
Incidentally, I just rewrote some pieces of the internals of `spread_draws()`, and it should be much faster on typical specs now than it used to be. @mikedecr, if you have a chance to try the latest GitHub version (via `devtools::install_github("mjskay/tidybayes")`), I'd love to know if that helped at all.
Otherwise, if you let me know what sort of `spread_draws()` spec you are writing (even just the arguments you are passing, so I have an idea of the dimensions of the parameters involved), I could profile the code and better identify where the slowdown is.
This sounds really interesting! I can test drive this tomorrow and give you an example of the kind of thing I'm doing. Thanks!
OK, sorry for the delay. Here's some info about what's going on. I have an IRT-type model with 7,500 total draws of just over 9,000 parameters/transformed parameters/generated quantities. `tidy_draws()` on the whole model takes about 25 seconds. `spread_draws(stanfit, cutpoint[item], dispersion[item])` also takes about 25 seconds, and adding `n = 1` brings it to about 22 seconds. The 3-second decrease is pretty consistent across a few tries. Curious to know what you glean from this!
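For reference, a sketch of how timings like these can be reproduced (assuming `stanfit` is the fitted IRT model; exact numbers will of course vary by machine):

```r
# Rough wall-clock comparison of the tidying and spreading steps.
system.time(tidy_draws(stanfit))
system.time(spread_draws(stanfit, cutpoint[item], dispersion[item]))
system.time(spread_draws(stanfit, cutpoint[item], dispersion[item], n = 1))
```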
Ah great, that's really helpful! The fact that `spread_draws()` and `tidy_draws()` take a similar amount of time (and that this is independent of sample size) tells me the issue is likely due to an operation whose time is proportional to the number of parameters rather than the number of samples, which is really useful to know. Also, given the model structure you've described, I should be able to put together a benchmark case that replicates the slowdown, then profile the code to see where specifically the problem is and fix it.

Thanks so much for getting back to me on this; it's really helpful information!
Followup question: how many unique values of `item` are there here: `spread_draws(stanfit, cutpoint[item], dispersion[item])`?
And secondary followup question: how fast is `spread_draws(stanfit, cbind(cutpoint, dispersion)[item])`?
Sorry, I meant to specify: 40 items. Using `cbind()` doesn't really make a difference for me here, as far as I can tell.
Followup: I think I've narrowed this problem down to `tidy_draws()`, but if you have time for a quick check on your end it might help me be sure (and could give you an okay workaround in the meantime). Can you try:

1. Install the dev version of tidybayes: `devtools::install_github("mjskay/tidybayes")`
2. In a fresh R session, extract the draws first: `draws = tidy_draws(stanfit)`
3. Then spread using the extracted draws, not the original `stanfit` object: `spread_draws(draws, cutpoint[item], dispersion[item])`
Based on my testing, I would expect (2) to be slow but (3) not to be, so long as you have tidied the draws first as above. Is that the case for you?
The problem is that the code for converting samples from the format used by the model into a tidy data frame is slow. The dev version of tidybayes is able to avoid that conversion if you've already done it (as in step 2), so if you do it once up front you shouldn't have to do it again in any subsequent `spread_draws()` calls; those subsequent calls will be fast.

Fortunately, I should be able to make step (2) fast as well, so eventually this all shouldn't matter, but I wanted to make sure I've correctly identified the problem.
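Putting the workaround together, a sketch of the tidy-once workflow described above (assuming the dev version of tidybayes and a fitted model object `stanfit`):

```r
# Do the slow model-to-tidy conversion once, then reuse the tidied draws
# for every subsequent spread.
library(tidybayes)

draws = tidy_draws(stanfit)   # slow step, done only once

# these now skip the conversion and should be fast:
cutpoints   = spread_draws(draws, cutpoint[item])
dispersions = spread_draws(draws, dispersion[item])
```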
Belatedly, yes, I can confirm that spreading is waaay faster with the tidy draws saved ahead of time, even when I take a large number of samples. If the tidying step is something I should have been doing this whole time, even before these changes, I apologize for the oversight on my part! It seems like a game changer going forward though, so thanks a lot!
> Belatedly, yes, I can confirm that spreading is waaay faster with the tidy draws saved ahead of time, even when I take a large number of samples.
Great! Thanks for confirming this.
> If the tidying step is something I should have been doing this whole time, even before these changes, I apologize for the oversight on my part!
Not at all --- in fact until I did some optimization based on your feedback in this issue it would not have worked at all, so thank you for raising the problem!
Add `ndraws`, `seed` to `tidy_draws() %>% sample_draws()`