stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.

rhat and ess updates in analyze #2752

Open avehtari opened 5 years ago

avehtari commented 5 years ago

Summary:

In the analyze service

Description:

#2569 and #2575 moved the effective sample size computation from chains.hpp to the analyze service.

Move the rhat calculation in the same way. In addition, make minor changes to the effective sample size and rhat calculations. These changes make the computations more robust, but in most cases they affect the computed values only a little. In particular, these changes don't include rank normalization, folding, or quantiles (those features will be added later).

The reference implementation in R is at https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/R/monitor.R

Reproducible Steps:

Reference values from the R implementation:

upath <- system.file('unitTests', package='rstan')
f1 <- file.path(upath, 'testdata', 'blocker1.csv')  
f2 <- file.path(upath, 'testdata', 'blocker2.csv')  
fit <- read_stan_csv(c(f1,f2))
monitor(fit)
sims <- as.array(fit)
rh <- rhat(sims[,,1])
checkEquals(rh, 1.007794, tolerance = 0.001); 
eb <- ess_bulk(sims[,,1])
checkEquals(eb, 465.3917, tolerance = 0.001); 
et <- ess_tail(sims[,,1])
checkEquals(et, 349.1353, tolerance = 0.001); 
em <- ess_mean(sims[,,1])
checkEquals(em, 467.8447, tolerance = 0.001); 

The same blocker1.csv and blocker2.csv files are in the stan repo at src/test/unit/mcmc/test_csv_files/ and have already been used in unit tests there, too.

EDIT: updated ess_mean reference value.

avehtari commented 5 years ago

ping @roualdes

roualdes commented 5 years ago

Thanks, @avehtari. I think autocovariance is in stan-dev/math. I'll start working on these this week.

avehtari commented 5 years ago

I think autocovariance is in stan-dev/math

If it's used elsewhere in math or exposed in the Stan language, then it's better to re-implement it in analyze.

betanalpha commented 5 years ago

We had some discussion about this a long, long time ago. On one hand people argued that the math library should be just for the functions exposed to the autodiff library and on the other people argued that it should include all mathematical functionality used in Stan. That latter side won, which is why the autocorrelation and even the Welford accumulators used in the adaptation are in the math library.

That said, pushing everything into math does increase the testing and maintenance burden, whereas having things contained within analyze would help distribute the burden and localize the dependencies. I’d be okay going in that direction if there are no strong objections.

roualdes commented 5 years ago

The reference values from the R implementation were a great idea, thank you. There's just one odd issue with them. When I run the following R code,

library(rstan)
source('monitor.R')
upath <- system.file('unitTests', package='rstan')
f1 <- file.path(upath, 'testdata', 'blocker1.csv')
f2 <- file.path(upath, 'testdata', 'blocker2.csv')
fit <- read_stan_csv(c(f1,f2))
sims <- as.array(fit)
ess_mean(sims[,,1])

I get the result 467.8447. I also get this value in C++ based on my best efforts to update the function stan::analyze::compute_effective_sample_size(). However, this number does not match the number you provided, 467.3676. Would you please confirm or disconfirm the results of this R code? Am I reading the number you provided incorrectly? Other? Thanks.

avehtari commented 5 years ago

I get the result 467.8447.

I now get the same result. My previous result was, by mistake, computed without splitting the chains, i.e., the result you would get from ess_rfun(sims[,,1]). Sorry for the confusion.

roualdes commented 5 years ago

Thanks for double checking.

Let me know what you guys think of the following functional spec. I propose to overload stan::analyze::compute_effective_sample_size() with

double compute_effective_sample_size(std::vector<const double*> draws, std::vector<size_t> sizes)
// and
double compute_effective_sample_size(const Eigen::MatrixBase<Derived>& draws)

which attempts to accomplish two things. The first signature aligns with our discussions on Discourse. The second signature allows for composing stan::analyze::compute_effective_sample_size() with stan::analyze::split_chains(), in the same way that monitor.R does, namely compute_effective_sample_size(split_chains(...)). The arguments to stan::analyze::split_chains() follow the same logic from our Discourse discussion:

Eigen::MatrixXd split_chains(const std::vector<const double*>& draws, const std::vector<size_t>& sizes)

To enable such a separation, without needlessly copying data around, I added an (intended to be) internal function

double effective_sample_size_impl(const Eigen::MatrixBase<DerivedA>& acov, const Eigen::MatrixBase<DerivedB>& chain_mean)

I agree that this adds complexity to what should/could be a simple PR. However, the outcome is that the interfaces can call stan::analyze::compute_effective_sample_size() and stan::analyze::split_chains() separately without being penalized by needless copying of data, and more importantly, this will ease the future effort to adopt ess_bulk, which necessitates the separation of stan::analyze::compute_effective_sample_size() and stan::analyze::split_chains().
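To make the proposed composition concrete, here is a minimal sketch of the splitting step, assuming equal, even chain lengths for brevity; it illustrates the proposal above, not an actual implementation from the PR.

#include <Eigen/Dense>
#include <iostream>
#include <vector>
#include <cstddef>

// Sketch of the proposed split_chains(): copy each input chain into two
// columns of a column-major matrix, giving twice the chains at half the
// length. Assumes equal, even chain lengths.
Eigen::MatrixXd split_chains(const std::vector<const double*>& draws,
                             const std::vector<size_t>& sizes) {
  size_t half = sizes[0] / 2;
  Eigen::MatrixXd out(half, 2 * draws.size());
  for (size_t c = 0; c < draws.size(); ++c)
    for (size_t i = 0; i < half; ++i) {
      out(i, 2 * c) = draws[c][i];             // first half of chain c
      out(i, 2 * c + 1) = draws[c][half + i];  // second half of chain c
    }
  return out;
}

int main() {
  std::vector<double> chain1 = {1, 2, 3, 4}, chain2 = {5, 6, 7, 8};
  std::vector<const double*> draws = {chain1.data(), chain2.data()};
  std::vector<size_t> sizes = {chain1.size(), chain2.size()};
  std::cout << split_chains(draws, sizes) << "\n";  // 2 x 4: four half-chains
  // Intended composition once the Eigen overload exists:
  //   double ess = compute_effective_sample_size(split_chains(draws, sizes));
}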

What do you guys think?

betanalpha commented 5 years ago

Moving towards an API that facilitates chain splitting, for both ESS and Rhat calculations, is the right step forwards.

I’m wondering if internally this all can’t be handled with Eigen maps? In other words, the main API exposes functions that take in a std::vector of double arrays and the first internal step is laying out the memory into Eigen maps which can then be split/passed around without any unnecessary copying. This would have the double bonus of using the memory provided by the client.
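For illustration, a minimal sketch of that first step — wrapping client-owned chain memory in Eigen maps — assuming the pointers-and-sizes input convention discussed in this thread:

#include <Eigen/Dense>
#include <iostream>
#include <vector>
#include <cstddef>

int main() {
  // Client-owned memory: one contiguous array of draws per chain.
  std::vector<double> chain1 = {0.1, 0.2, 0.3, 0.4};
  std::vector<double> chain2 = {1.1, 1.2, 1.3, 1.4};
  std::vector<const double*> draws = {chain1.data(), chain2.data()};
  std::vector<size_t> sizes = {chain1.size(), chain2.size()};

  // First internal step: wrap each chain in an Eigen map. Each map is just a
  // pointer plus a length, so no draw is copied.
  std::vector<Eigen::Map<const Eigen::VectorXd>> chains;
  for (size_t c = 0; c < draws.size(); ++c)
    chains.emplace_back(draws[c], sizes[c]);

  // The maps can then be passed around and used like ordinary Eigen vectors.
  for (const auto& chain : chains)
    std::cout << "chain mean = " << chain.mean() << "\n";
}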

bob-carpenter commented 5 years ago

What do you guys think?

I'm on board because I'm the one who was suggesting something like this in the first place to have a baseline implementation. We probably need to do the same thing with our probability functions to cut down on compile times.

I’m wondering if internally this all can’t be handled with Eigen maps?

Having a T* is a necessary and sufficient condition to create an Eigen::Map<Matrix<T, ...>>. One of the motivations here is the difficulty they've had with PyStan integrating Eigen data types through Cython and the desire to provide a non-Eigen interface.

This would have the double bonus of using the memory provided by the client.

There are competing issues of speed and scalability. The roadblock is that draws come in row by row (draw by draw, that is), but all the analysis is column by column (parameter by parameter). An efficient Eigen transpose involving a copy up front is probably worthwhile versus using non-memory-local Maps everywhere. So it should be faster to transpose once, but it requires twice the dynamic memory overhead, since you then have both the original memory and a copy.

roualdes commented 5 years ago

I’m wondering if internally this all can’t be handled with Eigen maps?

I thought about and tried this, but I convinced myself that it won't work. An Eigen map is created with a pointer to a contiguous block of memory and the dimension(s). The trouble is, as far as I know, std::vector<const double*>& draws will not necessarily keep all arrays next to each other, despite the fact that the elements of each double* are contiguous.

This also means split_chains() necessarily does a copy.

If you've got 5 - 20 minutes to educate me about Eigen maps or convince yourself that I'm on to something, please do.
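A small sketch of the constraint being described, contrasting the per-chain view (which works without copying) with a single all-chains matrix (which, with separately allocated chains, requires a copy):

#include <Eigen/Dense>
#include <vector>
#include <cstddef>

int main() {
  std::vector<double> chain1 = {1, 2, 3, 4};  // two separately allocated
  std::vector<double> chain2 = {5, 6, 7, 8};  // blocks of memory

  // Fine: each chain is contiguous, so a per-chain map is a zero-copy view.
  Eigen::Map<const Eigen::VectorXd> view1(chain1.data(), chain1.size());
  Eigen::Map<const Eigen::VectorXd> view2(chain2.data(), chain2.size());

  // Not possible without copying: a single (iterations x chains) map would
  // need one contiguous block holding both chains back to back, which
  // std::vector<const double*> does not guarantee. Hence the copy:
  Eigen::MatrixXd all(chain1.size(), 2);
  all.col(0) = view1;
  all.col(1) = view2;
}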

Here's a link to my work-in-progress branch, which is on my fork of stan.

betanalpha commented 5 years ago

I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting thereof, as a collection of vector maps.

I guess it all comes down to how the autocorrelations are handled. Welford accumulator estimators can run on each chain segment and hence don’t need everything to be in one big contiguous block of memory. Plus the individual estimators can be computed in parallel. On the other hand, FFT estimators, as currently employed, probably need the contiguity.

Anyways, to maintain flexibility moving forwards I think we want to have something that flows like

input chain segments -> splitting into more chain segments -> analysis code that reads in an arbitrary number of chain segments

In other words any copying/aggregation into contiguous memory shouldn’t happen until the analysis code. In that way we can rewrite the underlying analysis code to avoid copying in the future if desired.

bob-carpenter commented 5 years ago

I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting there of, as a collection of vector maps.

That seems fine. It's the locality of the individual chains that really matters here as that's the unit at which we use the data.

I guess it all comes down to how the autocorrelations are handled. Welford accumulator estimators can run on each chain segment and hence don’t need everything to be in one big contiguous block of memory.

Right. It's just that reading out of contiguous memory is faster.

Plus the individual estimators can be computed in parallel. On the other hand FFT estimators, as are currently employed, probably need the continuity.

They assume an Eigen structure as input---it may not need to be literally stored as a dense matrix, but if it's not, it'll probably make an internal copy for efficiency. So if the FFT's already going to do that, we might as well do it ourselves to save it the trouble and let us use the copy elsewhere.

Anyways, to maintain flexibility moving forwards I think we want to have something that flows like

input chain segments -> splitting into more chain segments -> analysis code that reads in an arbitrary number of chain segments

In other words any copying/aggregation into contiguous memory shouldn’t happen until the analysis code. In that way we can rewrite the underlying analysis code to avoid copying in the future if desired.

The copying/aggregation into memory is happening in RStan, for example, on an iteration-by-iteration basis. In CmdStan, we just write to disk, but it's still local iteration-by-iteration.

I think reading all the data in and transposing then calculating R-hat, ESS, etc. will be more efficient. It's just that it's also more memory intensive.

roualdes commented 5 years ago

I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting thereof, as a collection of vector maps.

The way I've coded it, if you don't call stan::analyze::split_chains(), then it uses vector maps and the memory provided by the client. On the other hand, I couldn't figure out how to reshape a map. So anytime stan::analyze::split_chains() is called, something closer to

reading all the data in and transposing then calculating R-hat, ESS, etc

is happening.

As it's my understanding that I'm quite close to respecting the discussion points so far, I'm going to push on. I'm going to finish up rhat, some documentation, and a couple of other things before I submit a PR.

Thank you all for your time.

betanalpha commented 5 years ago

The way I've coded it, if you don't call stan::analyze::split_chains(), then it uses vector maps and the memory provided by the client. On the other hand, I couldn't figure out how to reshape a map.

I’m not sure why you’d have to deal with any reshaping — splitting the chains can be accomplished by creating a new collection of std::vectors that point to segments of the input std::vector memory.
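Roughly what that pointer-only splitting might look like, assuming the pointers-and-sizes convention from the Discourse discussion; the helper name is hypothetical and this is only a sketch of the idea.

#include <vector>
#include <cstddef>

// Hypothetical helper: build new pointer/size vectors that reference halves
// of the caller's chain memory. No draws are copied; only the small vectors
// of pointers and sizes are new.
void split_chain_pointers(const std::vector<const double*>& draws,
                          const std::vector<size_t>& sizes,
                          std::vector<const double*>& split_draws,
                          std::vector<size_t>& split_sizes) {
  for (size_t c = 0; c < draws.size(); ++c) {
    size_t half = sizes[c] / 2;
    // First half starts where the chain starts; the second half starts far
    // enough in that the middle draw is skipped when the length is odd.
    split_draws.push_back(draws[c]);
    split_draws.push_back(draws[c] + (sizes[c] - half));
    split_sizes.push_back(half);
    split_sizes.push_back(half);
  }
}

int main() {
  std::vector<double> chain = {1, 2, 3, 4, 5, 6};
  std::vector<const double*> draws = {chain.data()};
  std::vector<size_t> sizes = {chain.size()};
  std::vector<const double*> split_draws;
  std::vector<size_t> split_sizes;
  split_chain_pointers(draws, sizes, split_draws, split_sizes);
  // split_draws/split_sizes now describe the half-chains {1,2,3} and {4,5,6}
  // and can be fed to the same pointers-and-sizes analysis signature.
}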

So anytime stan::analyze::split_chains() is called, something closer to

reading all the data in and transposing then calculating R-hat, ESS, etc

is happening

This worries me a bit — the implementation of analyze should be independent of any possible intermediate splitting. If split chains might not be contiguous, and require copying into contiguous memory, then the initial chains might also not be contiguous and require the copy.

In other words, analyze shouldn’t know whether the chains were split or not. It just consumes an array of chain segments and runs the diagnostics ignorant of where those segments came from.

In any case this can be addressed in the PR if you prefer.

roualdes commented 5 years ago

splitting the chains can be accomplished by creating a new collection of std::vectors that point to segments of the input std::vector memory

That's a reasonable way to split chains in order to use the memory provided by the client. But I still don't see how to use the memory provided by the client and separate split_chains() from compute_effective_sample_size(). Even if split_chains() created a new collection of std::vectors that point to segments of the input std::vector memory, you'd still need to return a second std::vector that records the sizes of the new (halved) chains -- or else call split_chains() within the scope of compute_effective_sample_size(), which I'm trying to avoid.

I guess one could have split_chains() return an Eigen map of a new collection of std::vectors that point to segments of the input std::vector memory. But it's my understanding that a copy would be incurred as soon as an Eigen Map is returned from a function. If somebody knows something about the interplay between functions that return an Eigen map and return value optimization, please share.

Without forcing split_chains() to have multiple return values (std::pair or something), compute_effective_sample_size() still needs both the number of chains and the length of each chain; its arguments are const std::vector<const double*>& draws, const std::vector<size_t>& sizes.

I keep trying to do the calculations on Eigen data structures since an Eigen::MatrixXd carries along both the number of rows (iterations) and columns (number of chains). But reshaping an Eigen Map isn't obvious to me.

This worries me a bit — the implementation of analyze should be independent of any possible intermediate splitting.

That's fair, and I hope it was only due to my lack of clarity. Neither of the two functions I'm proposing to add, Eigen::MatrixXd split_chains(const std::vector<const double*>& draws, const std::vector<size_t>& sizes) nor double compute_effective_sample_size(const Eigen::MatrixBase<Derived>& draws) require knowledge of whether or not its arguments were/are split.

Under my proposal, if an interface wants to first split chains and then calculate ESS all in C++, they would call compute_effective_sample_size(split_chains(...)). As it stands, split_chains() copies into an Eigen::MatrixXd with double the number of chains and half the number of iterations. As far as I know, Eigen matrices are in column-major order. Thus, if the user calls split_chains() first, then all calculations would be performed on contiguous memory with respect to each chain. It's my understanding that this is essentially the transpose that Bob was speaking about.

On the other hand, if the user does not call split_chains() first, then the candidate function is (the already available) double compute_effective_sample_size(std::vector<const double*> draws, std::vector<size_t> sizes). This function, too, does not require knowledge of whether or not the chains were/are split, but it does directly use the memory provided by the client via Eigen vector maps.

betanalpha commented 5 years ago

But this requires two implementations of the analyze function, right? That’s what I want to move away from — it increases the maintenance and testing burden significantly and will add inertia to future tweaks and improvements.

std::vector<double*> is cheap, so is std::vector. Consequently there’s no issue with split_chains creating new versions of each to pass onto the analyze functions. Unless I’m missing something?

The “transpose” Bob is referring to comes about when one wants to use the FFT to compute the autocovariances of each parameter. If we move away from that functionality then we can do everything without needing memory locality between the chains and work with the contiguity guaranteed by the std::vectors alone.

bob-carpenter commented 5 years ago

But this requires two implementations of the analyze function, right? That’s what I want to move away from — it increases the maintenance and testing burden significantly and will add inertia to future tweaks and improvements.

std::vector<double*> is cheap, so is std::vector.

I wouldn't call either cheap in the sense that both manage their own heap-based memory. But if this is a std::vector<double> where the vector is a vector of draws, each of which is a double, then that's going to be super slow for any parameter-wise operations. Now we could go through and do something like streaming updates Welford-style, but when we have a lot of parameters, we'd have to be careful to stream in and out the accumulators efficiently.

Consequently there’s no issue with split_chains creating new versions of each to pass onto the analyze functions. Unless I’m missing something?

You mean you'd have a split view of some kind? Unlike strings, there's no efficient way to create a view. But we could do that some other way without literally splitting.

Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.

The “transpose” Bob is referring to comes about when one wants to use the FFT to compute the autocovariances of each parameter.

It's literally a transpose if the draws are organized as a (draw x parameter) matrix.

If we move away from that functionality then we can do everything without needing memory locality between the chains and work with the contiguity guaranteed by the std::vectors alone.

What's contiguous, draws, or parameters?

betanalpha commented 5 years ago

I wouldn't call either cheap in the sense that both manage their own heap-based memory. But if this is a std::vector<double> where the vector is a vector of draws, each of which is a double, then that's going to be super slow for any parameter-wise operations. Now we could go through and do something like streaming updates Welford-style, but when we have a lot of parameters, we'd have to be careful to stream in and out the accumulators efficiently.

I think we’re talking past each other here.

The interfaces are supposed to supply their chains as a std::vector<double*> of addresses pointing to the contiguous memory where each chain's states begin, along with a std::vector defining the chain lengths, right? They’re never passing the states themselves, just addresses to the relevant chunks of memory.

Consequently the std vectors themselves don’t take up much space and are cheap to brute force copy, no?

You mean you'd have a split view of some kind? Unlike strings, there's no efficient way to create a view. But we could do that some other way without literally splitting.

I’m thinking of this as the interfaces supplying vectors of addresses to contiguous blocks of memory that hold the chain values. Consequently it’s cheap to create new vectors with additional addresses corresponding to where the chains are split.

Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.

Right, this is another issue. If we can break down the analysis code to handle one parameter at a time then we can drastically reduce the memory overhead. There are streaming algorithms for everything needed (I have code for autocorrelations, etc) and that would really boost the scalability of the functionality.
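For readers unfamiliar with the idea, here is a standard Welford-style streaming accumulator for the mean and variance; it is only meant to illustrate the one-draw-at-a-time style of estimator being discussed, not the autocorrelation code Michael mentions.

#include <cstddef>
#include <iostream>

// Classic Welford accumulator: updates the running mean and sum of squared
// deviations one value at a time, so an entire chain never has to be in
// memory at once.
class welford_accumulator {
 public:
  void add(double x) {
    ++n_;
    double delta = x - mean_;
    mean_ += delta / n_;
    m2_ += delta * (x - mean_);
  }
  double mean() const { return mean_; }
  double sample_variance() const { return n_ > 1 ? m2_ / (n_ - 1) : 0.0; }

 private:
  std::size_t n_ = 0;
  double mean_ = 0.0;
  double m2_ = 0.0;
};

int main() {
  welford_accumulator acc;
  for (double x : {1.0, 2.0, 4.0, 8.0})
    acc.add(x);  // draws can stream in one at a time, per chain segment
  std::cout << acc.mean() << " " << acc.sample_variance() << "\n";  // 3.75 9.58333
}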

What's contiguous, draws, or parameters?

Both within a chain?

roualdes commented 5 years ago

But this requires two implementations of the analyze function, right? That’s what I want to move away from.

Yes. And I agree, this design choice suffers from all the faults you mention. I'm just struggling to see an alternative that doesn't involve overloading compute_effective_sample_size() in even more complex ways.

@betanalpha, would you please be a little more explicit about how you propose to keep split_chains() and compute_effective_sample_size() separate? The ESS calculation will need the lengths of the halved chains, but I don't want to force compute_effective_sample_size() to always split within the function itself. compute_effective_sample_size() shouldn't do the splitting, since the functions within monitor.R show us that ess_mean() and ess_bulk() need the splitting and the ESS calculation to happen independently of each other.

bob-carpenter commented 5 years ago

The interfaces are supposed to supply their chains as a std::vector<double*> of addresses pointing to the contiguous memory where each chain's states begin, along with a std::vector defining the chain lengths, right?

I still don't know the memory layout, so let's walk through an example. Let's say I have two chains of three draws and three parameters each:

chain, iteration, theta1, theta2, theta3
1, 1, v1, v2, v3
1, 2, v4, v5, v6
1, 3, v7, v8, v9

2, 1, v10, v11, v12
2, 2, v13, v14, v15
2, 3, v16, v17, v18

How do the 18 values get laid out in memory?

If we're reading out of the sampler produced CSV, these are stored row-major as above, though the chains would be in separate files.

For analysis of things like R-hat and ESS, we need access to the values in column-major order unless you're anticipating some kind of Welford-like sweep. That might be more efficient than transposing, but I doubt it when there are more than a hundred or two hundred parameters.

They’re never passing the states themselves, just addresses to the relevant chunks of memory. Consequently the std vectors themselves don’t take up much space and are cheap to brute force copy, no?

Just copying the pointer (reference) to a std::vector is O(1). Making a copy is O(N) in both time and space where N is the size of the vector. This requires malloc on the heap and calls to both constructor and destructor before all is said and done. But the copy itself is relatively cheap as it's all just contiguous memory.

I’m thinking of this as the interfaces supplying vectors of addresses to to contiguous blocks of memory that hold the chain values. Consequently it’s cheap to create new vectors with additional addresses corresponding to where the chains are split.

OK, that makes sense.

Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.

There are streaming algorithms for everything needed (I have code for autocorrelations, etc) and that would really boost the scalability of the functionality.

OK, that makes sense no matter what we do because the streaming algorithms are also more numerically stable.

What's contiguous, draws, or parameters?

Both within a chain?

See above. We can't have both because memory is 1D. We get either

row major: v1, v2, v3, ..., v18

or

column major: v1, v4, v7, ... v18
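In index terms, using the example above (M = 3 iterations and K = 3 parameters within one chain), the two layouts put draw m of parameter k at different flat positions; a tiny sketch for concreteness:

#include <cassert>
#include <cstddef>

int main() {
  const std::size_t M = 3;  // iterations within one chain (example above)
  const std::size_t K = 3;  // parameters

  // Flat index of draw m of parameter k within one chain, 0-based.
  auto row_major = [=](std::size_t m, std::size_t k) { return m * K + k; };  // v1, v2, v3, ...
  auto col_major = [=](std::size_t m, std::size_t k) { return k * M + m; };  // v1, v4, v7, ...

  // theta3 at iteration 1 (m = 0, k = 2) is v3: the third value row-major,
  // but the seventh value (after all of theta1's and theta2's draws)
  // column-major.
  assert(row_major(0, 2) == 2);
  assert(col_major(0, 2) == 6);
}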

betanalpha commented 5 years ago

Ah, okay, I see the main point of confusion. I was focusing on the chains and you have been talking about some of the issues within the chains.

Right now the agreement was that the interfaces would provide vectors of pointers to memory where the states of each chain are held. The chain memory is required to be contiguous.

I’m not sure that we made any agreement about how the chain states themselves should be organized, but we definitely should in order to move forwards!

There are two options for within chain organization. We could group by iteration,

param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...

or by parameter,

param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...

The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted. The former drastically facilitates chain splitting and streaming summary calculations.

If we group by iteration then the analysis code would have to “transpose” at least once, requiring a copy and putting the entire chain contents into memory. Splitting functions could still be implemented in a natural way, selecting the appropriate values to reconstruct the subchain states grouped by iteration in a sequence of copies before each subchain is “transposed” within the analysis code.

If we group by parameter then the analysis code could run on the input memory directly requiring no copies. Splitting functions would then be straightforward to implement for a single parameter, but more troublesome if the parameters had to be split and then reconstructed into a single block of contiguous chain memory. This approach would also put more burden on the interfaces to corral their output into the marginal parameter values.

This raises one more possibility — the analysis functions, including the splitting, operate on a single parameter at a time. The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just

param1_iter1, param1_iter2, …

Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).

This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan CSV output and not too problematic from RStan and PyStan, which already have everything in memory.
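As a rough sketch of how an interface might drive this third option, assuming per-chain, per-parameter contiguous storage and the pointers-and-sizes ESS signature quoted earlier in the thread (declaration only here; the implementation lives in the Stan library, and the interface-side container is hypothetical):

#include <vector>
#include <cstddef>

// Signature as quoted earlier in the thread; implementation not shown.
namespace stan {
namespace analyze {
double compute_effective_sample_size(std::vector<const double*> draws,
                                     std::vector<size_t> sizes);
}  // namespace analyze
}  // namespace stan

// Hypothetical interface-side storage: chains[c][k] holds the contiguous
// draws of parameter k from chain c (the third option's layout).
std::vector<double> ess_per_parameter(
    const std::vector<std::vector<std::vector<double>>>& chains) {
  size_t num_params = chains[0].size();
  std::vector<double> ess(num_params);
  for (size_t k = 0; k < num_params; ++k) {
    std::vector<const double*> draws;  // one pointer per chain,
    std::vector<size_t> sizes;         // parameter k only
    for (const auto& chain : chains) {
      draws.push_back(chain[k].data());
      sizes.push_back(chain[k].size());
    }
    ess[k] = stan::analyze::compute_effective_sample_size(draws, sizes);
  }
  return ess;
}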

Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to

a) Move towards analyzing each parameter/random variable/pushforward distribution one by one, which is where things are heading.

b) Avoid obstructing the interfaces' ability to scale to large outputs, at least as much as possible.

Any thoughts?

betanalpha commented 5 years ago

Any thoughts? I don't want to obstruct current work, but I do think that if we can figure out the right conventions now then we will significantly facilitate future work. Especially before any of the interfaces start using these functions and freeze in a convention by default.

roualdes commented 5 years ago

Your third idea seems best to me, but possibly because I can't differentiate it from what already exists. This is likely just my misunderstanding. Would you mind giving a short statement about how your third idea is different from what is in stan-dev/stan develop?

I need to review in more detail how each interface stores its chains. Just assuming each interface's maintainer will work it out does not make for a great holistic view. I should have better answers to Bob's questions about memory layout. After this semester ends, my time will free back up.

bob-carpenter commented 5 years ago

I missed this earlier.

There are two options for within chain organization. We could group by iteration,

param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...

or by parameter,

param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...

The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.

I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.

There's also the issue of how to order by chain when there are multiple chains, but that's less of an issue.

Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.

The former drastically facilitates chain splitting and streaming summary calculations.

Isn't that the latter? I just think you get the order reversed here.

...

This raises one more possibility — the analysis functions, including the splitting, operate on a single parameter at a time.

I think that'd be cleanest to implement. Everything's done marginally anyway. If we get some multivariate summaries, we could add those later.

The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just

param1_iter1, param1_iter2, …

Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).

We also need to know how to index the chains.

This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.

Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to

a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.

b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.

For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.

betanalpha commented 5 years ago

Your third idea seems best to me, but possibly because I can't differentiate it from what already exists. This is likely just my misunderstanding. Would you mind giving a short statement about how your third idea is different from what is in stan-dev/stan develop?

It depends on the exact structure of the chains currently being used.

I need to review in more detail how each interface stores its chains. Just assuming each interface's maintainer will work it out does not make for a great holistic view.

We don’t want to make using these functions onerous for the interface clients, but at the same time this is our opportunity to establish a useful convention moving forwards to anchor the interfaces.

I should have better answers to Bob's questions about memory layout.

Establishing a good convention will definitely be the first step forwards.

After this semester ends, my time will free back up.

This definitely sounds like a summer project.

betanalpha commented 5 years ago

There are two options for within chain organization. We could group by iteration,

param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...

or by parameter,

param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...

The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.

I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.

Sorry, you’re correct.

Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.

This requires putting everything into memory to do the transpose. What I’m hoping to set up is a convention that avoids having to transpose anything if the interface doesn’t want to.

The former drastically facilitates chain splitting and streaming summary calculations.

Isn't that the latter? I just think you get the order reversed here.

Yup.

The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just

param1_iter1, param1_iter2, …

Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).

We also need to know how to index the chains.

Does anything depend on the order of the chains? The calculations are all exchangeable, right?

This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.

Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to

a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.

b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.

For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.

Right — again I’m thinking of CSVs on the command line. awk will scan through and dump only a single column into memory very efficiently without having to put everything into memory at any intermediate step. Not sure if that’s interpretable as an implicit transpose or not, but it demonstrates that it shouldn’t be hard for interfaces to prepare marginal chains efficiently.

The diagnostics could even then be parallelized...

bob-carpenter commented 5 years ago

There are two options for within chain organization. We could group by iteration,

param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...

or by parameter,

param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...

The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.

I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.

Sorry, you’re correct.

Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.

This requires putting everything into memory to do the transpose.

Only the memory for the transposed form ever gets loaded into memory. It doesn't do a load then a transpose, which would be terrible. But it does require the full output to be loadable.

What I’m hoping to set up is a convention that avoids having to transpose anything if the interface doesn’t want it to do.

The transposition has to happen either implicitly or explicitly because the output comes per iteration and the analysis happens per parameter. What we could do, though, is just read one parameter at a time out of the CSV files. That'd be most scalable, but it'd be a lot of file system and I/O pressure to keep scanning CSV files. At some point, you'd want to move to a database with appropriate indexing.

...

We also need to know how to index the chains.

Does anything depend on the order of the chains? The calculations are all exchangeable, right?

Nope. Other than perhaps linking them up to their adaptation parameters, etc.

This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.

Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to

a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.

b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.

For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.

Right — again I’m thinking of CSVs on the command line. awk will scan through and dump only a single column into memory very efficiently without having to put everything into memory at any intermediate step.

OK, we're on the same page then.

Not sure if that’s interpretable as an implicit transpose or not,

It's even worse in that all the data gets read from the file system for each column. At least I don't think awk is clever enough to memory map. But even if it memory mapped, it'd be a paging disaster with files with lots of parameters.

So I'm not sure we're going to find file sizes where this will work. They'd have to be big enough to prohibit reading into memory, at which point, scanning the files multiple times may be prohibitive. Have you tried it?

betanalpha commented 5 years ago

The transposition has to happen either implicitly or explicitly because the output comes per iteration and the analysis happens per parameter. What we could do, though, is just read one parameter at a time out of the CSV files. That'd be most scalable, but it'd be a lot of file system and I/O pressure to keep scanning CSV files. At some point, you'd want to move to a database with appropriate indexing.

Sure. A lot will depend on how efficient the interface readers are, but given how incredibly fast awk is it should be possible to scan through individual parameters efficiently.

Not sure if that’s interpretable as an implicit transpose or not,

It's even worse in that all the data gets read from the file system for each column. At least I don't think awk is clever enough to memory map. But even if it memory mapped, it'd be a paging disaster with files with lots of parameters.

So I'm not sure we're going to find file sizes where this will work. They'd have to be big enough to prohibit reading into memory, at which point, scanning the files multiple times may be prohibitive. Have you tried it?

I’ve definitely run into problems where parsing parameters one at a time through awk is much faster than trying to manage the entire memory burden. This is common with models that contain lots of generated quantities useful for prediction but not for diagnostics. In other words the parameter-by-parameter analysis is nicely compatible with filtered diagnostics that consider only a subset of parameters.

roualdes commented 5 years ago

I'm still a few posts back, so bear with me as I try to catch up. RStan and PyStan both store draws x chains x parameters. R is always column-major. PyStan is column-major also, by forcing Fortran ordering via numpy. Based on this thread it sounds like CmdStanPy will also be column-major via Pandas. So the answer to

How do the 18 values get laid out in memory?

is

column major: v1, v4, v7, ... v18

for RStan, PyStan, and CmdStanPy.

Lately, we are more interested in CSVs on the command line (presumably with CmdStan?). The dangling question is, do we need to read all draws into memory and transpose, or is Awk fast enough to pull out specific columns of each CSV so as to store multiple chains of a single parameter? Here are some ballpark numbers from a quick experiment with essentially the following Stan program, 5,000,000 samples, and 1 chain:

generated quantities {
  real z[10];
  for (i in 1:10)
    z[i] = normal_rng(0, 1);
}

Reading in the Stan output CSV file using numpy.loadtxt() and forcing column-major ordering (a transpose) took approximately 50 seconds. Reading in the third column of the same CSV file with Awk via

time awk -F "\"*,\"*" 'FNR>31 {print $3}' samples1.csv > /dev/null 2>&1

took roughly 4.5 seconds, with an additional ~half-second for each next column. To read in the 12th column, the above Awk command took about 9.5 seconds. In total that's about 70 seconds for columns 3 through 12, roughly 20 seconds slower than reading the entire CSV + transpose. This is by no means intended to be a conclusive experiment, especially since I just learned enough Awk to do this. Instead, I'm just putting some numbers on the board.

Now, the above question, about Awk versus reading everything into memory and transposing, implicitly assumes that we've agreed on Michael's third option where

each chain pointer points to a contiguous block of memory holding just param1_iter1, param1_iter2, …

It doesn't seem like anybody disagrees about this, but I think we should at least make it explicit that we've agreed upon this much, if we in fact agree upon this. I'm all for this third option.

bob-carpenter commented 5 years ago

CmdStan stores a single chain per file, with memory locality by iteration. That is, with M draws and K parameters, we have the order

theta[1, 1], theta[1, 2], ..., theta[1, K], theta[2, 1], ..., theta[M, 1], ..., theta[M, K]

So that gives us the ordering

CmdStan CSV file layout: (chain, iteration, parameter)

When you run RStan, the fit object you get back is a huge structure. The draws are stored within a list of chains, where each chain is a list of parameters, and each parameter is an array of values indexed by iteration. So that gives us

RStan fit object layout: (chain, parameter, iteration)

I don't even know that much awk. It's going to scale worse as the number of parameters grows as it'll have to page over more of the file to get a single column.

bob-carpenter commented 5 years ago

In order to have the posterior processing be efficient (R-hat, n_eff, quantiles, etc.), the iteration needs to be the fastest moving index---that keeps a parameter local within a chain. The order of chain vs. parameter isn't so important, I don't think.

betanalpha commented 5 years ago

Lately, we are more interested in CSVs on the command line (presumably with CmdStan?). The dangling question is, do we need to read in all draws into memory and transpose, or is Awk fast enough to pull out specific columns of each CSV so as to store multiple chains of a single parameter? Here's some ballpark numbers from a quick experiment of essentially the following Stan program with 5_000_000 samples and 1 chain:

Could you try again with `awk -F','`, just in case the regular expression is slowing the parsing down too much?

My concern isn’t so much super long chains — chains requiring 5e6 samples imply a model that’s fitting way too slowly and should be implemented better. Chains that long should be artifacts from the BUGS days!

In other words I think the more pressing issue is many columns corresponding to high-dimensional parameter spaces and lots of transformed parameters or generated quantities. Depending on which parsing tool one uses the speed considerations will more or less be the same, but the memory requirements will be a bit different.

In any case, I think that memory will end up being the biggest burden. I have numerous times encountered applications where the Stan output is too big to fit into memory at once and crashes RStan and/or PyStan (one chain but hundreds of parameters and lots of posterior predictions). The memory problems are even worse for people working on machines with less RAM or clusters with active memory limits.

Even if the repeated parsing becomes slower it upper bounds the memory burden at any given time which allows us to scale to any number of parameters being analyzed. The parsing and analysis can even be done in parallel to speed things along.

Now, the interfaces might not need to take advantage of this feature, and probably wouldn’t if they continue to require having everything in memory at once, but moving forward with this approach in the Stan code will ensure that the interfaces could transition into different storage paradigms in the future as needed without any obstruction from the Stan services themselves.

Now, the above question, about Awk versus reading everything into memory and transposing, implicitly assumes that we've agreed on Michael's third option where

each chain pointer points to a contiguous block of memory holding just param1_iter1, param1_iter2, …

It doesn't seem like anybody disagrees about this, but I think we should at least make it explicit that we've agreed upon this much, if we in fact agree upon this. I'm all for this third option.

Definitely the first step is making this decision!

betanalpha commented 5 years ago

Agreed — what I like about these service methods operating on one parameter at a time is that they force the interfaces to manipulate their storage into this format and don’t put any burden on the Stan side.

On May 22, 2019, at 10:45 AM, Bob Carpenter notifications@github.com wrote:

In order to have the posterior processing be efficient (R-hat, n_eff, quantiles, etc.), the iteration needs to be the fastest moving index---that keeps a parameter local within a chain. The order of chain vs. parameter isn't so important, I don't think.


roualdes commented 5 years ago

Thanks, Bob, for catching my errors. My post 6 days ago did not dive deep enough into the internals of any interface. Here's another go, with references to chunks of code for two of the interfaces that make apparent the layout that Bob pointed out.

RStan's internal function get_kept_samples() highlights well the layout (chains, parameters, iterations). Similarly, PyStan's internal function of the same name get_kept_samples() is a relatively easy place to identify the layout (chains, parameters, iterations). For CmdStan, I haven't found a single chunk of code that indicates so clearly the layout that Bob mentions (chains, parameters, iterations).

Since RStan and PyStan's final dimension is a std::vector<double>, this should ensure that within each chain a single parameter's draws are contiguous. These interfaces then satisfy the locality issue for Michael's third option, correct?

CmdStan maintains all samples in the type Eigen::Matrix<Eigen::MatrixXd, Dynamic, 1> for the layout (chains, iterations, parameters), which as far as I can tell is essentially a vector of Eigen::MatrixXds. Each Eigen::MatrixXd is then laid out as (iterations, parameters). Since Eigen is column-major, this too satisfies the locality issue for Michael's third option, correct?
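To make the locality point concrete, here's a small sketch, assuming the layouts described above (the sizes and the parameter index are arbitrary), of how one parameter's draws within one chain could be exposed as a raw pointer in either case:

#include <Eigen/Dense>
#include <vector>

int main() {
  // RStan/PyStan-style: the innermost container is std::vector<double>,
  // so one chain's draws of one parameter are already contiguous.
  std::vector<double> param_draws(1000);
  const double* from_vector = param_draws.data();

  // CmdStan-style: each chain is an Eigen::MatrixXd of (iterations, parameters).
  // Eigen is column-major by default, so each parameter's column is contiguous.
  Eigen::MatrixXd chain(1000, 50);
  const double* from_matrix = chain.col(3).data();  // draws of parameter index 3

  (void)from_vector;
  (void)from_matrix;
  return 0;
}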

Could you try again with `awk -F','` just in case the regular expression is slowing the parsing down too much?

Yes, though I didn't notice any significant difference in the run times. Further, I tried an example with 5000 iterations from

generated quantities {
  real z[5000];
  for (i in 1:5000)
    z[i] = normal_rng(0, 1);
}

Awk was orders of magnitude slower than Numpy's loadtxt() function. For instance, with Awk, reading in just the 2500th column from a single chain takes my machine just over 2 seconds; with Numpy, I could read in all chains' samples in just under 50 seconds.

On the other hand, with more iterations it was relatively easy to overwhelm my machine by loading all samples into memory. Awk did not hit this memory constraint in the examples I tried.

riddell-stan commented 5 years ago

@roualdes a small aside re:

RStan and PyStan both store draws x chains x parameters. R is always column-major. PyStan is column-major also, by forcing Fortran ordering via numpy.

Based on @bgoodri's recommendation, the plan has been to move to params x draws x chains at some point in the future.


I've been meaning to try to use compute_effective_sample_size in the pystan 3 alpha. I hope to have time to work on this in June. I'll share any feedback I have, of course.

sakrejda commented 5 years ago

From @bob-carpenter

I don't even know that much awk. It's going to scale worse as the number of parameters grows as it'll have to page over more of the file to get a single column.

OTOH you only have to touch each column once, and if you know that you're touching contiguous sets of columns then the awk/file-reading approach scales fine (you only have to find the offset once per row per set, and after the first hit you're just counting commas). Those sets can be saved into more convenient formats so you only take the hit once. If you store CmdStan output in a binary format, the file reading is all pre-indexed, as it would be in a DB, so again you only take the hit once.
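For example, here's a rough C++ analogue of the comma-counting idea. This is just a sketch, not the reader from the Discourse thread, and it assumes plain comma-separated numeric rows after Stan's comment and header lines:

#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Extract one zero-based column from a Stan CSV by counting commas per line.
std::vector<double> read_column(const std::string& path, std::size_t col) {
  std::vector<double> out;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    // Skip comment lines and the header row of parameter names.
    if (line.empty() || line[0] == '#'
        || !(std::isdigit(static_cast<unsigned char>(line[0])) || line[0] == '-'))
      continue;
    std::size_t start = 0;
    for (std::size_t field = 0; field < col; ++field) {  // seek to the target field
      start = line.find(',', start);
      if (start == std::string::npos) return out;        // malformed row; bail out
      ++start;
    }
    out.push_back(std::stod(line.substr(start)));        // stod stops at the next comma
  }
  return out;
}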

roualdes commented 5 years ago

the plan has been to move to params x draws x chains at some point in the future.

@riddell-stan, thanks for the comment. Just for clarity, and because I previously failed to make this distinction, is the plan to move the user-facing structure or the internal structure of samples to (parameters, iterations, chains)?

For instance, right now, if a user fits a model named fit and then in PyStan calls fit.extract(permuted=False) or in RStan calls extract(fit, permuted=FALSE), then you get out an object of dimensions (iterations, chains, parameters). However, and more importantly for the topic of this thread, internally (as far as I can tell) both RStan and PyStan store the samples as (chains, parameters, iterations).

If the plan is to move the user facing structure to (parameters, iterations, chains), but maintain the same internal structure, then I don't think this plan will pertain to our current discussion.

bob-carpenter commented 5 years ago

I'm not sure what you mean by contiguous columns. If the data is organized on file in row-major format (as it is in our CSV outputs and as it comes off the samplers), it looks like this:

theta[1, 1], ..., theta[1, N], theta[2, 1], ..., theta[2, N], ...

where theta[m, n] is the value of parameter n on draw m. The values for column n,

theta[1, n], theta[2, n], ...

will not be contiguous on the file system.
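Here's a tiny sketch of the index arithmetic (M and N here are arbitrary), just to make the stride explicit:

#include <cstddef>
#include <vector>

int main() {
  const std::size_t M = 1000;       // draws
  const std::size_t N = 50;         // parameters per draw
  std::vector<double> flat(M * N);  // row-major, draw by draw, as in the CSV

  std::size_t n = 3;  // some parameter index
  // theta[m, n] lives at flat[m * N + n], so successive draws of parameter n
  // are N entries apart -- strided, not contiguous.
  double first = flat[0 * N + n];
  double second = flat[1 * N + n];
  (void)first;
  (void)second;
  return 0;
}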

sakrejda commented 5 years ago

re: @bob-carpenter

If you have a 1,000 x 1,000 matrix, then you have 10^6 columns in the .csv file, but you only have to seek to find the offset in each row once before you read the entire matrix. So even though the Stan program might have 10^6 HMC parameters, it will only have one named parameter, and the major slowdown in reading is paid once per named parameter. Does that make sense?

bob-carpenter commented 5 years ago

I suggest you just try it at this point. The seek time's going to depend on the file system---it'll be much faster on SSDs and much slower on most shared, networked file systems.

The time may wind up being dominated by the ASCII to double conversions, which are something like 100 times slower than binary reads.

On Jun 2, 2019, at 1:36 PM, Krzysztof Sakrejda notifications@github.com wrote:

re: @bob-carpenter

If you have a 1,000 x 1,000 matrix, then you have 10^6 columns in the .csv file, but you only have to seek to find the offset in each row once before you read the entire matrix. So even though the Stan program might have 10^6 HMC parameters, it will only have one named parameter, and the major slowdown in reading is paid once per named parameter. Does that make sense?


sakrejda commented 5 years ago

I did try it; we discussed it extensively on Discourse, including benchmarks by multiple people with different approaches! The discussion is here: https://discourse.mc-stan.org/t/alternative-csv-reader/4494/19

I'm telling you what we found from experience. There's no fundamental problem with reading samples from .csv files fast, and there are no fundamental problems (other than loss of precision) with converting text to double. If you do it right, it is fast and efficient even on non-SSD disks with absurdly large .csv files (1 million columns is fine). Extracting individual columns works fine. Extracting sets of columns under one Stan parameter name works fine.

There was no interest in incorporating those approaches into rstan or cmdstan but that doesn't mean they don't work.

bob-carpenter commented 5 years ago

There was no interest in incorporating those approaches into rstan or cmdstan

I'm interested!

I think it was more that there was a lot of conflicting opinion on how to do it, combined with a lack of resolve for refactoring the current approach, which persists to this day despite nobody liking it (I'm talking about the CSV reader). @roualdes is working on cleaning up some of the stuff downstream from the reader and I think @mitzimorris may eventually get around to tackling the reader.

mitzimorris commented 5 years ago

I'm interested!

sakrejda commented 5 years ago

Cool, I replied by e-mail so we weren't hijacking this thread.

roualdes commented 5 years ago

Based on this thread's discussion, I've updated my proposed functional spec. You can find it as a branch on my fork of Stan.

This next iteration follows Michael's approach number 3, found in his comment on May 3 and beginning with

This raises one more possibility — the analysis functions, including the splitting, operate on a single parameter at a time.

The really nice thing about this idea, and some of Michael's follow-up points, is that splitting is minimally costly: it only involves copying pointers, not the draws themselves. The number of pointers copied is twice the number of chains.

One less-nice thing about my implementation is that I still can't see how to maintain the arguments std::vector<const double*> draws, std::vector<size_t> sizes and also allow the composition of compute_effective_sample_size and split_chains. Instead, I have two functions, compute_effective_sample_size and compute_split_effective_sample_size. The latter splits and then calls the first, so maintenance shouldn't be overly burdensome.
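Here's a sketch of what I mean, with illustrative signatures rather than the final analyze API (the real code may also handle odd chain lengths differently, e.g., by dropping the middle draw): splitting each chain in half only adds pointers and sizes, never copies draws, and the split version just delegates to the unsplit one.

#include <cstddef>
#include <vector>

// Assumed to exist, per the spec above.
double compute_effective_sample_size(const std::vector<const double*>& draws,
                                     const std::vector<std::size_t>& sizes);

double compute_split_effective_sample_size(const std::vector<const double*>& draws,
                                           const std::vector<std::size_t>& sizes) {
  std::vector<const double*> split_draws;
  std::vector<std::size_t> split_sizes;
  for (std::size_t i = 0; i < draws.size(); ++i) {
    std::size_t half = sizes[i] / 2;
    split_draws.push_back(draws[i]);          // first half of chain i
    split_sizes.push_back(half);
    split_draws.push_back(draws[i] + half);   // second half of chain i
    split_sizes.push_back(sizes[i] - half);
  }
  return compute_effective_sample_size(split_draws, split_sizes);
}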

What do you think?

betanalpha commented 5 years ago

One less-nice thing about my implementation is that I still can't see how to maintain the arguments std::vector<const double*> draws, std::vector<size_t> sizes and also allow the composition of compute_effective_sample_size and split_chains. Instead, I have two functions, compute_effective_sample_size and compute_split_effective_sample_size. The latter splits and then calls the first, so maintenance shouldn't be overly burdensome.

This is exactly what I had recommended way back when, so I'm definitely on board!

bob-carpenter commented 5 years ago

std::vector requires copies to split. If you use Eigen::Map or raw pointers under the hood, then we can get away with views and can avoid the copying involved in creating two new std::vector objects.
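For example, something like this (just a sketch with made-up sizes): wrap halves of an existing buffer with Eigen::Map so no new std::vector is allocated or copied.

#include <Eigen/Dense>
#include <vector>

int main() {
  std::vector<double> theta(1000);  // 1000 draws from one chain

  // Both maps alias theta's memory; no new buffers are allocated.
  Eigen::Map<const Eigen::VectorXd> first_half(theta.data(), 500);
  Eigen::Map<const Eigen::VectorXd> second_half(theta.data() + 500, 500);

  double m1 = first_half.mean();
  double m2 = second_half.mean();
  (void)m1;
  (void)m2;
  return 0;
}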

On Jun 13, 2019, at 12:43 PM, Michael Betancourt notifications@github.com wrote:

One less-nice thing about my implementation is that I still can't see how to maintain the arguments std::vector<const double*> draws, std::vector<size_t> sizes and also allow the composition of compute_effective_sample_size and split_chains. Instead, I have two functions, compute_effective_sample_size and compute_split_effective_sample_size. The latter splits and then calls the first, so maintenance shouldn't be overly burdensome.

This is exactly what I had recommended way back when, so I'm definitely on board!

roualdes commented 5 years ago

std::vector requires copies to split

Yes, but only copies of the pointers to the draws.

If you use Eigen::Map or raw pointers under the hood

If you're suggesting not to store a vector of raw pointers as I'm currently doing, then doesn't this imply that we only operate on one chain at a time?

I'm concerned that this would lead to a poor design.

betanalpha commented 5 years ago

Agreed! The vectors of pointers will be cheap to pass along and copy so we shouldn’t have to worry about any overhead.

On Jun 13, 2019, at 9:30 PM, Edward A. Roualdes notifications@github.com wrote:

std::vector requires copies to split

Yes, but only copies of the pointers to the draws.

If you use Eigen::Map or raw pointers under the hood

If you're suggesting not to store a vector of raw pointers as I'm currently doing, then doesn't this imply that we only operate on one chain at a time?

I'm concerned that this would lead to a poor design.


bob-carpenter commented 5 years ago

I'm probably misunderstanding the proposal. What I'd like to avoid is something like this:

std::vector<double> theta(1000);  // 1000 draws from a chain
std::vector<double> theta1(theta.begin(), theta.begin() + 500);  // copies the first 500 draws
std::vector<double> theta2(theta.begin() + 500, theta.end());    // copies the last 500 draws

as the latter two statements introduce copies. The copies aren't super expensive, but we want to reduce memory pressure as much as possible in code to improve caching behavior and reduce memory overhead (though the overhead's unlikely to be a problem if this is done one parameter at a time).

We could instead pass pairs of begin/end iterators to the R-hat algorithm (e.g., theta.begin() + 500 and theta.end()) to avoid copying. ESS doesn't break up the chains R-hat style, I don't think, though it does use the R-hat variance estimate.
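Something like this is what I have in mind (mean_of_range is a made-up stand-in, not a proposed Stan function):

#include <iterator>
#include <numeric>
#include <vector>

// Toy stand-in for a diagnostic: any pair of iterators works, no copies needed.
template <typename It>
double mean_of_range(It begin, It end) {
  return std::accumulate(begin, end, 0.0) / std::distance(begin, end);
}

int main() {
  std::vector<double> theta(1000, 1.0);  // 1000 draws from a chain

  // Pass iterator pairs for each half; no theta1/theta2 copies are made.
  double m1 = mean_of_range(theta.begin(), theta.begin() + 500);
  double m2 = mean_of_range(theta.begin() + 500, theta.end());
  (void)m1;
  (void)m2;
  return 0;
}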

It may not be worth the extra hassle to do this, but in the past, our performance on these analysis algorithms has not been great.