pierreroudier / clhs

A R implementation of the conditioned Latin Hypercube Sampling method
12 stars 9 forks source link

C++ version of clhs #15

Closed kdaust closed 3 years ago

kdaust commented 4 years ago

This is an adaptation the clhs package to include use of the main function written in C++; it is significantly faster than the R version (~ 150 times). I've run lots of tests, but it's quite likely there are some errors. I've been able to build the package successfully on Ubuntu and Windows, but haven't tried a mac.

pierreroudier commented 4 years ago

I got @dylanbeaudette to check the PR on Mac. He is getting the following error:

clang: error: unsupported option '-fopenmp'

Same issue as this one?

kdaust commented 4 years ago

Hmm. Yes that looks like the same issue. Mac OS is generally the most difficult to compile on, and from what I understand it usually is due to out of date Xcode or gfortran distributions. Apparently there are also some issues with RcppArmadillo (which this uses) and OS Catalina. I found a mac here - I'm going to test it on there tonight and see what crops up. Thanks @dylanbeaudette !

pierreroudier commented 3 years ago

@kdaust Thanks for your work Kiri (and HNY)!

Did you have a chance to test on Mac? One possible way to do so, if you don't own a Mac (I don't!), is maybe using the rhub package.

Let me know if you want some help with that.

kdaust commented 3 years ago

Happy New Years! My apologies for the delay here. I had attempted to find a mac to try it on (I don't own one) and I managed to borrow one, but it was too old to update to the latest OS and get the necessary XCode tools. I didn't know about rhub - thanks for the info, I'll try that! It looks like the check_for_cran would be the necessary function.

On a separate note, I added a new parameter to the C++ version called possible.samples, which lets you specify the indices from which sampling is allowed, but still tries to match the distribution of the entire dataset. I originally added it because we were designing some transect sampling plans, and although we could only sample from the transects, wanted to match the entire distribution as much as possible. I realise that because it is restricting the sampling, it won't be able to create a proper hypercube. But for our purposes, it seemed to do a reasonably good job. Let me know if you think this is a worthwhile option to keep in - if not, I can remove it and and just keep a personal version. If you do think we should keep it, we may need to include it in the R version too, otherwise it will fail with use.cpp = F if possible.samples is specified.

I will have a look at rhub and see if I can figure that out. Thanks!

pierreroudier commented 3 years ago

@kdaust You seem to have deleted the vignette file (intro-clhs.Rmd) in your last commit, which I assume is a mistake (?)

kdaust commented 3 years ago

@pierreroudier I was having issues with the vignette on rhub so I removed it temporarily. I've been working with RHub this afternoon and then I'll add it back in. Sorry about that!

pierreroudier commented 3 years ago

@kdaust Also interesting to see your possible.samples option, it sounds like a useful addition, and follows a similar idea than the include option (the latter allows to force some samples to be included -- eg sites that have already been collected).

If we end up including it we might try and "harmonise" the way we name those options -- eg possible.samples and included.samples, or something like that. Happy to hear suggestions if you have some!

kdaust commented 3 years ago

@pierreroudier yes indeed! I think harmonising labels would be excellent. I just found a bug in the C++ part that sometimes comes up with the include parameter, and I'm working on that currently. I'll push it and add the vignette soon hopefully!

kdaust commented 3 years ago

Hi all! Sorry for the delay! Midterm season has been upon me. I've done a bunch of updates - I found and fixed a bug in the C++ code which was making the cost-constrained version converge slower than it should have. There were a few things in the vignette I needed to update to get it to compile and I fixed a couple tests that were failing. I just ran the check_for_cran() from rhub, and it was able to compile and load on all cran OSs - which includes macOS of course. I wonder if @dylanbeaudette could try building it again? It's possible that there are XCode dependencies which the cran system would have but a personal mac may not by default. Let me know how things go!

dylanbeaudette commented 3 years ago

Works for me.

A few minor warnings is all.

In file included from CppLHS.cpp:13:
In file included from /Users/sgtpepp/Library/R/4.0/library/RcppArmadillo/include/RcppArmadillo.h:34:
In file included from /Users/sgtpepp/Library/R/4.0/library/Rcpp/include/Rcpp.h:57:
/Users/sgtpepp/Library/R/4.0/library/Rcpp/include/Rcpp/DataFrame.h:136:18: warning: unused variable 'data' [-Wunused-variable]
            SEXP data = Parent::get__();
                 ^
CppLHS.cpp:412:22: warning: '&&' within '||' [-Wlogical-op-parentheses]
    if(delta_obj > 0 && runif(1,0,1)[0] >= metropolis || runif(1,0,1)[0] >= metropolis_cost){
       ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~
CppLHS.cpp:412:22: note: place parentheses around the '&&' expression to silence this warning
    if(delta_obj > 0 && runif(1,0,1)[0] >= metropolis || runif(1,0,1)[0] >= metropolis_cost){
                     ^
kdaust commented 3 years ago

Yay! Thanks @dylanbeaudette!! I think I can fix those this evening...

dylanbeaudette commented 3 years ago

Sweet! I'm sorry that it took me so very long to test this. Your additions / improvements to clhs will go straight into production work at the USDA-NRCS!

kdaust commented 3 years ago

Wow that's super cool! I've fixed all warnings but one - the first one you have, the unused data variable, is actually in the Rcpp code, not the code I wrote. Oddly enough, that error doesn't come up on my machine. I believe macOS uses gcc to compile C code, instead of g++ on windows; that may explain why. It's also possible that if you're using an older version of Rcpp it might have changed slightly.

@pierreroudier - one thought I had re: harmonising parameter names, is that I could change the possible.sample to be the opposite (i.e. specify row numbers for places that can't be sampled instead of places that can be sampled), and then we could call it exclude, which would match include. Hard to know which is easier I suppose - in our case, we had the indices for the points that could be sampled, which is why I wrote it that way. But it would be easy enough to reverse. Thoughts?

pierreroudier commented 3 years ago

@kdaust I think we are ready to merge the PR as a new major clhs release. Hopefully I will have time to prepare it next week (sorry, I have been a little busy this last month!).

However, before we do that, I have received this PR which corrects a significant bug, do you think you could patch your PR accordingly?

Thanks again!

kdaust commented 3 years ago

@pierreroudier Awesome! That's great :)

And yes, absolutely! I thought there was something odd about the logic there, but never gave it a huge amount of thought. I'll try and get to it on Monday.

Thanks!

pierreroudier commented 3 years ago

@kdaust I have been a bit slack merging this PR, seems like it created conflicts in yours. Sorry!

kdaust commented 3 years ago

Just fixed the logic in the C++ version. It's a slightly different implementation than Clifford's since the C version is structured a bit differently, but the idea is the same. No problem - I'll see if I can fix the conflicts

kdaust commented 3 years ago

Hang on, I think I broke something...just incase you were going to merge, don't yet please

pierreroudier commented 3 years ago

All good @kdaust.

My plan is:

  1. release a new minor version that includes @david-clifford's PR
  2. have a thorough check, sort out the possibly new parameters names, and then merge your PR and release a new major version of the package
kdaust commented 3 years ago

@pierreroudier Sounds good. I think I potentially found an error with @david-clifford's update - if I try clhs(mtcars,size=5,use.cpp=F), I get an error: Error in cut.default(x[[1]], breaks = x[[2]], labels = FALSE, include.lowest = TRUE) : 'breaks' are not unique. My guess is it's only an issue when there aren't many samples.

david-clifford commented 3 years ago

Hey Kiri - I believe your guess is correct. The mtcars dataset has a few very coarse numeric fields like cyl, vs, am, and gear which each have fewer than 5 unique values. Fixing this might require automatically detecting when this might occur and updating the eta value accordingly on a per-column basis. In terms of pragmatic solutions for this edge case, using jitter to nudge these numeric values to distinct values will overcome the issue but wouldn't be repeatable. Alternatively - what about treating such fields as factors internally so we get the right proportion of each entry in the sample. This would ignore the correlation structure though. Either way, it might worth flagging as a warning with the user.

kdaust commented 3 years ago

Hi David - I like your idea about converting to a factor. In practice, I doubt that there are many cases where a continuous variables will have fewer values than the requested number of samples, although I suppose it is a possibility if the user wants a large proportion of the population sampled. It would be fairly simple to check each variable and convert to a factor if it has too few values. In the C++ version, the algorithm I'm using for ordering the bins basically skip bins that have no width, so while it might not create a great sample for that variables, it doesn't crash. If we end up deciding that factors are the way to go, I'd be happy to add that to the C++ part so it's consistent.

pierreroudier commented 3 years ago

@kdaust @david-clifford I think in this case we could issue an error to the user, saying that the number of unique values in column X in less than the number of samples that has been asked for? And add a suggestion about converting to a factor?

This would be the first check donw on input data basically

pierreroudier commented 3 years ago

@david-clifford @kdaust Hi both, and sorry for the silence. I finally managed to get a bit of time on the package. Here's the proposed plan:

  1. Fix the package and implement @david-clifford's suggestions (that were in his two PRs). See discussion in https://github.com/pierreroudier/clhs/pull/20#issuecomment-832309983
  2. Once we have a working R version of this, I will prepare a CRAN release
  3. This version will be the basis of the C++ version, then I will merge this PR
  4. Finally, we will be able to prepare the first release of the much improved, C++ version!

Any thoughts let me know,

Pierre

david-clifford commented 3 years ago

Thanks for the update Pierre. Looks good to me. Happy to dive more deeply in if that will help you at some point.

-d

On Tue, May 4, 2021 at 4:31 PM Pierre Roudier @.***> wrote:

@david-clifford https://github.com/david-clifford @kdaust https://github.com/kdaust Hi both, and sorry for the silence. I finally managed to get a bit of time on the package. Here's the proposed plan:

  1. Fix the package and implement @david-clifford https://github.com/david-clifford's suggestions (that were in his two PRs). See discussion in #20 (comment) https://github.com/pierreroudier/clhs/pull/20#issuecomment-832309983
  2. Once we have a working R version of this, I will prepare a CRAn release
  3. This version will be the basis of the C++ version, then I will merge this PR
  4. Finally, we will be able to prepare the first release of the much improved, C++ version!

Any thoughts let me know,

Pierre

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pierreroudier/clhs/pull/15#issuecomment-832318858, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZY3AU2GTMO2ADTQKW4KDTMB7WZANCNFSM4R5DSLHA .

kdaust commented 3 years ago

Thanks Pierre! Sounds good to me. I believe I have already implemented David's fix in the C++ version but when we get to that part I'll double check that everything is lining up properly.

On Tue, May 4, 2021 at 4:31 PM Pierre Roudier @.***> wrote:

@david-clifford https://github.com/david-clifford @kdaust https://github.com/kdaust Hi both, and sorry for the silence. I finally managed to get a bit of time on the package. Here's the proposed plan:

  1. Fix the package and implement @david-clifford https://github.com/david-clifford's suggestions (that were in his two PRs). See discussion in #20 (comment) https://github.com/pierreroudier/clhs/pull/20#issuecomment-832309983
  2. Once we have a working R version of this, I will prepare a CRAn release
  3. This version will be the basis of the C++ version, then I will merge this PR
  4. Finally, we will be able to prepare the first release of the much improved, C++ version!

Any thoughts let me know,

Pierre

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pierreroudier/clhs/pull/15#issuecomment-832318858, or unsubscribe https://github.com/notifications/unsubscribe-auth/AID6TTWGVGX2JCQWWPO2TSDTMB7WZANCNFSM4R5DSLHA .

pierreroudier commented 3 years ago

@david-clifford Please see the latest version committed today, if you have a bit of time to have a look at it: it is ready to submit to CRAN.

david-clifford commented 3 years ago

I got the following when I tried to install the package the second time. The first time I ran it I was asked to update a bunch of libraries and they all installed fine. Not sure how to proceed.

devtools::install_github("pierreroudier/clhs") Downloading GitHub repo pierreroudier/clhs@HEAD ✓ checking for file ‘/tmp/Rtmpd3D5ax/remotes35a4a516e5ca1/pierreroudier-clhs-8c28665/DESCRIPTION’ ... ─ preparing ‘clhs’: ✓ checking DESCRIPTION meta-information ... ─ checking for LF line-endings in source and make files and shell scripts ─ checking for empty or unneeded directories ─ building ‘clhs_0.8-0.tar.gz’ Warning: invalid uid value replaced by that for user 'nobody' Warning: invalid gid value replaced by that for user 'nobody'

Installing package into ‘/usr/___/4.0’ (as ‘lib’ is unspecified)

On Wed, May 5, 2021 at 9:38 PM Pierre Roudier @.***> wrote:

@david-clifford https://github.com/david-clifford Please see the latest version committed today, if you have a bit of time to have a look at it: it is ready to submit to CRAN.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pierreroudier/clhs/pull/15#issuecomment-833217958, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZY3D5GOV43ZUBRKQNC63TMIMNFANCNFSM4R5DSLHA .

david-clifford commented 3 years ago

Ignore previous email please. I reinstalled stringi and this issue disappeared. I was able to install without problems and I ran a few examples without any trouble.

On Thu, May 6, 2021 at 10:54 AM David Clifford @.***> wrote:

I got the following when I tried to install the package the second time. The first time I ran it I was asked to update a bunch of libraries and they all installed fine. Not sure how to proceed.

devtools::install_github("pierreroudier/clhs") Downloading GitHub repo pierreroudier/clhs@HEAD ✓ checking for file ‘/tmp/Rtmpd3D5ax/remotes35a4a516e5ca1/pierreroudier-clhs-8c28665/DESCRIPTION’ ... ─ preparing ‘clhs’: ✓ checking DESCRIPTION meta-information ... ─ checking for LF line-endings in source and make files and shell scripts ─ checking for empty or unneeded directories ─ building ‘clhs_0.8-0.tar.gz’ Warning: invalid uid value replaced by that for user 'nobody' Warning: invalid gid value replaced by that for user 'nobody'

Installing package into ‘/usr/___/4.0’ (as ‘lib’ is unspecified)

  • installing source package ‘clhs’ ... using staged installation R inst byte-compile and prepare package for lazy loading Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object '/usr/___/4.0/stringi/libs/stringi.so': libicui18n.so.63: cannot open shared object file: No such file or directory Calls: ... namespaceImport -> loadNamespace -> library.dynam -> dyn.load Execution halted ERROR: lazy loading failed for package ‘clhs’
  • removing ‘/usr/l___/4.0/clhs’
  • restoring previous ‘/usr/___/4.0/clhs’ Error: Failed to install 'clhs' from GitHub: (converted from warning) installation of package ‘/tmp/Rtmpd3D5ax/file35a4a40ec0b42/clhs_0.8-0.tar.gz’ had non-zero exit status

On Wed, May 5, 2021 at 9:38 PM Pierre Roudier @.***> wrote:

@david-clifford https://github.com/david-clifford Please see the latest version committed today, if you have a bit of time to have a look at it: it is ready to submit to CRAN.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pierreroudier/clhs/pull/15#issuecomment-833217958, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZY3D5GOV43ZUBRKQNC63TMIMNFANCNFSM4R5DSLHA .

pierreroudier commented 3 years ago

Ignore previous email please. I reinstalled stringi and this issue disappeared. I was able to install without problems and I ran a few examples without any trouble.

Thanks David! I'll go ahead and tag the version 0.8-0, and prepare for CRAN release if you're happy with that.

pierreroudier commented 3 years ago

Version 0.8-1 is up on CRAN!

Note some significant news, such as the support for sf objects.

pierreroudier commented 3 years ago

@kdaust Do you think we can start resolving the conflicts below, and try the integration of your C++ code in the next release?

kdaust commented 3 years ago

@pierreroudier that sounds great! I'll try and get to them this weekend...

pierreroudier commented 3 years ago

@kdaust Perfect! Let me know if I can help.

kdaust commented 3 years ago

@pierreroudier one thing we should think about is the possible.sample parameter I added in the C++ version - whether we want to keep that in, and if so whether we should implement it in the R version and if we want to give it a different name (e.g. could be the opposite - exclude - which would match the include param)

kdaust commented 3 years ago

@pierreroudier I was just trying to resolve the conflicts (on the github editor) but I then wasn't able to commit the resolutions. I'm fairly unexperienced with github - do you know what the correct way to do this is? Thanks!

pierreroudier commented 3 years ago

@kdaust

one thing we should think about is the possible.sample parameter I added in the C++ version - whether we want to keep that in, and if so whether we should implement it in the R version and if we want to give it a different name (e.g. could be the opposite - exclude - which would match the include param)

I think the best would be to use include, which is currently identifies which samples MUST be retained (eg if they already have been sampled from a previous survey).

I was just trying to resolve the conflicts (on the github editor) but I then wasn't able to commit the resolutions. I'm fairly unexperienced with github - do you know what the correct way to do this is? Thanks!

Yes, it is a bit of a struggle isn't it!

I would recommend fetching the upstream changes (which are on my repo) first, by using the fetch button.

Then whatever merges that could not be done will need to be done and cleaned manually -- this is something I do in RStudio. See here in particular bullet (4): look for the conflict markers, and pick manually which version needs to be retained.

Hope this helps!

pierreroudier commented 3 years ago

@kdaust Actually I went ahead and did the merge conflicts myself. I didn't realise the online editor was so well done!

kdaust commented 3 years ago

@pierreroudier

I think the best would be to use include, which is currently identifies which samples MUST be retained (eg if they already have been sampled from a previous survey).

Right, but the possible.samples is slightly different - I implemented it because we were designing a sampling plan where we could only sample along transects, but wanted to match the overall variable space as closely as possible. So we only wanted the clhs to pick samples along the transect, but wanted it to use the full variable space in the optimisation. By specifying the possible.sample as the raster cells along the transect, it only picks from those, but the objective function is calculated on the full dataset. I realise that this is a slightly more unusual sampling strategy so if you don't think it has a place in the package that's totally fine.

Thanks for fixing the merge conflicts! Yes the online editor is excellent. I believe I need to redo the C++ test-that checks because there have been some changes - I'll try and get to that later today.

pierreroudier commented 3 years ago

Right, but the possible.samples is slightly different - I implemented it because we were designing a sampling plan where we could only sample along transects, but wanted to match the overall variable space as closely as possible. So we only wanted the clhs to pick samples along the transect, but wanted it to use the full variable space in the optimisation. By specifying the possible.sample as the raster cells along the transect, it only picks from those, but the objective function is calculated on the full dataset. I realise that this is a slightly more unusual sampling strategy so if you don't think it has a place in the package that's totally fine.

Thanks for that, I was not completely on top of what this was doing.

I suppose we need to make the distinction between include and possible.samples very clear. We could refactor those into can.include and must.include maybe? What do you think?

kdaust commented 3 years ago

@pierreroudier I think that's a great idea. I can update the C version to use those names. must.include wouldn't be implemented in the R version but I suppose that's ok - could just throw a warning

kdaust commented 3 years ago

@pierreroudier I've updated the parameter names and fixed the test errors. Let me know what the next step should be :)

kdaust commented 3 years ago

@pierreroudier I've fixed the documentation mismatches. Just ran R CMD check and it succeeded on my machine.

pierreroudier commented 3 years ago

@pierreroudier I've fixed the documentation mismatches. Just ran R CMD check and it succeeded on my machine.

Legend, thank you. I have approved the commits to run the checks.

kdaust commented 3 years ago

@pierreroudier the macOS fail seems to be due to ERROR: dependency ‘sf’ is not available for package ‘clhs’ - do you know why that would be?

pierreroudier commented 3 years ago

@pierreroudier the macOS fail seems to be due to ERROR: dependency ‘sf’ is not available for package ‘clhs’ - do you know why that would be?

Ive seen that, it's really weird. I think it fails to install sf because gdal-config can't be found.

I have changed the Mac test configuration in my latest commit, you could try that.

kdaust commented 3 years ago

@pierreroudier I've added the section from your commit - hopefully that does it! FYI, I just ran the macOS build using the rhub package and is was successful, although it uses highsierra, not big sur. It seems that sf might have some issues with big sur so maybe that's the cause of the issue.

kdaust commented 3 years ago

@pierreroudier let me know if there's more I should do!

pierreroudier commented 3 years ago

@pierreroudier let me know if there's more I should do!

Thanks!

Could you just confirm one thing: if the use.ccp is turned to FALSE, the code then uses the (unaltered but painfully slow) R implementation, right?

I'm mainly asking as I want to write a "check script" to make sure the R and C++ version yield similar results.

kdaust commented 3 years ago

Yes that is correct! One thing to note is that since C++ uses a different random number generator, and uses the randomness in slightly different ways, it's not possible to get it to produce exactly the same result as the R version. However, it should reach similar values of the objective function at the end, so that's what I was mostly using for testing