kgoldfeld / simstudy

simstudy: Illuminating research methods through data generation
https://kgoldfeld.github.io/simstudy/
GNU General Public License v3.0
81 stars 7 forks source link

Create a rejection-sampling function to sample from non-standard distributions #88

Closed kgoldfeld closed 5 months ago

kgoldfeld commented 4 years ago

Maybe nice to have an internal function that provides the ability to generate data from a density function specified as a function.

assignUser commented 3 years ago

75 #71

kgoldfeld commented 6 months ago

Here is some code that will allow users to specify the shape of a function (by inputting data), and then sampling from the density estimated from those data:

data <- c(1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10)

# Generate the density estimate
density_estimate <- density(data, n = 10000, from = min(data), to = max(data))
plot(density_estimate, main="Customized Kernel Density Estimation", xlab="Data", ylab="Density")

# Create a function to sample from the density estimate
sample_from_density <- function(density_estimate, n) {
  x <- density_estimate$x
  y <- density_estimate$y

  # Normalize the density values to create a probability distribution
  probabilities <- y / sum(y)

  # Sample from the x values according to the probabilities
  sample(x, size=n, replace=TRUE, prob=probabilities)
}

# Number of samples you want to draw
n_samples <- 100000

# Sample from the density estimate
samples <- sample_from_density(density_estimate, n_samples)

# Plot the histogram of the samples and the density estimate
hist(samples, breaks=60, probability=TRUE, main="Sampled Data and Density Estimate", xlab="Value", col="lightblue")
lines(density_estimate, col="red", lwd=2)
kgoldfeld commented 6 months ago

I've created genDataDensity and addDataDensity so that it is now possible to generate data from an arbitrary distribution specified by a vector of integers:

base_data <- c(1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10)

dd <- genDataDensity(10000, dataDist = base_data, varname = "x1")

Plotting the generated data with the original density:

de <- density(data, n = 100000)
de <- data.table(x = de$x, y = de$y)

ggplot(data = dd, aes(x=x1)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 1, fill = "lightblue", color = "black") +
  geom_line(data = de, aes(x = x, y = y), color = "red") +
  theme(panel.grid = element_blank()) +
  scale_y_continuous(name = "density\n")