statistikat / simPop

Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information

Feature/xgboost #16

Closed sironimo closed 3 years ago

sironimo commented 3 years ago

As part of my master's thesis, I've added the XGBoost algorithm as a new method to simContinuous and simCategorical. XGBoost is a modern and scalable gradient boosting implementation (XGBoost Documentation).

alexkowa commented 3 years ago

Hi @matthias-da, @JohannesGuss and I are working with simPop right now and have also made changes to simCategorical. Our project will be finalised by the end of August, so I would propose that Johannes and I merge the xgboost pull request around mid-September?

Alex

matthias-da commented 3 years ago

That would be ideal, thank you!

We intend to submit the extension with xgboost and ANNs (the ANN part is not completed yet) to the uRos conference and thus to the R Journal. Depending on your achievements, this could also be a joint paper: a more general contribution covering your improvements plus our extensions on xgboost and ANNs. What do you think?

I also propose some other changes, although the first one could be too much work and thus out of scope.

1) Siro has already implemented this roughly: the order in which variables are simulated has a huge impact on the results. This impact could be reduced significantly, and the results improved, by running a second loop that incorporates the information already simulated in the first run. In other words, each part might be repeated once, using the values simulated in the first run. It's a bit tricky, though; I will discuss this with Johannes today, since I'm meeting him online anyway, and we'll see where the discussion goes.

2) simPop has strong limitations when simulating data based on population data (i.e. data that, in general, fall outside complex survey methodology). Simulating from a population is mostly not straightforward and sometimes even results in errors. From my point of view, this is a much simpler case than a sample drawn under a complex survey design. This could be tested and partly enhanced so that the package also works for populations and for samples without sampling weights.
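
The second-loop idea in point 1) can be illustrated with a small sketch. This is not simPop code and not the implementation in this pull request: it is written in Python, it uses plain empirical frequency tables as a stand-in for the fitted models, and the names `fit_conditional`, `draw`, and `simulate_two_pass`, as well as the toy data, are hypothetical. In the first pass each variable only sees the variables simulated before it; in the second pass each variable is re-simulated conditional on all other variables, which is one possible way to reduce the dependence on the ordering.

```python
import random
from collections import Counter, defaultdict


def fit_conditional(rows, target, predictors):
    """Empirical counts of `target` for each observed combination of `predictors`."""
    table = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in predictors)
        table[key][row[target]] += 1
    return table


def draw(table, marginal, key, rng):
    """Draw one value; fall back to the marginal if the combination was never observed."""
    counts = table.get(key) or marginal
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]


def simulate_two_pass(sample, order, n, seed=0):
    """Hypothetical two-pass sequential simulation of categorical variables."""
    rng = random.Random(seed)
    marginals = {v: Counter(row[v] for row in sample) for v in order}
    synthetic = [{} for _ in range(n)]

    # Pass 1: each variable is conditioned only on the variables simulated before it.
    for i, var in enumerate(order):
        predictors = order[:i]
        table = fit_conditional(sample, var, predictors)
        for person in synthetic:
            key = tuple(person[p] for p in predictors)
            person[var] = draw(table, marginals[var], key, rng)

    # Pass 2: each variable is re-simulated given ALL other variables,
    # using the values that are currently in the synthetic data.
    for var in order:
        predictors = [v for v in order if v != var]
        table = fit_conditional(sample, var, predictors)
        for person in synthetic:
            key = tuple(person[p] for p in predictors)
            person[var] = draw(table, marginals[var], key, rng)

    return synthetic


# Toy survey sample (made-up values, unweighted for simplicity).
sample = [
    {"region": "north", "sex": "f", "employed": "yes"},
    {"region": "north", "sex": "m", "employed": "no"},
    {"region": "south", "sex": "f", "employed": "yes"},
    {"region": "south", "sex": "m", "employed": "yes"},
    {"region": "south", "sex": "f", "employed": "no"},
]
order = ["region", "sex", "employed"]
population = simulate_two_pass(sample, order, n=8)
```

The sketch ignores sampling weights and household structure, both of which matter in practice; it is only meant to show where the second loop takes its information from.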

I will formulate two issues.

alexkowa commented 3 years ago

1) Sounds interesting but quite challenging. 2) We actually had a use case with a population and it worked fine, but things can probably be improved.

But generally, yes, submit to uRos; we can then see whether we extend your article for the R Journal afterwards with our contribution.