openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[PRE REVIEW]: A Parametric Method for Generating Synthetic Data #2037

Closed: whedon closed this issue 4 years ago

whedon commented 4 years ago

Submitting author: @SidharthMacherla (Sidharth Macherla)
Repository: https://github.com/SidharthMacherla/conjurer
Version: v1.0.0
Editor: Pending
Reviewer: Pending

Author instructions

Thanks for submitting your paper to JOSS @SidharthMacherla. Currently, there isn't a JOSS editor assigned to your paper.

@SidharthMacherla, if you have any suggestions for potential reviewers, please mention them here in this thread (without tagging them with an @). In addition, the people on this list have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you, type:

@whedon commands

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.03 s (507.4 files/s, 25972.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
R                                9             38            122            194
Markdown                         4             37              0            138
TeX                              2             13              0            123
Rmd                              1             32             95             27
-------------------------------------------------------------------------------
SUM:                            16            120            217            482
-------------------------------------------------------------------------------

Statistical information for the repository '2037' was gathered on 2020/01/22.
No commited files with the specified extensions were found.

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.2307/1884282 is OK
- 10.1177/1847979018808673 is OK
- 10.18637/jss.v074.i11 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

👉 Check article proof 📄 👈

labarba commented 4 years ago

👋 @SidharthMacherla — thanks for your submission to JOSS. From a quick inspection of this submission it's not entirely obvious that it meets our submission criteria. In particular, this item:

  • Your software should have an obvious research application

As described, it looks like the target applications are in data science in commercial settings. Can you elaborate on how this would be used in research settings, and how researchers might cite it?

SidharthMacherla commented 4 years ago

Hi @labarba, this paper details a methodology for generating synthetic data. Synthetic data generation has applications in research settings. To elaborate, it is used to test new algorithms proposed by researchers. The following are some examples that could be cited. I will be happy to help with any further clarifications.

  1. The following publication (page 230) describes how synthetic time series data is used as a validation data set.

     @Inbook{Lachtermacher1994,
       author = "Lachtermacher, Gerson and Fuller, J. David",
       editor = "Hipel, Keith W. and McLeod, A. Ian and Panu, U. S. and Singh, Vijay P.",
       title = "Backpropagation in Hydrological Time Series Forecasting",
       bookTitle = "Stochastic and Statistical Methods in Hydrology and Environmental Engineering: Time Series Analysis in Hydrology and Environmental Engineering",
       year = "1994",
       publisher = "Springer Netherlands",
       address = "Dordrecht",
       pages = "229--242",
       abstract = "One of the major constraints on the use of backpropagation neural networks as a practical forecasting tool, is the number of training patterns needed. We propose a methodology that reduces the data requirements. The general idea is to use the Box-Jenkins models in an exploratory phase to identify the ``lag components'' of the series, to determine a compact network structure with one input unit for each lag, and then apply the validation procedure. This process minimizes the size of the network and consequently the data required to train the network. The results obtained in four studies show the potential of the new methodology as an alternative to the traditional time series models.",
       isbn = "978-94-017-3083-9",
       doi = "10.1007/978-94-017-3083-9_18",
       url = "https://doi.org/10.1007/978-94-017-3083-9_18"
     }

  2. This paper describes the need for synthetic data in benchmarking pattern recognition and data mining methods.

     @article{FRASCH20111523,
       title = "A Bayes-true data generator for evaluation of supervised and unsupervised learning methods",
       journal = "Pattern Recognition Letters",
       volume = "32",
       number = "11",
       pages = "1523 - 1531",
       year = "2011",
       issn = "0167-8655",
       doi = "https://doi.org/10.1016/j.patrec.2011.04.010",
       url = "http://www.sciencedirect.com/science/article/pii/S0167865511001103",
       author = "Janick V. Frasch and Aleksander Lodwich and Faisal Shafait and Thomas M. Breuel",
       keywords = "Synthetic data generation, Benchmarking, Experimental proofs",
       abstract = "Benchmarking pattern recognition, machine learning and data mining methods commonly relies on real-world data sets. However, there are some disadvantages in using real-world data. On one hand collecting real-world data can become difficult or impossible for various reasons, on the other hand real-world variables are hard to control, even in the problem domain; in the feature domain, where most statistical learning methods operate, exercising control is even more difficult and hence rarely attempted. This is at odds with the scientific experimentation guidelines mandating the use of as directly controllable and as directly observable variables as possible. Because of this, synthetic data possesses certain advantages over real-world data sets. In this paper we propose a method that produces synthetic data with guaranteed global and class-specific statistical properties. This method is based on overlapping class densities placed on the corners of a regular k-simplex. This generator can be used for algorithm testing and fair performance evaluation of statistical learning methods. Because of the strong properties of this generator researchers can reproduce each others experiments by knowing the parameters used, instead of transmitting large data sets."
     }

  3. The following paper describes how synthetic data generation was needed to isolate cyclical, trend and noise components. Of these, the current paper uses the cyclical and trend components.

     @article{doi:10.1287/mnsc.13.4.B202,
       author = {Kirby, Robert M.},
       title = {A Comparison of Short and Medium Range Statistical Forecasting Methods},
       journal = {Management Science},
       volume = {13},
       number = {4},
       pages = {B-202-B-210},
       year = {1966},
       doi = {10.1287/mnsc.13.4.B202},
       URL = {https://doi.org/10.1287/mnsc.13.4.B202},
       eprint = {https://doi.org/10.1287/mnsc.13.4.B202},
       abstract = {Exponential Smoothing, Moving Average, and Least Squares forecasting models were tested by simulating their operation on seven years of actual data for various sewing machine product groups. The relative accuracy of the forecasts varied according to the length of the period being forecasted and the characteristics of the data. Tests were also conducted on synthetic series designed to isolate the cyclical, trend and noise components. For the series tested, the Exponential Smoothing and Moving Average methods were about equal in overall performance for intermediate range forecasts (next six months' demand). For the short range (next month's demand), the Exponential Smoothing gave slightly better over-all results. The difference in relative performance between the Exponential Smoothing and Moving Average methods for intermediate versus short range forecasts appears to be due to a subcomponent identified as “caused noise.”}
     }

labarba commented 4 years ago

@SidharthMacherla — After a pre-review of your submission, and an assessment by the editorial board at large, we find that it does not meet our submission requirements.

In particular, while the software seems useful, it falls under the 'Minor utility' category, and we therefore decided not to put it through review.

Note the JOSS general eligibility criterion: “The software should be a significant contribution to the available open source software that either enables some new research challenges to be addressed or makes addressing research challenges significantly better”

Thank you for considering JOSS as a venue for your software, and I hope you'll submit other software with research application in the future.

SidharthMacherla commented 4 years ago

Thank you for your time and feedback. I believe I can rework my paper to articulate how the package has wider application in research. I will come back with a fresh submission when I am ready. Thank you.