Midterm report feedback

The report is very well written. Below are some of my thoughts on possible improvements:

I like the clear descriptions of each predictor variable. It would also help to report the range and other basic statistics (min, max, average, percentiles, etc.) of the continuous variables, and to show some example categories for the categorical variables, so that we can get a better sense of what the data looks like. It would also be helpful to tell us the total number of rows, so that we know how big the data set is.
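To make the summary-statistics suggestion concrete, here is a minimal sketch using only the Python standard library. The column names (`yield_bu`, `family`) and the values are made up for illustration and are not taken from your data set; the same kind of table could be produced with whatever tooling you already use.

```python
import statistics
from collections import Counter


def summarize_continuous(values):
    """Return basic statistics for a list of numbers (needs at least 2 values)."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def summarize_categorical(values, top=3):
    """Return the `top` most frequent categories with their counts."""
    return Counter(values).most_common(top)


# Hypothetical rows: the field names and numbers are assumptions for illustration.
rows = [
    {"yield_bu": 52.1, "family": "FAM01"},
    {"yield_bu": 48.7, "family": "FAM02"},
    {"yield_bu": 55.3, "family": "FAM01"},
    {"yield_bu": 50.9, "family": "FAM03"},
]

print("number of rows:", len(rows))
print(summarize_continuous([r["yield_bu"] for r in rows]))
print(summarize_categorical([r["family"] for r in rows]))
```

A small table like this for each predictor, plus the row count, would already answer most of my questions about what the data looks like.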

Since the whole experiment and commercialization process is fairly complicated, it would be nice to include something like a flowchart to visualize it.

For the cross-validation sets, why exactly did you choose to create the folds this way? Shouldn't the "test set" be called a "validation set"? In standard k-fold cross-validation, we iteratively hold out one of the k folds as the validation set and train on the remaining k-1 folds, so that we can tune the model and its parameters. According to the table, however, you train the model on previous years' data and validate it on future data.
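To illustrate the difference I have in mind, here is a rough sketch of the two splitting schemes. The second function is only my reading of the table in the report (train on earlier years, validate on a later one), so treat it as an assumption; the function names and the years are hypothetical.

```python
def k_fold_splits(n_rows, k):
    """Standard k-fold: every fold serves as the validation set exactly once."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # interleaved folds, no shuffling
    for i in range(k):
        validation = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


def forward_chaining_splits(years):
    """Time-ordered splits: train on all earlier years, validate on the next one."""
    ordered = sorted(set(years))
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical usage with made-up sizes and years.
for train, validation in k_fold_splits(n_rows=6, k=3):
    print("k-fold   train:", train, "validation:", validation)

for train_years, validation_year in forward_chaining_splits([2010, 2011, 2012, 2013]):
    print("forward  train:", train_years, "validation:", validation_year)
```

In the first scheme every row is used for validation once, regardless of time. In the second, validation data always comes after the training data, which may well be the right choice here if the goal is to predict future years, but it would be good to say so explicitly in the report.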

You also mention in the report that you would like to use classification to predict whether a variety will be commercialized, and then regression to estimate the sales volume. I think that's a good idea. Since there are not too many predictor variables, you can try out all kinds of feature transformations and create new features from what you have without worrying too much about overfitting (if n >> d). You could probably also merge your data with data from other sources. For example, I think we can find something like the average temperature for each year at each location and incorporate it into the original dataset.
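As a rough sketch of the merging idea, the snippet below joins trial rows with an external table keyed on year and location. All field names (`year`, `location`, `avg_temp`) and values are hypothetical; I don't know what keys your data actually has or which weather source you would use.

```python
def merge_weather(trials, weather):
    """Left-join weather onto trial rows by (year, location).

    Rows without a matching weather record get avg_temp = None.
    """
    lookup = {(w["year"], w["location"]): w["avg_temp"] for w in weather}
    merged = []
    for row in trials:
        key = (row["year"], row["location"])
        merged.append({**row, "avg_temp": lookup.get(key)})
    return merged


# Hypothetical data: names and numbers are made up for illustration.
trials = [
    {"variety": "V1", "year": 2012, "location": "LOC_A", "yield_bu": 51.0},
    {"variety": "V2", "year": 2013, "location": "LOC_B", "yield_bu": 47.5},
]
weather = [
    {"year": 2012, "location": "LOC_A", "avg_temp": 21.4},
]

for row in merge_weather(trials, weather):
    print(row)
```

Keeping unmatched rows (with a missing value) rather than dropping them makes it easy to see how much of the original data the external source actually covers.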
