450 Proposal Draft Meeting

Meeting Type: F2F
Time: 06/02/2020, 2-3pm and 4-5pm
Participants: Leonie Lu, Yuetong Liu, Peter Han, Yuting Wen
What we did: Proposal draft discussion.
What needs to be done next: Decide, for the proposal, which statistical tool/method (not R function) we plan to use to address each scientific question.

STAT450 Real Estate Proposal Draft

Leonie Lu, Yuetong Liu, Peter Han, Yuting Wen

Summary
Every year, all property owners in BC have to pay property taxes, which are their single greatest operating expense. The property tax is determined by the property assessment and the property tax rate (mill rate). This project will add value to the business by projecting commercial real estate property taxes and by providing tax and assessment insights for different neighborhoods and/or property types. It will quantify the relationship between property tax and assessment value over the past few years of data, and construct a predictive model to estimate the property tax for a given property for the upcoming year.

Objectives
To predict the property tax for a given property in Metro Vancouver for the upcoming year. This may lead to an automated model/report that shows insights on assessment values and trends for a given area and/or property type.
To find the relationship between property tax and assessment value over the past few years. For example, if the assessment has increased/decreased (for an area/municipality), will the property tax increase/decrease as well? If so, by how much?

Data
To address our problems, we will use the Property Assessment Data (2016-2020) given by our client and the municipal budget, which can be found on the BC government website.
Variables:
All property information in Vancouver (2016-2020), including mill rate (2016-2019), assessment, tax class code, area code, asset type, and year
City of Vancouver budget (2016-2020)

Analysis outline (data analysis specifically for Vancouver)

Data cleaning (reconstruct the data set):
Filter municipalities in Greater Vancouver ("Metro Vancouver" district): Burnaby, Coquitlam, Delta, Langley, Maple Ridge, North Vancouver, Pitt Meadows, Port Coquitlam, Port Moody, Richmond, Surrey, Vancouver, White Rock, West Vancouver, Bowen Island, Anmore, Belcarra and Lions Bay.
Aggregate the data: for each year, sum over data (assessed values for land and improvement, i.e. current year total) within the same municipality and tax class.
Since assessed values differ a lot in scale, transform the assessed values for each municipality and tax class into the percentage change of the mean of assessed values. Do the same for the government budget.
Create dummy variables for tax class and municipality code.
Incorporate past mill rates into the dataset.
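The aggregation and percentage-change steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the tax class label, and all numbers are made up, and the proposal itself leaves open whether the change is taken on the sum or on the mean, so the aggregator is left as a parameter.

```python
from collections import defaultdict

# Hypothetical records: (year, municipality, tax_class, assessed_value).
# All values are invented for illustration; they are not client data.
records = [
    (2016, "Vancouver", "06", 100.0),
    (2016, "Vancouver", "06", 300.0),
    (2017, "Vancouver", "06", 150.0),
    (2017, "Vancouver", "06", 350.0),
    (2016, "Burnaby", "06", 200.0),
    (2017, "Burnaby", "06", 220.0),
]


def aggregate(records, agg=sum):
    """Group assessed values by (municipality, tax_class, year) and aggregate.

    Pass a mean function as `agg` to aggregate by mean instead of sum.
    """
    groups = defaultdict(list)
    for year, muni, tax_class, value in records:
        groups[(muni, tax_class, year)].append(value)
    return {key: agg(values) for key, values in groups.items()}


def pct_change(aggregated):
    """Year-over-year percentage change within each (municipality, tax_class)."""
    changes = {}
    for (muni, tax_class, year), value in aggregated.items():
        prev = aggregated.get((muni, tax_class, year - 1))
        if prev:  # skips the first year of each group and zero denominators
            changes[(muni, tax_class, year)] = 100.0 * (value - prev) / prev
    return changes


totals = aggregate(records)
changes = pct_change(totals)
for key in sorted(changes):
    print(key, f"{changes[key]:.1f}%")
```

The first year of each group has no previous value, so it drops out of the percentage-change table. That costs one year of an already short 2016-2020 series.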

Exploratory data analysis:
Visualize the data: for each municipality and tax class, plot past mill rates, aggregate assessed value, and government budget against time separately.
Plot the percentage change of mill rates, assessed value, and government budget against time separately.
Create box plots of mill rates, grouped by region.
Test correlations between past mill rates, assessed values, and government budget.
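For the correlation step, a minimal sketch of a Pearson correlation between two percentage-change series is given below. The two series are hypothetical values, not results. With only a handful of years per municipality and tax class, any single estimate will be noisy, so it is probably best read alongside the plots.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical yearly percentage changes for one municipality and tax class.
pct_assessed = [12.0, 8.0, -3.0, 5.0]
pct_mill_rate = [-6.0, -4.0, 2.0, -1.0]

r = pearson(pct_assessed, pct_mill_rate)
print(f"correlation between % assessed value and % mill rate: {r:.2f}")
```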

Regression family (linear and non-linear, e.g. elastic net, L1, L2, polynomial)
Let "%" denote percentage change. We propose two regression models, with and without "year" as an explanatory variable.

$$\%\,\text{mill rate} = \beta_0 + \beta_1\,\text{year} + \beta_2\,(\text{municipality class}) + \beta_3\,(\text{tax class}) + \beta_4\,(\%\,\text{assessed value}) + \beta_5\,(\%\,\text{budget})$$

$$\%\,\text{mill rate} = \beta_0 + \beta_1\,(\text{municipality class}) + \beta_2\,(\text{tax class}) + \beta_3\,(\%\,\text{assessed value}) + \beta_4\,(\%\,\text{budget})$$

("Percentage change" already includes the time effect of the previous year.)
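Because municipality and tax class are categorical, each would enter the model as a set of dummy variables rather than a single number. The sketch below shows one way to build a row of the design matrix for the model without "year". The function names, level lists, and values are hypothetical, and dropping the first level as the baseline is just one common convention.

```python
def one_hot(value, levels):
    """Dummy-code `value` against `levels`, dropping the first level as baseline."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def design_row(muni, tax_class, pct_assessed, pct_budget, munis, classes):
    """Intercept, municipality dummies, tax class dummies, then the two % changes."""
    return (
        [1.0]
        + one_hot(muni, munis)
        + one_hot(tax_class, classes)
        + [pct_assessed, pct_budget]
    )


# Hypothetical levels and observation, for illustration only.
munis = ["Burnaby", "Richmond", "Vancouver"]
classes = ["01", "06"]
row = design_row("Vancouver", "06", 5.0, 2.0, munis, classes)
print(row)
```

With K municipalities and J tax classes this gives (K - 1) + (J - 1) dummy columns, which is one reason regularization may help once all Metro Vancouver municipalities are included.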

I really like the document as it gives a nice overview of the (at least initial) plan. Some comments that might help make it clearer:

1. Remember that we are now only interested in prediction (at least for the time being). It might be good to only leave that point in the objectives section.
2. Just for readability, I would include the variables in a list. Or even better, in a table that includes what type of variable (categorical, quantitative, etc.) each one is.
3. I would explain a little more about only doing the analysis for Vancouver. It might be important to highlight that Vancouver's real estate laws are different from the rest of BC, and so it makes sense to do a Vancouver-specific analysis.
4. I would leave out the phrase "reconstruct the data set" as it's not entirely clear to me what it means. (Data cleaning/wrangling is perfect.)
5. I really like the exploratory data analysis plan! Although consider using an ordered list to separate the steps.
6. For the proposed model, I would be careful with using year as a variable because of correlation between different years in the response. You can also mention that you plan to use linear regression + some sort of regularization, which encompasses Elastic Net, Lasso, Ridge, ...
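For reference, one common way to write the regularized objective that covers all three methods is the elastic net form below. The notation is an assumption made for illustration: $y_i$ is the response (here the % change of mill rate), $x_i$ is the vector of predictors, $\lambda \ge 0$ sets the overall penalty strength, and $\alpha \in [0, 1]$ sets the mix between the two penalties.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

Setting $\alpha = 1$ gives the Lasso (L1), $\alpha = 0$ gives Ridge (L2), and values in between give the Elastic Net. Both $\lambda$ and $\alpha$ would typically be chosen by cross-validation.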

Also, these are just recommendations, so feel free to incorporate what you like.

All in all, great work! If you have any other questions or want to follow up on any point, feel free to send me a Slack message.

Cheers.

Sure! We will have another meeting on Monday to modify our proposal based on your comments. Thanks!

Hi everyone,

Good plan overall, and I think that's great advice from Gian Carlo. I have a few additional comments:

I definitely agree with point number 6 about using year as a variable. This would get tricky, since we know subsequent years are correlated with one another, and we expect less correlation between years that are further apart. Since you are using a percent change in assessment values, this implicitly incorporates year-to-year correlation in that value, and you could consider doing the same for other variables (mill rates?).
From what I understood from Harry, "Vancouver" was the only municipality with special tax laws compared with the rest of the province. That is, Burnaby, Richmond, White Rock, etc. don't have these special laws. That is only my understanding, and should be looked into further to confirm one way or another, but if that's the case, then you want to subset the data to just the "Vancouver" municipality for the Vancouver analysis.
You might want to consider (and add to this plan) what the proposed plan is for testing your model's prediction capabilities, and how (or if) you can adjust the model if it is not predicting well.
I also agree with the suggestion of a table for variables. This is a large dataset, so it would be helpful to have a clear visual breakdown of everything that is being used in data exploration and subsequent analysis.
Just to answer the "black box" question from Slack, I was also referring to machine learning methods such as neural networks which might have good prediction but poor interpretability. It's a whole other approach entirely, so you could consider how to incorporate this into your analysis if you wanted to do that, but it's not necessary given you have a plan outlined already that goes another direction.
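On testing prediction capability, one possible setup is to hold out the most recent year, fit on the earlier years, and compare the model's error against a naive baseline. The sketch below only computes the baseline error. The field names, group labels, and mill rates are hypothetical, and "carry last year's value forward" is just one simple choice of baseline, not something from the proposal.

```python
from math import sqrt

# Hypothetical rows: one mill rate per (group, year). Values are made up.
rows = [
    {"year": 2017, "group": "A", "mill_rate": 10.0},
    {"year": 2018, "group": "A", "mill_rate": 11.0},
    {"year": 2019, "group": "A", "mill_rate": 12.0},
    {"year": 2017, "group": "B", "mill_rate": 5.0},
    {"year": 2018, "group": "B", "mill_rate": 5.0},
    {"year": 2019, "group": "B", "mill_rate": 4.0},
]


def split_by_year(rows, test_year):
    """Train on years before `test_year`; test on `test_year` itself."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test


def naive_forecast(train):
    """Baseline: carry each group's most recent training mill rate forward."""
    last = {}
    for r in sorted(train, key=lambda r: r["year"]):
        last[r["group"]] = r["mill_rate"]
    return last


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


train, test = split_by_year(rows, test_year=2019)
baseline = naive_forecast(train)
actual = [r["mill_rate"] for r in test]
predicted = [baseline[r["group"]] for r in test]
baseline_rmse = rmse(actual, predicted)
print(f"baseline RMSE on held-out year: {baseline_rmse:.3f}")
```

Splitting by year rather than at random keeps future information out of the training set, which matches how the model would be used for the upcoming year. If the fitted model does not beat this baseline, that would be a signal to revisit the predictors or the regularization strength.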

Let us know how your meeting goes and if we can help with any other questions!

Hi all,

Great work on the initial proposal! Here are some comments that you guys could consider:

Perhaps you could focus your summary on your primary objective, predicting the tax rate for the Vancouver municipality, so that it sounds more specific and targeted.
You could explain why the focus lies on Vancouver properties: the mill rate is calculated in a more complicated manner there, and hence could be examined and analyzed separately.
Are you guys certain about creating an automated model/report? If so, maybe phrase it in a more definitive fashion. Otherwise, this should be omitted in your objectives section. "May lead to" is a very vague phrase that's commonly frowned upon when writing objectives for a proposal.
Relationship exploration should come before prediction. Maybe re-order your objectives section to make it sound more logical and readable.
As Gian Carlo and Mallory have already mentioned, a table or a list of the variables and their corresponding domains would definitely increase readability.
It's better to include a brief explanation for why "sum over data" is needed.
One thing to be careful with: if assessment values differ by a lot and the dataset is highly skewed with respect to this variable, then the % change in the mean assessed value could sometimes dilute or exaggerate statistical significance (depending on the direction and strength of the skewness), leading to a less reflective prediction for each property. I would suggest using other summary statistics instead of the mean if the data turns out to be highly skewed.
Test for autocorrelation.
Keep the issue of overfitting in mind when trying polynomial models. Those are generally not used because they're too complex. However, you can still try it since Harry does not care about interpretability.
I think "year" is definitely a significant covariate that should be included. Maybe explain why you guys are considering a model without it?
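On the autocorrelation point, a minimal sketch of a lag-1 sample autocorrelation for one yearly series is shown below. The series values are hypothetical. With only four or five years per series the estimate will be very rough, so pooling residuals across municipalities or using a formal test may be more informative.

```python
def lag1_autocorr(series):
    """Lag-1 sample autocorrelation of a sequence ordered in time."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant")
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return num / denom


# Hypothetical yearly mill rates for one municipality and tax class.
mill_rates = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = lag1_autocorr(mill_rates)
print(f"lag-1 autocorrelation: {rho:.2f}")
```

A value well above zero suggests that consecutive years move together, which is the concern raised about treating yearly observations as independent.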

Overall, great job, and please contact us if you guys have more questions!

Closing because the second version is already uploaded.
