jtothehoenderdos / MasterThesis


clarifications requested #10

Open maartenmarx opened 3 years ago

maartenmarx commented 3 years ago

Hi @jtothehoenderdos CC @mckeuken

I received the report from the independent reviewer. He is OK with the defense, provided that the following points are addressed and that you supply a clear rebuttal indicating how each point was addressed. Please give line numbers and the exact wording and, where needed, an explanation of how you addressed each point.

Good luck,

maarten

Here are my impressions after reading this report: The format using the ACM template needs a minor revision, such as tables and figures. 1) Title: "for using machine learning techniques" is very ambiguous and non-informative about what the thesis is. 2) Abstract: the wording is not precise enough to motivate the reader to go on. The last paragraph is not clear. 3) About the Data Analytic Strategies he used, I don't see a clear picture; he focused only on modeling, as follows:

jtothehoenderdos commented 3 years ago

Hi Maarten,

Here are my answers to the questions and to the questions raised earlier:

For readability the comments by Maarten and Max are given in Bold. My response is underlined and the changes in the text are given in italics.

1) Title: for using machine learning techniques is very ambiguous and non-informative what the thesis is.
I have changed the title so that it covers the content of the thesis. The title now reads (p. 2):

Identifying predictive sociodemographic and neighborhood features for youth care demand in the municipality of Amsterdam - a machine learning approach.

2) Abstract: the wording is not precisely correct to motivate the reader go on. The last paragraph is not clear.
I have rephrased the abstract so that the wording is more precise and I have clarified the last paragraph. The abstract now reads (p. 1):

In 2015, the organization of the Dutch youth care system was changed radically by transferring all responsibility to the municipalities. This new way of working was introduced to make youth care more efficient, coherent, and cost-effective. However, the demand for youth care continued to increase, and as a result the waiting lists are getting longer and longer. Within the municipality, there is a clear need to understand what contributes to the demand so as to ensure optimal use of the limited resources available in the youth care system. Therefore, using three machine learning algorithms (support vector machine, Decision Trees, Gradient Boost) we will predict the demand for specialist youth care based on the demographic and neighborhood characteristics. Of the models used, the Decision Tree Classifier had the highest F1-score. In this model the most important feature in predicting youth care demand was "amount approved", which reflects the cost of a given treatment.

3) About the Data Analytic Strategies he used, I don't see a clear picture of, he focused only on modeling, as follows:
• Perform adequate exploratory analysis of the data and provide a complete, yet succinct, presentation of the results.
• Clearly state the statistical model used when presenting model estimates.
• Clearly state the model building/selection/validation criteria used to address the scientific question(s) of interest. (He somehow attempted to do it, but it was not reflected on Discussion section about this).
• Perform adequate model diagnostics (I don't see his reflection on Discussion section about this).
• Provide precise interpretations of the parameters in your model (or your estimates of those parameters) in the context of the scientific problem (I don't see his reflection on Discussion section about this).
-> Would you be okay with discussing this by phone, so that we can go over a few things?

4) Provide a clear statement in the introduction on what you are trying to predict based on which features
I have incorporated an explicit statement at the end of the introduction on what the models are trying to predict and using which features. The introduction now reads (p. 2):

In line with these suggestions, we set out to investigate the value of demographic and neighborhood characteristics in predicting youth care need through the use of machine learning (ML). Three ML algorithms will be used to create a predictive model for youth care needs. Specifically, we will predict the type of youth care use of Amsterdam clients in the years 2018 and 2019 on the basis of their demographic and neighborhood characteristics.

5) Please check the numbers in table 1, 2, and 3. Also provide the numbers per year in each table. If you suppress any categories due to privacy restrictions mention this explicitly in the table text.
I have extended tables 2 and 3 to provide the descriptive statistics per year. Furthermore, I have double-checked the numbers to ensure that they are correct. The updated tables can be found on page 4.

6) Provide a table where you give a description of the final dataset that was used to fit all your models with (so after cleaning etc).
I have incorporated an additional table to provide a description of the final dataset. The new table 4 can be found on page 6.

7) Provide histograms for all included variables
For every single variable that is included in the three different ML models I have created a histogram to show the variability across the different subjects. On page 6 I refer to those histograms in the following manner:

A histogram of all the available features is made and can be found on the GitHub repository⁵.

The link to the GitHub repository is given in footnote 5 on page 8.

8) Expand the figure texts so that it becomes clear what is actually displayed
For every figure (on p. 2, p. 4, p. 6, p. 8, p. 9 and p. 10) I have extended the figure texts to explain what is shown.

9) Provide the SD of the F1 scores in table 6.
I have extended table 7 (previously table 6) on page 8 by including the standard deviation of the F1 scores across the different cross validation runs. The table text now reads (p. 8):

For every tested model the average F1 scores of the CV. The standard deviation is provided between brackets. Every model was trained on the entire dataset (Baseline model), on the undersampled but balanced dataset (Random Undersample) and finally, where computationally possible, the parameters of the Baseline model were optimized by a grid search approach (GridSearchCV). The model that scored the highest given the used dataset is marked in bold.
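As an illustration of the "average F1 (standard deviation)" entries described in the table text, here is a minimal Python sketch using only the standard library. The thread does not say which F1 averaging was used or whether the SD is the sample or the population SD, so macro averaging and the sample SD are assumptions here, and the fold data below is made up for illustration; it is not thesis data.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging, an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize_folds(fold_results):
    """Return (mean, sample SD) of macro-F1 over a list of (y_true, y_pred) folds."""
    f1_scores = [macro_f1(y_true, y_pred) for y_true, y_pred in fold_results]
    return mean(f1_scores), stdev(f1_scores)


# Hypothetical held-out labels and predictions from three CV folds.
folds = [
    (["A", "A", "B", "B"], ["A", "A", "B", "B"]),
    (["A", "A", "B", "B"], ["A", "B", "B", "B"]),
    (["A", "A", "B", "B"], ["A", "A", "A", "B"]),
]
avg, sd = summarize_folds(folds)
print(f"{avg:.2f} ({sd:.2f})")  # formatted like a table cell: mean (SD)
```

With scikit-learn, the per-fold scores could instead come from `cross_val_score` with an F1 scoring option; whichever variant is used, stating the averaging mode and the SD definition in the table text removes the ambiguity.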

10) Expand the section on how you used random undersampling
I have incorporated this comment by expanding the text and by including a new figure. The adjusted text now reads (p. 8):

As stated in the introduction and method section, imbalanced data might be detrimental to the overall performance of ML models. To quantify this, we used a random undersampling technique and fit the three models on the reduced dataset. To visualize the result of the random undersampling, and therefore see which data we used, we made a figure which can be seen in Figure 4. You can see that all the services, which are on the X-axis, have the same number of samples (Y-axis). This is a big difference compared with Figure 3.

The new figure is shown on p. 8.
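The effect described here, every service ending up with the same number of samples, can be sketched as follows. This is a minimal standard-library illustration of random undersampling in general, not the thesis code: the function name `random_undersample`, the fixed seed, and the toy class sizes are all assumptions.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples so that every class keeps as many as the rarest class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Size of the smallest class; every class is reduced to this size.
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical imbalanced target: three services with very different sizes.
labels = ["A"] * 50 + ["B"] * 8 + ["C"] * 3
rows = list(range(len(labels)))
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(labels))           # class sizes before undersampling
print(Counter(balanced_labels))  # class sizes after undersampling
```

Printing the class counts before and after gives the same information as the two bar charts in text form. The imbalanced-learn package provides a `RandomUnderSampler` class for the same purpose; the thread does not say which implementation the thesis used.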

11) Incorporate a discussion section on why logistic regression was not used.
I have extended the discussion where I have included a section on the use of logistic regression (p. 11):

Future research might also consider the use of other ML algorithms than the ones included in the current thesis. A method that comes to mind is Logistic Regression (LR), which would be an interesting alternative compared to SVM for imbalanced datasets. However, given the number of features, training samples and number of classes, it is unlikely that LR will outperform the SVM with the youth care data[REF].

The new version is attached below: Master_Thesis (36).pdf

In my view the thesis is now defensible, and I propose scheduling my defense in the week of 14 December. Do you agree?

I look forward to hearing from you. Jop

maartenmarx commented 3 years ago

Dear @jop, thank you very much. I see quite a few improvements.

It is still hard for me to see what you actually did in your thesis. I would really like to see answers to:

  1. Question 3 from the second reviewer is the most important one, and you leave it completely open. You really will have to answer all of those questions in detail. Many of his points overlap with my earlier questions, which you have not answered yet either.
  2. All of my earlier questions that have remained unanswered.
  3. An overview of the y-variable: I do not need to know the names, but I do need to know how many people are in each class, for your baseline and for your undersampled model. (So you can simply label the classes with letters.)
  4. A clear baseline based on majority class vote.
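Points 3 and 4 of this list (a per-class count of the target variable with anonymized class names, and a majority-class baseline) can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the service labels `s1` to `s3`, and the train/test split are hypothetical and do not come from the thesis.

```python
from collections import Counter
from string import ascii_uppercase


def class_overview(labels):
    """Count samples per class, relabelled A, B, C, ... from largest to smallest.

    Supports at most 26 classes, since it uses single uppercase letters.
    """
    counts = Counter(labels).most_common()
    return {ascii_uppercase[i]: n for i, (_, n) in enumerate(counts)}


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_class for label in test_labels)
    return majority_class, hits / len(test_labels)


# Hypothetical target values; real class names stay hidden behind letters.
train = ["s1"] * 6 + ["s2"] * 3 + ["s3"] * 1
test = ["s1"] * 3 + ["s2"] * 2

overview = class_overview(train)
majority, accuracy = majority_baseline_accuracy(train, test)
print(overview)
print(majority, accuracy)
```

Running `class_overview` once on the full target and once on the undersampled target gives the two overviews requested. Scoring the constant majority prediction with the same F1 measure as the other models puts the baseline on the same scale as the results table; scikit-learn's `DummyClassifier(strategy="most_frequent")` is one existing way to obtain such a baseline.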

I suggest that you try one more time to be really clear, making use of the enormous amount of feedback from our side, and that we decide on that basis whether this is defensible.

jtothehoenderdos commented 3 years ago

Dear @maartenmarx,

Thank you for your reply, and thank you for the enormous amount of feedback from your side. It only makes the thesis better.

I notice that I would prefer to discuss the questions by phone. When would that suit you?

"and that we decide on that basis whether this is defensible" -> I do not fully understand this sentence. Surely your supervisor now also agrees that this thesis is defensible?

I look forward to hearing from you,

Jop

maartenmarx commented 3 years ago

Hi Jop,

We can talk by phone, but first I would really like you to address my feedback, which is clear as far as I am concerned. As I said, it remains unclear to me what you do in your thesis. You do not need to explain it to me over the phone; your thesis has to make it clear. It will be on the web forever, and I want all four of us to be able to stand behind it.

Really, as you have now heard from two people, we want to understand exactly what you do and to be able to read that in the thesis. This is a basic requirement for a thesis.

Kind regards,

Maarten Marx


Maarten Marx +31 06 400 16 120 maartenmarx@uva.nl IRlab/ILPS Informatics Institute Universiteit van Amsterdam
