onefact / datathinking.org

Data Thinking website deployed using GitHub Pages
https://datathinking.org
Apache License 2.0

[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data #141

Closed: aalksnk closed this issue 1 year ago

aalksnk commented 1 year ago

Doing

Reviewing

Reading

Watching

(message Jaan if you need a VPN or these links don't work)

aalksnk commented 1 year ago

hw3report.pdf

aalksnk commented 1 year ago

From me: 3 of the 5 plots in the report are really hard to understand, and I am still not sure they make sense. I could have done better with that part, perhaps by preprocessing the data better. However, I like the description part: it's nice and provides all the details I thought were important.

From GPT4: The report provides a good start for predicting the sender of a message based on the first word in the message. The code loads a JSON file containing messages and sender IDs, preprocesses the data, converts it into a numerical format using embeddings, trains a linear regression model on the training data, evaluates its performance on the test data, and plots a residual plot and a distribution plot. However, there are some limitations and areas for improvement, such as the absence of preprocessing, the consideration of only the first word in the message, and the use of a linear regression model. The report itself mentions these limitations, and future research can explore them to improve the performance of the model. The visualization in the report includes a residual plot and a distribution plot, but it is not clear what doesn't make sense in it. The writing in the report is generally clear, but some technical terms could be elaborated on for better understanding.
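For comparison with the first-word approach described above, here is a minimal sketch of a much simpler baseline. It is not the report's method (the report uses embeddings and linear regression); it only maps each first word to the sender who used it most often. The toy records and the field names `content` and `sender_id` are assumptions made for illustration, not taken from the report's JSON file.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the field names "content" and "sender_id"
# are assumptions, not the schema of the report's JSON file.
messages = [
    {"content": "Hello everyone, homework is up", "sender_id": 1},
    {"content": "Hello again", "sender_id": 1},
    {"content": "Thanks for the link", "sender_id": 2},
    {"content": "Thanks, that helped", "sender_id": 2},
    {"content": "Hello, quick question", "sender_id": 3},
]

def first_word(text):
    """Return the first whitespace-separated token, lowercased, without edge punctuation."""
    tokens = text.split()
    return tokens[0].strip(".,!?:;").lower() if tokens else ""

def fit_majority_baseline(records):
    """Map each first word to the sender who used it most often."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[first_word(rec["content"])][rec["sender_id"]] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def predict(model, text, fallback=None):
    """Predict a sender for a message; unseen first words return the fallback."""
    return model.get(first_word(text), fallback)

model = fit_majority_baseline(messages)
print(predict(model, "Hello there"))
print(predict(model, "Thanks!"))
```

A baseline like this treats the sender as a category rather than a number, so it gives a reference point for judging whether a fitted model adds anything beyond word frequency.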

ikr503 commented 1 year ago

From me: I'd say that, in the context of this class, it's a very fine report, with an exceptionally clear explanation of its purpose and the methods used. I thought the confusion matrix was a cool addition, albeit with a crowded x-axis label. Perhaps the figures could have been placed before the references? Otherwise, I enjoyed reading the homework and seeing the unique ways the author went about completing it.

And then a few critical points from ChatGPT, which I personally don't completely agree with:

  1. The report discusses using the first word of a message to predict the sender. However, it lacks a clear explanation of why this approach is taken or what insights were derived from the data to support this choice.
  2. Overreliance on Models: While the report delves into methodologies like linear regression and embeddings, it overlooks the basic exploratory methods and data cleansing. It is crucial to remember that a model's effectiveness is only as good as the data it receives.
  3. Limited Metrics for Evaluation: The report mentions the calculation of mean squared error for evaluating the linear regression model. However, it does not seem to discuss the relevance or adequacy of this metric for the problem at hand.
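One way to make the third point concrete: sender IDs are arbitrary labels, so mean squared error computed on them depends on how the senders happen to be numbered, while accuracy and confusion counts do not. The sketch below uses made-up sender IDs and a hypothetical relabeling; none of the numbers come from the report.

```python
from collections import Counter

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted numeric labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true sender, predicted sender) pair."""
    return Counter(zip(y_true, y_pred))

# Hypothetical sender IDs and predictions (illustrative only).
y_true = [1, 2, 3, 1]
y_pred = [1, 3, 3, 2]

# Same senders and same mistakes, but with different arbitrary ID numbers.
relabel = {1: 10, 2: 500, 3: 20}
y_true_r = [relabel[v] for v in y_true]
y_pred_r = [relabel[v] for v in y_pred]

print("MSE:", mean_squared_error(y_true, y_pred),
      "vs relabeled:", mean_squared_error(y_true_r, y_pred_r))
print("Accuracy:", accuracy(y_true, y_pred),
      "vs relabeled:", accuracy(y_true_r, y_pred_r))
print("Confusion counts:", dict(confusion_counts(y_true, y_pred)))
```

The mean squared error changes under relabeling even though the model made exactly the same mistakes, whereas accuracy stays the same. This is one reason classification metrics may be a better fit for a sender-prediction task.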