Closed aalksnk closed 1 year ago
From me: Three of the five plots in the report are really hard to understand, and I am still not sure whether they make sense. I could've done better with that part, perhaps by preprocessing the data better. However, I like the description part; it's nice and provides all the details I thought were important.

From GPT4: The report provides a good start for predicting the sender of a message based on the first word in the message. The code loads a JSON file containing messages and sender IDs, preprocesses the data, converts it into a numerical format using embeddings, trains a linear regression model on the training data, evaluates its performance on the test data, and plots a residual plot and distribution plot. However, there are some limitations and areas for improvement, such as the absence of preprocessing, the consideration of only the first word in the message, and the use of a linear regression model. Future research can explore these areas to improve the performance of the model. The report mentions several areas for improvement, such as the absence of preprocessing, the consideration of only the first word in the message, and the use of a linear regression model. The visualization in the report includes a residual plot and distribution plot, but it is not clear what doesn't make sense in it. The writing in the report is generally clear, but there may be some technical terms that could be elaborated on for better understanding.
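For anyone unsure how to read the residual plot mentioned above, here is a minimal sketch of the quantity such a plot shows: the gap between each true value and the fitted line. The toy numbers and the single feature are assumptions for illustration only; the report's actual model (word embeddings as input, sender IDs as target) is not reproduced here.

```python
# Toy data (hypothetical): one numeric feature and one numeric target.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# A residual is the true value minus the model's prediction.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, residuals)
```

A residual plot scatters these values against the predictions. A shapeless cloud around zero suggests a reasonable fit. With a categorical target such as a sender ID, the residuals tend to fall into diagonal bands, which may be part of why the plots are hard to read.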
From me: I'd say in the context of this class, it's a very fine report, with an exceptionally clear explanation of the report's purpose and methods used. I thought the confusion matrix was a cool addition, albeit with a crowded x-axis label. Perhaps the figures could have been added before the references? Otherwise, I enjoyed reading the homework, and seeing the unique ways the author went about completing it:
And then a few critical points from ChatGPT that I personally don't completely agree with:
Doing
[x] Clean Data Thinking Zulip chat data, located at https://github.com/onefact/datathinking.org-codespace/blob/main/data/datathinking.zulipchat.com/raw/messages-000001.json - put it in a polars dataframe and compute summary statistics of the dataset
[x] Analyze this Zulip chat data using logistic regression, linear regression, and embeddings with the tools we have learned in the lectures (don't forget to ask ChatGPT, Claude, Lex, GPT-4 for help as much as you need, and ask for help on the Data Thinking Zulip chat :)
[x] Create a visualization of logistic regression of the Data Thinking Zulip chat data
[x] Create a visualization of linear regression applied to the Data Thinking Zulip chat data
[x] Create a visualization of embeddings using the Data Thinking Zulip Chat data
[x] Make a copy of the Overleaf template: https://www.overleaf.com/read/ghpyzqwqwxpv (need to create an account and/or sign in if this is your first time using Overleaf). To make a copy, open the project after signing in using this link, and click on Menu, then Copy Project.
[x] In Overleaf, edit the template and figure out how to include a PDF figure in the report, alongside a brief description (a few sentences or paragraphs is fine!) of each of the analyses you performed, why you chose them, and the math equation for the linear regression, logistic regression, and embedding you used.
[x] Add the PDF of the report to this issue as a comment.
[x] Send a message on Zulip with a link to this comment, alongside the image representing your favorite visualization
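The first checklist item (load the messages and compute summary statistics) can be sketched roughly as follows. This is a dependency-free stand-in that uses only the Python standard library rather than polars. The field names `sender_id` and `content` and the inline sample records are assumptions; the real export may use different keys.

```python
import json
from collections import Counter
from statistics import mean

# Hypothetical sample standing in for the exported messages file.
# The keys "sender_id" and "content" are assumed, not taken from the real data.
raw = """
[
  {"sender_id": 1, "content": "hello everyone"},
  {"sender_id": 2, "content": "hi"},
  {"sender_id": 1, "content": "how do I load the data?"}
]
"""

messages = json.loads(raw)


def summarize(messages):
    """Return basic summary statistics for a list of message dicts."""
    lengths = [len(m["content"]) for m in messages]
    per_sender = Counter(m["sender_id"] for m in messages)
    return {
        "n_messages": len(messages),
        "n_senders": len(per_sender),
        "mean_chars": mean(lengths),
        "messages_per_sender": dict(per_sender),
    }


summary = summarize(messages)
print(summary)
```

With polars, the same list of records could be passed to `pl.DataFrame(messages)` and summarized with `df.describe()`.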
Reviewing
Reading
json format with chatgpt] https://genmon.github.io/braggoscope/about & https://news.ycombinator.com/item?id=35073603
The Boy Whose Light Went Out by Jack Clark http://techpolicylab.uw.edu/wp-content/uploads/2022/04/Telling_Stories_Pages_4-4-22.pdf
Watching
(message Jaan if you need a VPN or these links don't work)