[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data

aalksnk commented 1 year ago

Doing

[x] Clean Data Thinking Zulip chat data, located at https://github.com/onefact/datathinking.org-codespace/blob/main/data/datathinking.zulipchat.com/raw/messages-000001.json - put it in a polars dataframe and compute summary statistics of the dataset
[x] Analyze this Zulip chat data using logistic regression, linear regression, and embeddings with the tools we have learned in the lectures (don't forget to ask ChatGPT, Claude, Lex, GPT-4 for help as much as you need, and ask for help on the Data Thinking Zulip chat :)
[x] Create a visualization of logistic regression of the Data Thinking Zulip chat data
[x] Create a visualization of linear regression applied to the Data Thinking Zulip chat data
[x] Create a visualization of embeddings using the Data Thinking Zulip Chat data
[x] Make a copy of the Overleaf template: https://www.overleaf.com/read/ghpyzqwqwxpv (need to create an account and/or sign in if this is your first time using Overleaf). To make a copy, open the project after signing in using this link, and click on Menu, then Copy Project.
[x] In Overleaf, edit the template and figure out how to include a PDF figure in the report, alongside a brief description (a few sentences or paragraphs is fine!) of each of the analyses you performed, why you chose them, and the math equation for the linear regression, logistic regression, and embedding you used.
[x] Add the PDF of the report to this issue as a comment.
[x] Send a message on Zulip with a link to this comment, alongside the image representing your favorite visualization
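
For the equations requested in the report step, the textbook forms are sketched below. These are generic definitions, not necessarily the exact variants used in any submitted report; the symbols (n messages, p features, coefficients beta, vocabulary V, embedding dimension d) are assumptions chosen for illustration.

```latex
% Linear regression: prediction and least-squares fit
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

% Logistic regression: probability of the positive class
P(y_i = 1 \mid x_i) = \sigma\!\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% Embedding: a map from tokens to vectors, compared by cosine similarity
e : \mathcal{V} \to \mathbb{R}^{d},
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```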

Reviewing

[x] Review how Jaan got unstuck in the lecture recordings at https://panopto.ut.ee/Panopto/Pages/Sessions/List.aspx?folderID=43bb180c-79a6-4324-b055-afa400ecd1a0
[x] Review collaborative whiteboards from past classes: listed at https://www.datathinking.org/university-of-tartu
[x] Review Jupyter notebooks from past classes:

Reading

[x] [inspiration for how to categorize discussion data in json format with chatgpt] https://genmon.github.io/braggoscope/about & https://news.ycombinator.com/item?id=35073603
[x] [help with helping you learn to be a prompt engineer] https://github.com/dair-ai/Prompt-Engineering-Guide
[x] [help with storytelling] https://themarkup.org/hello-world/2023/02/04/journalistic-lessons-for-the-algorithmic-age
[x] [help with debugging] https://wizardzines.com/images/debugging/toc-letter.pdf
[x] [design, visual communication] https://anthonyhobday.com/sideprojects/saferules/ & https://news.ycombinator.com/item?id=34684761
[x] [background on ChatGPT and Claude from an ex-OpenAI, Anthropic founder] read The Boy Whose Light Went Out by Jack Clark http://techpolicylab.uw.edu/wp-content/uploads/2022/04/Telling_Stories_Pages_4-4-22.pdf
[x] [history of OpenAI internal difficulties] https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
[x] [context for supply chain ban on GPU exports/imports] https://www.palladiummag.com/2021/10/11/the-triumph-and-terror-of-wang-huning/
[x] [starting to think about bias in what you read from chatgpt or elsewhere] https://catalogofbias.org/biases/
[x] [psychological tools for support in dealing with ChatGPT, especially if you decide to try jailbreaks] https://www.bellingcat.com/resources/2022/11/23/how-to-maintain-mental-hygiene-as-an-open-source-researcher/
[x] [how data is being used by open source intelligence researchers in war] https://www.cbsnews.com/news/bellingcat-russia-putin-ukraine-60-minutes-2022-08-21/
[x] [background on social media and recommendation algorithms] https://knightcolumbia.org/content/understanding-social-media-recommendation-algorithms

Watching

(message Jaan if you need a VPN or these links don't work)

[x] [context for popular media portrayals of chatgpt] South Park S26E04 - "Deep Learning" - https://southpark.cc.com/episodes/8byci4/south-park-deep-learning-season-26-ep-4
[x] [context for how the CEO of OpenAI talks about it] https://abcnews.go.com/Technology/video/openai-ceo-cto-risks-ai-reshape-society-97949497 article
[x] [an emotional history of the world; context for our data thinking practice of feel, ask, do, think -- and how the thoughts in people's heads can shape society] 1 episode of https://thoughtmaybe.com/cant-get-you-out-of-my-head/

aalksnk commented 1 year ago

hw3report.pdf

aalksnk commented 1 year ago

From me: 3/5 plots in the report are really hard to understand, and I am still not sure if they make any sense. I could've done better with that part, perhaps by preprocessing the data better. However, I like the description part; it's nice and provides all the details I thought were important.

From GPT4: The report provides a good start for predicting the sender of a message based on the first word in the message. The code loads a JSON file containing messages and sender IDs, preprocesses the data, converts it into a numerical format using embeddings, trains a linear regression model on the training data, evaluates its performance on the test data, and plots a residual plot and distribution plot. However, there are some limitations and areas for improvement, such as the absence of preprocessing, the consideration of only the first word in the message, and the use of a linear regression model. Future research can explore these areas to improve the performance of the model. The report mentions several areas for improvement, such as the absence of preprocessing, the consideration of only the first word in the message, and the use of a linear regression model. The visualization in the report includes a residual plot and distribution plot, but it is not clear what doesn't make sense in it. The writing in the report is generally clear, but there may be some technical terms that could be elaborated on for better understanding.

ikr503 commented 1 year ago

From me: I'd say in the context of this class, it's a very fine report, with an exceptionally clear explanation of the report's purpose and methods used. I thought the confusion matrix was a cool addition, albeit with a crowded x-axis label. Perhaps the figures could have been added before the references? Otherwise, I enjoyed reading the homework and seeing the unique ways the author went about completing it.

And then a few critical points from ChatGPT, which I personally don't completely agree with:

The report discusses using the first word of a message to predict the sender. However, it lacks a clear explanation of why this approach is taken or what insights were derived from the data to support this choice.
Overreliance on Models: While the report delves into methodologies like linear regression and embeddings, it overlooks the basic exploratory methods and data cleansing. It is crucial to remember that a model's effectiveness is only as good as the data it receives.
Limited Metrics for Evaluation: The report mentions the calculation of mean squared error for evaluating the linear regression model. However, it does not seem to discuss the relevance or adequacy of this metric for the problem at hand.
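
One way to probe the point about mean squared error: sender IDs are labels rather than quantities, so an error measured in "ID units" depends on how the IDs happen to be numbered. The sketch below is only an illustration with made-up sender IDs and predictions (none of these values come from the report). It compares mean squared error with plain accuracy before and after renaming the same senders.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sender IDs and predictions (two of four are wrong).
y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 1]

# The same senders under a different, equally arbitrary numbering.
relabel = {1: 10, 2: 500, 3: 20}
y_true_relabeled = [relabel[v] for v in y_true]
y_pred_relabeled = [relabel[v] for v in y_pred]

mse_before = mean_squared_error(y_true, y_pred)
mse_after = mean_squared_error(y_true_relabeled, y_pred_relabeled)
acc_before = accuracy(y_true, y_pred)
acc_after = accuracy(y_true_relabeled, y_pred_relabeled)

print(f"MSE:      {mse_before} -> {mse_after}")
print(f"Accuracy: {acc_before} -> {acc_after}")
```

In this toy case the accuracy stays the same under renaming while the mean squared error changes by orders of magnitude. That is one argument for reporting a classification metric alongside it.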
