[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data

siimre commented 1 year ago

Doing

[x] Clean Data Thinking Zulip chat data, located at https://github.com/onefact/datathinking.org-codespace/blob/main/data/datathinking.zulipchat.com/raw/messages-000001.json - put it in a polars dataframe and compute summary statistics of the dataset
[x] Analyze this Zulip chat data using logistic regression, linear regression, and embeddings with the tools we have learned in the lectures (don't forget to ask ChatGPT, Claude, Lex, GPT-4 for help as much as you need, and ask for help on the Data Thinking Zulip chat :)
[x] Create a visualization of logistic regression of the Data Thinking Zulip chat data
[x] Create a visualization of linear regression applied to the Data Thinking Zulip chat data
[x] Create a visualization of embeddings using the Data Thinking Zulip Chat data
[x] Make a copy of the Overleaf template: https://www.overleaf.com/read/ghpyzqwqwxpv (need to create an account and/or sign in if this is your first time using Overleaf). To make a copy, open the project after signing in using this link, and click on Menu, then Copy Project.
[x] In Overleaf, edit the template and figure out how to include a PDF figure in the report, alongside a brief description (a few sentences or paragraphs is fine!) of each of the analyses you performed, why you chose them, and the math equation for the linear regression, logistic regression, and embedding you used.
[x] Add the PDF of the report to this issue as a comment.
[x] Send a message on Zulip with a link to this comment, alongside the image representing your favorite visualization
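
For the equations requested in the report, the standard textbook forms can be sketched as below. The symbols are conventional choices and are assumptions here, not taken from the Overleaf template: x_i is a feature of message i (for example its length), y_i is the target, and f is whichever embedding model was used, mapping a message to a d-dimensional vector.

```latex
% Linear regression: the target is a straight-line function of the feature plus noise
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Logistic regression: the probability of the positive class is a sigmoid of the same linear term
P(y_i = 1 \mid x_i) = \sigma(\beta_0 + \beta_1 x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}

% Embedding: each message m is mapped to a vector, and two messages are compared by cosine similarity
f(m) \in \mathbb{R}^d, \qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```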

Reviewing

[x] Review how Jaan got unstuck in the lecture recordings at https://panopto.ut.ee/Panopto/Pages/Sessions/List.aspx?folderID=43bb180c-79a6-4324-b055-afa400ecd1a0
[x] Review collaborative whiteboards from past classes: listed at https://www.datathinking.org/university-of-tartu
[x] Review Jupyter notebooks from past classes:

Reading

[x] [inspiration for how to categorize discussion data in json format with chatgpt] https://genmon.github.io/braggoscope/about & https://news.ycombinator.com/item?id=35073603
[x] [help with helping you learn to be a prompt engineer] https://github.com/dair-ai/Prompt-Engineering-Guide
[x] [help with storytelling] https://themarkup.org/hello-world/2023/02/04/journalistic-lessons-for-the-algorithmic-age
[x] [help with debugging] https://wizardzines.com/images/debugging/toc-letter.pdf
[x] [design, visual communication] https://anthonyhobday.com/sideprojects/saferules/ & https://news.ycombinator.com/item?id=34684761
[x] [background on ChatGPT and Claude from an ex-OpenAI, Anthropic founder] read The Boy Whose Light Went Out by Jack Clark http://techpolicylab.uw.edu/wp-content/uploads/2022/04/Telling_Stories_Pages_4-4-22.pdf
[x] [history of OpenAI internal difficulties] https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
[x] [context for supply chain ban on GPU exports/imports] https://www.palladiummag.com/2021/10/11/the-triumph-and-terror-of-wang-huning/
[x] [starting to think about bias in what you read from chatgpt or elsewhere] https://catalogofbias.org/biases/
[x] [psychological tools for support in dealing with ChatGPT, especially if you decide to try jailbreaks] https://www.bellingcat.com/resources/2022/11/23/how-to-maintain-mental-hygiene-as-an-open-source-researcher/
[x] [how data is being used by open source intelligence researchers in war] https://www.cbsnews.com/news/bellingcat-russia-putin-ukraine-60-minutes-2022-08-21/
[x] [background on social media and recommendation algorithms] https://knightcolumbia.org/content/understanding-social-media-recommendation-algorithms

Watching

(message Jaan if you need a VPN or these links don't work)

[x] [context for popular media portrayals of chatgpt] South Park S26E04 - "Deep Learning" - https://southpark.cc.com/episodes/8byci4/south-park-deep-learning-season-26-ep-4
[x] [context for how the CEO of OpenAI talks about it] https://abcnews.go.com/Technology/video/openai-ceo-cto-risks-ai-reshape-society-97949497
[x] [an emotional history of the world; context for our data thinking practice of feel, ask, do, think -- and how the thoughts in people's heads can shape society] 1 episode of https://thoughtmaybe.com/cant-get-you-out-of-my-head/

siimre commented 1 year ago

HW3_by_SR.pdf

siimre commented 1 year ago

Critique from me: The regression results are not correct. A linear regression fit should look like a straight line, and a logistic regression fit should look like an S-shaped (sigmoid) curve. All the formulas are nicely laid out and explained. References are correct.
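
As a quick sanity check on what the two fitted curves should look like, here is a minimal sketch using only the Python standard library. The coefficients are made up for illustration and are not the ones from the report. It evaluates a linear model and a logistic model on a grid and checks that the first changes by a constant step while the second rises monotonically between 0 and 1.

```python
import math


def linear(x, intercept, slope):
    """Linear model: a straight line in x."""
    return intercept + slope * x


def logistic(x, intercept, slope):
    """Logistic model: sigmoid of a linear term, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Evaluation grid from -6.0 to 6.0 in steps of 0.5
xs = [x / 2 for x in range(-12, 13)]

# Illustrative (hypothetical) coefficients, not fitted to the Zulip data
line = [linear(x, 0.5, 0.1) for x in xs]
curve = [logistic(x, 0.0, 1.0) for x in xs]

# Differences between neighbouring grid points
line_steps = [b - a for a, b in zip(line, line[1:])]
curve_steps = [b - a for a, b in zip(curve, curve[1:])]

print("line steps are constant:", max(line_steps) - min(line_steps) < 1e-9)
print("logistic is increasing:", all(s > 0 for s in curve_steps))
print("logistic range:", min(curve), max(curve))
```

If a plot of the fitted logistic model does not show this flat, steep, flat pattern, a likely cause is that raw data points were plotted instead of predicted probabilities over a sorted grid of x values.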

siimre commented 1 year ago

Critique from ChatGPT:

Abstract: The abstract provides a brief overview of the document, mentioning the context of Homework 3 in the DataThinking course and the focus on analyzing a conversation dataset. However, it could benefit from providing more specific details about the dataset and the insights derived.
Introduction: The introduction briefly introduces the purpose of the study, which is to examine conversation patterns, critical topics, and the relationship between message content, length, and sender identification. It could be improved by providing more context on why these aspects are important and what potential applications or implications they might have.
Data, Cleaning, and Preprocessing: This section mentions the use of Zulip chat data for analysis and the extraction of message content and sender identification. However, it lacks details about the cleaning and preprocessing steps performed on the data. Providing more information about the specific techniques used would enhance the clarity and reproducibility of the study.
Analyze: This section briefly mentions the analysis performed on the dataset, including logistic regression, linear regression, and word embeddings. However, it lacks a clear explanation of the specific insights or findings derived from these analyses. Adding more details about the results and their implications would strengthen this section.
Conclusion: The conclusion provides a summary of the study, highlighting the stages of data gathering, exploratory analysis, model development, and evaluation. However, it lacks specific details about the key findings or insights gained from the analysis. Additionally, it mentions the challenges faced during coding and provides a reference to a helpful guide, but it does not elaborate on the solutions or lessons learned from overcoming those challenges.
References: The references section includes URLs for the dataset, debugging guide, and code file used. However, it lacks traditional academic referencing style, such as author names, publication dates, and journal/conference references.

Overall, the document provides an outline of the study and mentions the methods used for analysis. However, it lacks in-depth analysis and specific findings. To improve the document, it would be beneficial to include more detailed explanations of the insights gained from the analysis, provide a clearer structure for each section, and adhere to standard academic referencing conventions.
