[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data

nesmaAlmoazamy commented 1 year ago

Doing

[x] Clean Data Thinking Zulip chat data, located at https://github.com/onefact/datathinking.org-codespace/blob/main/data/datathinking.zulipchat.com/raw/messages-000001.json - put it in a polars dataframe and compute summary statistics of the dataset
[x] Analyze this Zulip chat data using logistic regression, linear regression, and embeddings with the tools we have learned in the lectures (don't forget to ask ChatGPT, Claude, Lex, GPT-4 for help as much as you need, and ask for help on the Data Thinking Zulip chat :)
[x] Create a visualization of logistic regression of the Data Thinking Zulip chat data
[x] Create a visualization of linear regression applied to the Data Thinking Zulip chat data
[x] Create a visualization of embeddings using the Data Thinking Zulip Chat data
[x] Make a copy of the Overleaf template: https://www.overleaf.com/read/ghpyzqwqwxpv (you need to create an account and/or sign in if this is your first time using Overleaf). To make a copy, open the project after signing in using this link, click on Menu, then Copy Project.
[x] In Overleaf, edit the template and figure out how to include a PDF figure in the report, alongside a brief description (a few sentences or paragraphs is fine!) of each of the analyses you performed, why you chose them, and the math equation for the linear regression, logistic regression, and embedding you used.
[x] Add the PDF of the report to this issue as a comment.
[x] Send a message on Zulip with a link to this comment, alongside the image representing your favorite visualization

Reviewing

[x] Review how Jaan got unstuck in the lecture recordings at https://panopto.ut.ee/Panopto/Pages/Sessions/List.aspx?folderID=43bb180c-79a6-4324-b055-afa400ecd1a0
[x] Review collaborative whiteboards from past classes: listed at https://www.datathinking.org/university-of-tartu
[x] Review Jupyter notebooks from past classes:

Reading

[x] [inspiration for how to categorize discussion data in json format with chatgpt] https://genmon.github.io/braggoscope/about & https://news.ycombinator.com/item?id=35073603
[x] [help with helping you learn to be a prompt engineer] https://github.com/dair-ai/Prompt-Engineering-Guide
[x] [help with storytelling] https://themarkup.org/hello-world/2023/02/04/journalistic-lessons-for-the-algorithmic-age
[x] [help with debugging] https://wizardzines.com/images/debugging/toc-letter.pdf
[x] [design, visual communication] https://anthonyhobday.com/sideprojects/saferules/ & https://news.ycombinator.com/item?id=34684761
[x] [background on ChatGPT and Claude from an ex-OpenAI, Anthropic founder] read The Boy Whose Light Went Out by Jack Clark http://techpolicylab.uw.edu/wp-content/uploads/2022/04/Telling_Stories_Pages_4-4-22.pdf
[x] [history of OpenAI internal difficulties] https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
[x] [context for supply chain ban on GPU exports/imports] https://www.palladiummag.com/2021/10/11/the-triumph-and-terror-of-wang-huning/
[x] [starting to think about bias in what you read from chatgpt or elsewhere] https://catalogofbias.org/biases/
[x] [psychological tools for support in dealing with ChatGPT, especially if you decide to try jailbreaks] https://www.bellingcat.com/resources/2022/11/23/how-to-maintain-mental-hygiene-as-an-open-source-researcher/
[x] [how data is being used by open source intelligence researchers in war] https://www.cbsnews.com/news/bellingcat-russia-putin-ukraine-60-minutes-2022-08-21/
[x] [background on social media and recommendation algorithms] https://knightcolumbia.org/content/understanding-social-media-recommendation-algorithms

Watching

(message Jaan if you need a VPN or these links don't work)

[x] [context for popular media portrayals of chatgpt] South Park S26E04 - "Deep Learning" - https://southpark.cc.com/episodes/8byci4/south-park-deep-learning-season-26-ep-4
[x] [context for how the CEO of OpenAI talks about it] https://abcnews.go.com/Technology/video/openai-ceo-cto-risks-ai-reshape-society-97949497
[x] [an emotional history of the world; context for our data thinking practice of feel, ask, do, think -- and how the thoughts in people's heads can shape society] 1 episode of https://thoughtmaybe.com/cant-get-you-out-of-my-head/

nesmaAlmoazamy commented 1 year ago

Please find the PDF here: Homework_PDF.pdf

nesmaAlmoazamy commented 1 year ago

Feedback about my report:

1. Overall, I feel I spent more time on data preparation and implementation than on writing explanations in the report.
2. I should have added a reference to ChatGPT, since I used it in Section 1.3.
3. I still like the plots for word embeddings and word clouds.
4. The logistic regression and linear regression could have been presented better, with another dataset or in a different way than in the homework.
5. I like using LaTeX to write the equations rather than plain paragraph text.

nesmaAlmoazamy commented 1 year ago

ChatGPT feedback :)

Overall, Nesma Mahmoud's homework report provides a good example of how to approach data analysis using text data. The report covers several important steps of data analysis, including data cleaning, data visualization, and modeling. Using the NLTK library to filter the dataset and prepare it for analysis is a good approach. The visualizations, including the word cloud and word frequency plot, give a quick view of the dataset's most frequent words. The word embeddings plot also shows how the embeddings capture the relationships between words in the dataset.

The report's use of Linear and Logistic Regression models is also appropriate for analyzing the text data. The plots of Actual vs Predicted values provide a quick understanding of the models' performance. The report also provides the equations for both models, which helps in understanding how these models work.
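
For reference, the standard textbook forms of the two models are sketched below. The notation ($x_{ij}$ for feature $j$ of message $i$, $\beta$ for coefficients, $\sigma$ for the sigmoid) is chosen here and is not necessarily the notation used in the report.

```latex
% Linear regression: prediction and least-squares fit
\[
  \hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
  \qquad
  \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
\]

% Logistic regression: probability of the positive class
\[
  P(y_i = 1 \mid x_i) = \sigma\!\Bigl(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Bigr),
  \qquad
  \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```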

The inclusion of the Word2Vec equation is a nice touch, as it shows how the model calculates the vector representations for the words in the dataset. However, the report does not provide a clear explanation of how the Word2Vec model fits into the analysis or how it is used to generate the word embeddings plot.

Overall, the report provides a good example of how to approach text data analysis using common techniques like data cleaning, data visualization, and modeling. However, a more detailed explanation of the Word2Vec model's role in the analysis could improve the report's clarity.
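
As a sketch of what such an explanation could include: the report is not quoted here, so the exact Word2Vec variant is unknown. Assuming the skip-gram model with a full softmax, the training objective and the word probability take the standard form below, where $v_w$ and $v'_w$ are the input and output vectors of word $w$, $V$ is the vocabulary size, $T$ the number of tokens, and $c$ the context window size.

```latex
% Skip-gram objective: average log-probability of context words
\[
  \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    \log P(w_{t+j} \mid w_t)
\]

% Softmax probability of an output word given an input word
\[
  P(w_O \mid w_I) =
    \frac{\exp\bigl({v'_{w_O}}^{\top} v_{w_I}\bigr)}
         {\sum_{w=1}^{V} \exp\bigl({v'_{w}}^{\top} v_{w_I}\bigr)}
\]
```

The learned input vectors $v_w$ are the embeddings. A two-dimensional plot of them typically requires an additional projection step (for example PCA or t-SNE), which would be worth stating in the report if one was used.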

0rd0s1n1ster commented 1 year ago

My thoughts

Wow, such a nice report! It feels like quite some time was invested in it. The person who wrote it most likely knows NLP.

However, there are some points to be considered:

Reference the co-author (ChatGPT)
More explanation of what is being predicted would be appreciated
Axis labels are missing
Word embeddings (fig. 6): the text is hard to distinguish; perhaps it should be smaller

ChatGPT "thoughts": to improve the report, some suggestions are:

Provide a clear introduction or background to the project.
Provide a more detailed explanation of the data cleaning process, including the specific methods used and some information on the size and characteristics of the dataset.
Include some evaluation metrics or measures of model performance to give a better sense of how well the logistic and linear regression models are performing.
Provide more context and explanation around the word2vec algorithm and the specific equations used.
Include more visualizations or graphics to help illustrate the findings and results of the analysis.
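
The evaluation-metrics suggestion above could be addressed with something like the following sketch: accuracy for the logistic regression, and mean squared error plus R² for the linear regression. The function names and the toy label/prediction lists are hypothetical; in practice the inputs would be the held-out targets and the model's predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not results from the report).
class_true = [1, 0, 1, 1]
class_pred = [1, 0, 0, 1]
print("accuracy:", accuracy(class_true, class_pred))

reg_true = [1.0, 2.0, 3.0]
reg_pred = [1.0, 2.0, 4.0]
print("MSE:", mean_squared_error(reg_true, reg_pred))
print("R^2:", r_squared(reg_true, reg_pred))
```

Reporting these numbers next to the Actual vs Predicted plots would make it easier to judge how well each model fits.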
