[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data

onefact / datathinking.org

Data Thinking website deployed using GitHub Pages

Apache License 2.0


Doing

[x] Clean Data Thinking Zulip chat data, located at https://github.com/onefact/datathinking.org-codespace/blob/main/data/datathinking.zulipchat.com/raw/messages-000001.json - put it in a polars dataframe and compute summary statistics of the dataset
[x] Analyze this Zulip chat data using logistic regression, linear regression, and embeddings with the tools we have learned in the lectures (don't forget to ask ChatGPT, Claude, Lex, GPT-4 for help as much as you need, and ask for help on the Data Thinking Zulip chat :)
[x] Create a visualization of logistic regression of the Data Thinking Zulip chat data
[x] Create a visualization of linear regression applied to the Data Thinking Zulip chat data
[x] Create a visualization of embeddings using the Data Thinking Zulip Chat data
[x] Make a copy of the Overleaf template: https://www.overleaf.com/read/ghpyzqwqwxpv (need to create an account and/or sign in if this is your first time using Overleaf). To make a copy, open the project after signing in using this link, and click on Menu, then Copy Project.
[x] In Overleaf, edit the template and figure out how to include a PDF figure in the report, alongside a brief description (a few sentences or paragraphs is fine!) of each of the analyses you performed, why you chose them, and the math equation for the linear regression, logistic regression, and embedding you used.
[x] Add the PDF of the report to this issue as a comment.
[x] Send a message on Zulip with a link to this comment, alongside the image representing your favorite visualization
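
As a rough illustration of the logistic regression item in the list above, here is a minimal sketch in plain Python with no third-party libraries. It models P(y = 1 | x) = sigmoid(w * x + b) and fits w and b by gradient descent on the mean log loss. The toy message lengths, the "sent by an instructor" label, and the choice of message length as the only feature are all assumptions made up for illustration; none of them come from the actual Zulip export.

```python
import math

# Hypothetical toy data (NOT from the real Zulip export): message length in
# characters, and a label that is 1 if a hypothetical instructor sent it.
lengths = [5, 8, 12, 15, 70, 80, 95, 120]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Standardize the feature so that gradient descent behaves well.
mean = sum(lengths) / len(lengths)
std = math.sqrt(sum((v - mean) ** 2 for v in lengths) / len(lengths))
xs = [(v - mean) / std for v in lengths]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Logistic regression model: P(y = 1 | x) = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)


def log_loss(w, b):
    """Mean negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, labels):
        # Clamp the probability to avoid log(0) on very confident predictions.
        p = min(max(predict(x, w, b), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


# Plain batch gradient descent on the mean log loss.
w, b, learning_rate = 0.0, 0.0, 0.5
initial_loss = log_loss(w, b)
for _ in range(500):
    errors = [predict(x, w, b) - y for x, y in zip(xs, labels)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(errors) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = log_loss(w, b)

# Classify with a 0.5 probability threshold.
predictions = [1 if predict(x, w, b) >= 0.5 else 0 for x in xs]
print(f"w={w:.3f} b={b:.3f} loss {initial_loss:.3f} -> {final_loss:.3f}")
print("predictions:", predictions)
```

For the homework itself, a library implementation on features derived from the real messages would be the more practical route; the sketch is only meant to make the equation in the report concrete.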

Reviewing

[x] Review how Jaan got unstuck in the lecture recordings at https://panopto.ut.ee/Panopto/Pages/Sessions/List.aspx?folderID=43bb180c-79a6-4324-b055-afa400ecd1a0
[x] Review collaborative whiteboards from past classes: listed at https://www.datathinking.org/university-of-tartu
[x] Review Jupyter notebooks from past classes:

Reading

[x] [inspiration for how to categorize discussion data in json format with chatgpt] https://genmon.github.io/braggoscope/about & https://news.ycombinator.com/item?id=35073603
[x] [help with helping you learn to be a prompt engineer] https://github.com/dair-ai/Prompt-Engineering-Guide
[x] [help with storytelling] https://themarkup.org/hello-world/2023/02/04/journalistic-lessons-for-the-algorithmic-age
[x] [help with debugging] https://wizardzines.com/images/debugging/toc-letter.pdf
[x] [design, visual communication] https://anthonyhobday.com/sideprojects/saferules/ & https://news.ycombinator.com/item?id=34684761
[x] [background on ChatGPT and Claude from an ex-OpenAI, Anthropic founder] read The Boy Whose Light Went Out by Jack Clark http://techpolicylab.uw.edu/wp-content/uploads/2022/04/Telling_Stories_Pages_4-4-22.pdf
[x] [history of OpenAI internal difficulties] https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
[x] [context for supply chain ban on GPU exports/imports] https://www.palladiummag.com/2021/10/11/the-triumph-and-terror-of-wang-huning/
[x] [starting to think about bias in what you read from chatgpt or elsewhere] https://catalogofbias.org/biases/
[x] [psychological tools for support in dealing with ChatGPT, especially if you decide to try jailbreaks] https://www.bellingcat.com/resources/2022/11/23/how-to-maintain-mental-hygiene-as-an-open-source-researcher/
[x] [how data is being used by open source intelligence researchers in war] https://www.cbsnews.com/news/bellingcat-russia-putin-ukraine-60-minutes-2022-08-21/
[x] [background on social media and recommendation algorithms] https://knightcolumbia.org/content/understanding-social-media-recommendation-algorithms

Watching

(message Jaan if you need a VPN or these links don't work)

[x] [context for popular media portrayals of chatgpt] South Park S26E04 - "Deep Learning" - https://southpark.cc.com/episodes/8byci4/south-park-deep-learning-season-26-ep-4
[x] [context for how the CEO of OpenAI talks about it] https://abcnews.go.com/Technology/video/openai-ceo-cto-risks-ai-reshape-society-97949497 article
[x] [an emotional history of the world; context for our data thinking practice of feel, ask, do, think -- and how the thoughts in people's heads can shape society] 1 episode of https://thoughtmaybe.com/cant-get-you-out-of-my-head/

Red Teaming

My own critique (10 mins):

While the personal experience in the abstract is nice, it says nothing about what the report includes or what the goals and methodology of the work were.

The report doesn't have much structure: there’s only one part, called “Main part”, so it’s difficult to find, for example, what logistic regression was used for.

The first paragraph of the main part is more suitable for the abstract than the abstract itself (biased opinion)

The first figure is very hard to read. In theory it should visualise both embeddings and logistic regression results, but sender IDs are discrete, and they are impossible to read on a linear axis.

All the figures have the same description, so the descriptions are not helpful. One of the figures appears twice.

I cannot really draw any conclusions from the plots: the titles are not descriptive and there’s no legend (or axis names in the last one).

Embeddings are used in every single experiment, but they’re only fully explained in the last one. Maybe changing the order is a good idea.

Critique from ChatGPT (I don’t have access to GPT-4 yet):

This homework report has several issues:

Lack of clarity and structure: The report lacks a clear structure and organization, making it difficult to follow and understand. The author jumps between different topics without providing a clear introduction or conclusion.

Poor writing: The report contains several grammatical errors and lacks coherence and clarity in its writing style. This makes it hard to understand the author's ideas and arguments.

Lack of depth and analysis: The report provides a basic description of the methods used, but does not provide a detailed analysis or evaluation of the results. The author also does not provide any context or background information on the DataThinking Zulip chat data or the techniques used.

Inappropriate tone: The author's use of language is unprofessional and inappropriate for an academic report. The report contains several informal and unprofessional phrases that undermine the credibility of the author and the report.

Lack of reproducibility: The report does not provide enough details on the data cleaning and preprocessing steps, making it difficult to reproduce the analysis. Additionally, the report does not provide any code or scripts used for the analysis, which makes it impossible to verify the results.

Red teaming

My critique (10 mins):

The report is not well structured, and the academic writing style leaves much to be desired. A little irritation is noticeable in the text. The definitions are kept broad, which suits the homework formulation well.

Things to be improved:

Lack of details.

Repeated figures.

Sarcastic style.

Lack of logical flow (but at least some was present).

ChatGPT 3.5 (GPT-4 failed)

Introduction: The opening paragraph does not provide any context for the report and does not relate to the main analysis. It should be removed or rewritten to introduce the purpose of the report.

Language: The language used is informal and unprofessional. The report should be written in a more formal and technical language appropriate for a research report.

Methods: The methods used for data cleaning and preprocessing are not described. It is unclear how the Word2Vec model was trained and how the embeddings were generated. The report should provide more detail about the methodology used.

Results: The results are poorly presented. The figures are not labeled or explained, and it is unclear what they represent. The report should provide clear and concise explanations of the results, including the statistical significance and interpretation of the findings.

Conclusion: The conclusion does not summarize the findings or provide any recommendations for future research. The report should conclude with a clear and concise summary of the findings and suggestions for further investigation.

Overall structure: The report lacks a clear structure and organization. It should be reorganized into clear sections, with headings and subheadings to guide the reader through the report.

Professionalism: The report includes inappropriate comments and self-deprecation. These should be removed to maintain a professional and objective tone.
