onefact / datathinking.org

Data Thinking website deployed using GitHub Pages
https://datathinking.org
Apache License 2.0

[homework: asking, writing, thinking, doing, 🔴 red-teaming 😈] Critique what you have built; Real-World Data; Historiography of Data, Incentives, & AI #158

Closed. artu134 closed this issue 1 year ago

artu134 commented 1 year ago

Reading

Pro tip: try using an app on your phone or computer to read text aloud to you at 1.5x speed! This can save time and make it easier to absorb information without being visually tied to a screen.

Doing

Creating

Thinking

Listening

Large Language Model Access Checklist

artu134 commented 1 year ago

My analysis of my report:

After analyzing my report, I realized that I needed to spend more time exploring the data rather than jumping straight to prediction. Using logistic regression just to predict the sender ID may not be the best approach. Instead, I would do better to train an RNN on this data to model each user's writing style; such a model should be more accurate at predicting which user wrote a given message (a rough sketch follows below). I also needed to be more scientifically rigorous in writing the report, with more categories and more data plotting.
Still, overall it was a reasonable first attempt at applying ML methods to raw chat data.
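
To make the RNN idea a bit more concrete, here is a minimal, hedged sketch of one way it could look. It assumes a pandas DataFrame `df` with hypothetical `content` and `sender_id` columns and TensorFlow/Keras installed; it is a starting point, not the report's actual code.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

MAX_WORDS, MAX_LEN = 20_000, 50  # vocabulary and message-length caps

texts = df["content"].astype(str).values
labels = LabelEncoder().fit_transform(df["sender_id"])
num_classes = len(np.unique(labels))

# Map raw messages to padded sequences of word indices.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=MAX_WORDS, output_sequence_length=MAX_LEN)
vectorize.adapt(texts)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Small LSTM classifier over each message's word sequence.
model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(MAX_WORDS, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=64,
          validation_data=(X_test, y_test))
```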

ChatGPT's answer:

This research conducted an analysis of Zulip chat data with the aim of predicting the sender ID based on text content. The approach included preprocessing the text data, converting it into numerical format using Word2Vec embeddings, and training logistic regression and linear regression models. While this research is quite comprehensive, there are areas where it could be potentially improved:
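
For reference, here is a minimal sketch of the kind of baseline pipeline described above (Word2Vec features plus logistic regression). It assumes a DataFrame `df` with hypothetical `content` and `sender_id` columns and gensim plus scikit-learn installed; messages are embedded here by averaging word vectors, which may differ in detail from the report's own featurization.

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tokenize messages and train Word2Vec on the chat corpus itself.
tokens = [simple_preprocess(msg) for msg in df["content"].astype(str)]
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=2)

def embed(words, model, dim=100):
    """Average the vectors of in-vocabulary words; zeros for empty messages."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.vstack([embed(t, w2v) for t in tokens])
y = df["sender_id"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```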

  1. Text Preprocessing: The text preprocessing step only includes tokenization. It could be improved by including other techniques such as lemmatization (converting words to their base form), removing punctuation, converting all text to lower case, and possibly handling emoticons and other special characters which are common in chats.
  2. Feature Engineering: It appears that the features used for prediction are solely based on Word2Vec embeddings. More features could be extracted from the text that may be helpful in predicting the sender ID. These could include message length, use of specific words or phrases, use of punctuation, time of message, etc.
  3. Model Selection: The use of logistic and linear regression is a good starting point, but there are many other models that might give better performance. For text classification tasks, models like Support Vector Machines (SVM), Random Forests, Naive Bayes, or even deep learning architectures like Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN) could potentially yield better results.
  4. Hyperparameter Tuning: The research does not mention if any hyperparameter tuning was performed. Tuning the model parameters can often lead to improved performance.
  5. Evaluation Metrics: The evaluation of the models seems to be based on accuracy (for logistic regression) and mean squared error (for linear regression). While these are valid metrics, they might not give a complete picture of the model's performance. Other metrics such as Precision, Recall, F1 score or AUC-ROC (for classification tasks) could provide a more comprehensive view of model performance.
  6. Cross Validation: Using a single train-test split has the risk of overfitting to the test set. It would be better to use a technique like k-fold cross validation to ensure that the model generalizes well to unseen data.
  7. Interpretability: While t-SNE visualizations provide some insights, it might be helpful to include an analysis of feature importances (which features are most predictive) to better understand why the models are making certain predictions.
  8. Multi-word Embeddings: The research only creates embeddings for the first word of each message. Using techniques to create embeddings for entire sentences or documents, like Doc2Vec or sentence transformers, might improve performance and provide a better representation of the text data.

By addressing these points, the research could potentially improve its methodology and results.
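
To make several of these suggestions concrete, a few hedged sketches follow; names like `df`, `X`, `y`, `tokens`, and `feature_names` are assumptions carried over from the baseline sketch above, not anything defined in the report. First, the richer preprocessing from point 1, assuming NLTK and its "punkt" and "wordnet" resources are available; the emoji marker is just one simple way to handle special characters.

```python
import re
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF]")  # crude emoji range

def preprocess(message: str) -> list[str]:
    text = message.lower()
    text = EMOJI_RE.sub(" EMOJI ", text)  # replace emoji with a marker token
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)]

print(preprocess("Thanks!! Running the notebook now 🎉"))
```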
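Point 2's hand-crafted features could look roughly like this, to be concatenated with the embedding features; the `content` and `timestamp` column names, and the assumption that timestamps are Unix seconds, are guesses about the Zulip export.

```python
import string
import pandas as pd

def add_stylometric_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["msg_len_chars"] = out["content"].str.len()
    out["msg_len_words"] = out["content"].str.split().str.len()
    out["punct_count"] = out["content"].apply(
        lambda s: sum(ch in string.punctuation for ch in s))
    out["question_marks"] = out["content"].str.count(r"\?")
    out["exclamations"] = out["content"].str.count("!")
    out["upper_ratio"] = out["content"].apply(
        lambda s: sum(ch.isupper() for ch in s) / max(len(s), 1))
    # Hour of day, assuming a Unix-seconds timestamp column is available.
    out["hour"] = pd.to_datetime(out["timestamp"], unit="s").dt.hour
    return out
```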
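Points 3 through 6 can be covered together: compare a few classifier families under k-fold cross-validation, tune hyperparameters with a grid search, and report precision, recall, and F1 rather than accuracy alone. `X` and `y` stand for the feature matrix and sender-ID labels built earlier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated comparison of a few candidate models (points 3 and 6).
for name, model in [("svm", SVC()),
                    ("random forest", RandomForestClassifier()),
                    ("naive bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Hold out a test set, then tune the random forest on the training split (point 4).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    cv=cv, scoring="f1_macro")
grid.fit(X_train, y_train)

# Per-class precision, recall, and F1 give a fuller picture than accuracy (point 5).
print(classification_report(y_test, grid.predict(X_test)))
```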
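For point 7, tree-based models expose feature importances directly; `feature_names` is assumed to list the columns of `X` in order (the stylometric names above plus one name per embedding dimension).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Rank features by how much they contribute to the forest's splits.
importances = (pd.Series(rf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(15))
```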
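And for point 8, one option is gensim's Doc2Vec, which learns a vector for each whole message instead of relying on a single word vector; `tokens` is the list of tokenized messages from the preprocessing step.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(documents=docs, vector_size=100, min_count=2, epochs=20)

# One 100-dimensional vector per message, usable as features downstream.
X_docs = [d2v.dv[i] for i in range(len(docs))]
```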
artu134 commented 1 year ago

Proposed Research Question: How can we leverage data science techniques to identify and protect against social engineering threats posed by chatbots?

Rationale: In the era of digital communication, chatbots have become a pervasive part of online interactions. They have been used for a variety of purposes, ranging from customer service to user engagement. However, alongside their beneficial uses, chatbots also pose a significant risk in terms of social engineering attacks. Social engineering is a method of tricking individuals into disclosing confidential information, and chatbots provide an effective and scalable way for malicious actors to carry out such attacks. Therefore, understanding how to identify and mitigate these threats is crucial for personal and collective digital security.

Decisions to be Made: This research will require a variety of decisions to be made. First, we need to decide on the dataset to be used - this could either be an existing dataset of chatbot interactions or a dataset we collect ourselves. Next, we need to decide on the techniques to be used to analyze the data. This could involve machine learning to identify patterns typical of chatbots, natural language processing to understand the linguistic nuances of chatbot interactions, or a combination of both. We also need to consider ethical implications, such as privacy concerns when collecting and analyzing chat data.
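
As a very rough illustration of the supervised option named above, the sketch below trains a TF-IDF plus logistic regression classifier to flag likely chatbot messages. The `messages.csv` file and its `text` and `is_bot` columns are entirely hypothetical, since collecting or labeling such data is itself one of the open decisions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("messages.csv")  # hypothetical labeled chat data

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["is_bot"], test_size=0.2,
    stratify=data["is_bot"], random_state=42)

# TF-IDF n-grams capture the repetitive phrasing typical of many bots.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```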

Stakes: The stakes for this research are high. As chatbots become more sophisticated and indistinguishable from human interaction, the potential for their misuse in social engineering attacks also rises. By identifying the characteristics of these chatbots, we can help inform individuals and organizations about the risks and provide strategies for protection. Additionally, our findings could be used to improve the design and regulation of chatbots to prevent their misuse. Ultimately, this research could contribute to safer online spaces for everyone.