onefact / datathinking.org

Data Thinking website deployed using GitHub Pages
https://datathinking.org
Apache License 2.0

[homework: asking, writing, thinking, doing, 🔴 red-teaming 😈] Critique what you have built; Real-World Data; Historiography of Data, Incentives, & AI #158

Closed. artu134 closed this issue 1 year ago

artu134 commented 1 year ago

Reading

Pro tip: try using an app on your phone or computer to read text aloud to you at 1.5x speed! This can save time and make it easier to absorb information without being visually tied to a screen.

Doing

Creating

Thinking

Listening

Large Language Model Access Checklist

artu134 commented 1 year ago

My analysis of my report:

After analyzing my report, I realized that I needed to spend more time exploring the data rather than jumping straight to prediction. Using logistic regression just to predict the sender ID may not be the best approach. Instead, I would do better to train an RNN on this data to model each user's writing style; such a model should be more accurate at predicting which user wrote a given message (a rough sketch follows below). I also needed to be more scientifically rigorous in writing the report, with more categories and more data plotting.
Still, overall it was a reasonable first attempt at applying ML methods to raw chat data.
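
To make the RNN idea a bit more concrete, here is a minimal, hedged sketch of one way it could look. It assumes a pandas DataFrame `df` with hypothetical `content` and `sender_id` columns and TensorFlow/Keras installed; it is a starting point, not the report's actual code.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

MAX_WORDS, MAX_LEN = 20_000, 50  # vocabulary and message-length caps

texts = df["content"].astype(str).values
labels = LabelEncoder().fit_transform(df["sender_id"])
num_classes = len(np.unique(labels))

# Map raw messages to padded sequences of word indices.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=MAX_WORDS, output_sequence_length=MAX_LEN)
vectorize.adapt(texts)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Small LSTM classifier over each message's word sequence.
model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(MAX_WORDS, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=64,
          validation_data=(X_test, y_test))
```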

ChatGPT's answer:

This research conducted an analysis of Zulip chat data with the aim of predicting the sender ID based on text content. The approach included preprocessing the text data, converting it into numerical format using Word2Vec embeddings, and training logistic regression and linear regression models. While this research is quite comprehensive, there are areas where it could be potentially improved:
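
For reference, here is a minimal sketch of the kind of baseline pipeline described above (Word2Vec features plus logistic regression). It assumes a DataFrame `df` with hypothetical `content` and `sender_id` columns and gensim plus scikit-learn installed; messages are embedded here by averaging word vectors, which may differ in detail from the report's own featurization.

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tokenize messages and train Word2Vec on the chat corpus itself.
tokens = [simple_preprocess(msg) for msg in df["content"].astype(str)]
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=2)

def embed(words, model, dim=100):
    """Average the vectors of in-vocabulary words; zeros for empty messages."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.vstack([embed(t, w2v) for t in tokens])
y = df["sender_id"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```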

  1. Text Preprocessing: The text preprocessing step only includes tokenization. It could be improved by including other techniques such as lemmatization (converting words to their base form), removing punctuation, converting all text to lower case, and possibly handling emoticons and other special characters which are common in chats.
  2. Feature Engineering: It appears that the features used for prediction are solely based on Word2Vec embeddings. More features could be extracted from the text that may be helpful in predicting the sender ID. These could include message length, use of specific words or phrases, use of punctuation, time of message, etc.
  3. Model Selection: The use of logistic and linear regression is a good starting point, but there are many other models that might give better performance. For text classification tasks, models like Support Vector Machines (SVM), Random Forests, Naive Bayes, or even deep learning architectures like Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN) could potentially yield better results.
  4. Hyperparameter Tuning: The research does not mention if any hyperparameter tuning was performed. Tuning the model parameters can often lead to improved performance.
  5. Evaluation Metrics: The evaluation of the models seems to be based on accuracy (for logistic regression) and mean squared error (for linear regression). While these are valid metrics, they might not give a complete picture of the model's performance. Other metrics such as Precision, Recall, F1 score or AUC-ROC (for classification tasks) could provide a more comprehensive view of model performance.
  6. Cross Validation: Using a single train-test split has the risk of overfitting to the test set. It would be better to use a technique like k-fold cross validation to ensure that the model generalizes well to unseen data.
  7. Interpretability: While t-SNE visualizations provide some insights, it might be helpful to include an analysis of feature importances (which features are most predictive) to better understand why the models are making certain predictions.
  8. Multi-word Embeddings: The research only creates embeddings for the first word of each message. Using techniques to create embeddings for entire sentences or documents, like Doc2Vec or sentence transformers, might improve performance and provide a better representation of the text data.

By addressing these points, the research could potentially improve its methodology and results.
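
To make several of these suggestions concrete, a few hedged sketches follow; names like `df`, `X`, `y`, `tokens`, and `feature_names` are assumptions carried over from the baseline sketch above, not anything defined in the report. First, the richer preprocessing from point 1, assuming NLTK and its "punkt" and "wordnet" resources are available; the emoji marker is just one simple way to handle special characters.

```python
import re
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF]")  # crude emoji range

def preprocess(message: str) -> list[str]:
    text = message.lower()
    text = EMOJI_RE.sub(" EMOJI ", text)  # replace emoji with a marker token
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)]

print(preprocess("Thanks!! Running the notebook now 🎉"))
```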
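Point 2's hand-crafted features could look roughly like this, to be concatenated with the embedding features; the `content` and `timestamp` column names, and the assumption that timestamps are Unix seconds, are guesses about the Zulip export.

```python
import string
import pandas as pd

def add_stylometric_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["msg_len_chars"] = out["content"].str.len()
    out["msg_len_words"] = out["content"].str.split().str.len()
    out["punct_count"] = out["content"].apply(
        lambda s: sum(ch in string.punctuation for ch in s))
    out["question_marks"] = out["content"].str.count(r"\?")
    out["exclamations"] = out["content"].str.count("!")
    out["upper_ratio"] = out["content"].apply(
        lambda s: sum(ch.isupper() for ch in s) / max(len(s), 1))
    # Hour of day, assuming a Unix-seconds timestamp column is available.
    out["hour"] = pd.to_datetime(out["timestamp"], unit="s").dt.hour
    return out
```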
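Points 3 through 6 can be covered together: compare a few classifier families under k-fold cross-validation, tune hyperparameters with a grid search, and report precision, recall, and F1 rather than accuracy alone. `X` and `y` stand for the feature matrix and sender-ID labels built earlier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated comparison of a few candidate models (points 3 and 6).
for name, model in [("svm", SVC()),
                    ("random forest", RandomForestClassifier()),
                    ("naive bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Hold out a test set, then tune the random forest on the training split (point 4).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    cv=cv, scoring="f1_macro")
grid.fit(X_train, y_train)

# Per-class precision, recall, and F1 give a fuller picture than accuracy (point 5).
print(classification_report(y_test, grid.predict(X_test)))
```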
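For point 7, tree-based models expose feature importances directly; `feature_names` is assumed to list the columns of `X` in order (the stylometric names above plus one name per embedding dimension).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Rank features by how much they contribute to the forest's splits.
importances = (pd.Series(rf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(15))
```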
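And for point 8, one option is gensim's Doc2Vec, which learns a vector for each whole message instead of relying on a single word vector; `tokens` is the list of tokenized messages from the preprocessing step.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(documents=docs, vector_size=100, min_count=2, epochs=20)

# One 100-dimensional vector per message, usable as features downstream.
X_docs = [d2v.dv[i] for i in range(len(docs))]
```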
artu134 commented 1 year ago

Proposed Research Question: How can we leverage data science techniques to identify and protect against social engineering threats posed by chatbots?

Rationale: In the era of digital communication, chatbots have become a pervasive part of online interactions. They have been used for a variety of purposes, ranging from customer service to user engagement. However, alongside their beneficial uses, chatbots also pose a significant risk in terms of social engineering attacks. Social engineering is a method of tricking individuals into disclosing confidential information, and chatbots provide an effective and scalable way for malicious actors to carry out such attacks. Therefore, understanding how to identify and mitigate these threats is crucial for personal and collective digital security.

Decisions to be Made: This research will require a variety of decisions to be made. First, we need to decide on the dataset to be used - this could either be an existing dataset of chatbot interactions or a dataset we collect ourselves. Next, we need to decide on the techniques to be used to analyze the data. This could involve machine learning to identify patterns typical of chatbots, natural language processing to understand the linguistic nuances of chatbot interactions, or a combination of both. We also need to consider ethical implications, such as privacy concerns when collecting and analyzing chat data.
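
As a very rough illustration of the supervised option named above, the sketch below trains a TF-IDF plus logistic regression classifier to flag likely chatbot messages. The `messages.csv` file and its `text` and `is_bot` columns are entirely hypothetical, since collecting or labeling such data is itself one of the open decisions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("messages.csv")  # hypothetical labeled chat data

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["is_bot"], test_size=0.2,
    stratify=data["is_bot"], random_state=42)

# TF-IDF n-grams capture the repetitive phrasing typical of many bots.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```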

Stakes: The stakes for this research are high. As chatbots become more sophisticated and indistinguishable from human interaction, the potential for their misuse in social engineering attacks also rises. By identifying the characteristics of these chatbots, we can help inform individuals and organizations about the risks and provide strategies for protection. Additionally, our findings could be used to improve the design and regulation of chatbots to prevent their misuse. Ultimately, this research could contribute to safer online spaces for everyone.