onefact / datathinking.org

Data Thinking website deployed using GitHub Pages
https://datathinking.org
Apache License 2.0

[homework: doing, reading, watching] Linear, logistic regressions and embedding visualizations of Zulip data #119

Open Donce530 opened 1 year ago

Donce530 commented 1 year ago

Doing

Reviewing

Reading

Watching

(message Jaan if you need a VPN or these links don't work)

Donce530 commented 1 year ago

Zulip chat data report.pdf

Donce530 commented 1 year ago

Red Teaming

My own homework critique (10 mins):

The abstract doesn’t say anything about expectations for the research results - it’s too short, and you need to read more to understand the goal.

When discussing data cleaning, I talk about using the df.describe() function, but don’t explain what df is, where it comes from, etc. More explanation about the methodology and libraries used would be nice.
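As a minimal sketch of the kind of context that was missing: df would typically be a pandas DataFrame holding one row per chat message. The column names and values below are hypothetical stand-ins, since the real Zulip export is not shown in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the exported Zulip messages;
# the real column names are not given in this thread.
df = pd.DataFrame(
    {
        "sender": ["a", "b", "a", "c"],
        "content": ["hi", "hello there", "ok", "see you tomorrow"],
    }
)

# Derive a numeric column so that describe() has something to summarize.
df["length"] = df["content"].str.len()

# By default describe() reports count, mean, std, min, quartiles and max
# for the numeric columns only ("length" here).
summary = df.describe()
print(summary)
```

Stating where df comes from (here, an in-memory table; in the report, presumably a loaded export) makes the later cleaning steps easier to follow.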

In some parts, I talk about what exact functions I used and partly show the code, but in others, I just say I did x or y. The approach to explaining what’s going on should be standardized.

I forgot to add the mathematical equations for linear and logistic regression.
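For reference, the standard textbook forms of the two models could be written as follows. The notation (features x_1..x_p, coefficients beta_0..beta_p, n observations) is an assumption, since the report's own symbols are not visible in this thread.

```latex
% Linear regression: prediction for one observation with features x_1, ..., x_p
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j

% Ordinary least squares estimate over n observations
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

% Binary logistic regression: probability of the positive class
P(y = 1 \mid x) = \sigma\!\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_j \right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```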

Figures are not mentioned in the text, so when reading you have to go and find the correct figure manually.

Typo in TfIdfVectorizer class name - I should manually re-read parts like class names (which grammar-checking tools do not detect) before submission.

I used logistic regression for multiple classes but did not explain how multiclass logistic regression works.
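A rough sketch of what such an explanation could show: multinomial logistic regression learns one score per class and turns the scores into probabilities with a softmax. The messages, labels, and the choice of TF-IDF features below are made up for illustration and are not taken from the report. Note that scikit-learn spells the vectorizer class TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy messages and three made-up classes, purely for illustration.
messages = [
    "when is the homework due",
    "homework deadline is friday",
    "the plot looks great",
    "nice figure and plot",
    "zoom link is broken",
    "cannot open the zoom link",
]
labels = ["deadline", "deadline", "plot", "plot", "zoom", "zoom"]

# Turn each message into a TF-IDF feature vector.
X = TfidfVectorizer().fit_transform(messages)

# With more than two classes, recent scikit-learn versions fit a
# multinomial (softmax) model by default.
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# One linear score per class and message: shape (n_messages, n_classes).
scores = clf.decision_function(X)

# Softmax turns the scores into probabilities that sum to 1 per message.
# Subtracting the row maximum only improves numerical stability.
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
softmax = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# The predicted class is the one with the highest probability.
predicted = clf.classes_[softmax.argmax(axis=1)]
print(predicted)
```

In the multinomial setting these softmax values should agree with clf.predict_proba(X), which is one way to check the explanation against the library.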

My homework critique from ChatGPT (I don’t have access to GPT-4 yet):

Overall, your report is well-organized and provides a clear overview of the approach and methods used to analyze the Zulip chat data. However, there are a few areas that could be improved:

Lack of clarity in the research question: While you mention that the purpose of the analysis is to learn the process of data thinking, it is not entirely clear what research question or objective you are trying to answer with your analysis. This could be made more explicit in the introduction.

Lack of context: It would be helpful to provide some context about the Data Thinking course and the Zulip chat data, such as the purpose of the course and who the participants are. This would help the reader better understand the significance of the analysis and the relevance of the results.

Lack of detail in the methods: While you briefly describe the methods used for linear regression and embeddings, there is not much detail provided about how these models were implemented or how the results were interpreted. Providing more detail about the specific methods and approaches used would help the reader better understand the analysis.

Lack of discussion of limitations and potential biases: It would be useful to include a discussion of any limitations or potential biases in the analysis, such as limitations in the data or assumptions made in the models. This would help the reader better understand the reliability and generalizability of the results.

Inconsistent use of language: There are a few instances where the language used in the report is informal or unclear. For example, in the introduction, you use the phrase "with the help(?) of large language models," which is not entirely clear. Using more formal and consistent language throughout the report would improve its clarity and readability.

Overall, your report provides a good overview of the process and results of your analysis, but there is room for improvement in terms of clarity, detail, and consistency.

nesmaAlmoazamy commented 1 year ago

1- The report is well-organized and written.

2- I feel there was a nice effort spent on the data preparation and data thinking.

3- I like the way of thinking to split the data into different target classes to use in the regression.

4- The correlation between Zulip message length and time of day is an interesting way to analyze the data.

5- Figure 2, which shows message embeddings projected onto a 2D plot using PCA, is too small for me to understand the idea clearly.

6- Figure 4, which visualizes the performance of the logistic regression model, appears to have some issues. This could be due to data bias, as the decision surfaces for multinomial logistic regression are not visible.

Finally, good job and great work :)
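One possible way to address the figure-size comment is sketched below: project the embeddings to 2D with PCA and draw them on an explicitly larger canvas. The embeddings here are random placeholders, and the figure size and DPI are arbitrary choices, not values from the report.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Placeholder embeddings: 200 messages with 64 dimensions each.
# Real message embeddings would replace this random matrix.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)

# A larger figure and higher DPI keep the points legible in a PDF.
fig, ax = plt.subplots(figsize=(10, 8), dpi=150)
ax.scatter(points_2d[:, 0], points_2d[:, 1], s=12, alpha=0.7)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Message embeddings projected to 2D with PCA")

# Share of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
```

Calling fig.savefig with a file name of your choice would write the figure to disk for inclusion in the report.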

beijn commented 1 year ago

I inputted the Data Thinking Wikipedia article preceded by the prompt "Memorize this definition of Datathinking. I will ask you questions about it later." Then I inputted a text-only version of the report (acquired using pdftotext) preceded by the prompt "Critize the following report using the definition of Datathinking". The results are below. View the conversation here


The given report demonstrates an attempt to analyze Zulip chat data from the Data Thinking course at the University of Tartu. While the report provides some insights and results, there are several areas where it could be criticized based on the definition of Data Thinking:

Lack of clear research question: The report states that the goal is not necessarily to find precise conclusions but rather to learn the process of data thinking. However, it is essential in any data analysis to have a clear research question or objective to guide the analysis and interpretation of results. Without a specific question, the analysis may lack focus and fail to provide meaningful insights.

Inadequate data cleaning explanation: The report mentions the removal of certain columns based on their perceived lack of value. However, it does not provide a comprehensive explanation or justification for these decisions. It is important to clearly outline the criteria used for data cleaning and provide reasoning behind the removal of specific columns. This helps ensure transparency and reproducibility of the analysis.

Insufficient exploration of alternative models: The report solely focuses on linear regression and logistic regression models without exploring other potential models that could be more appropriate for the given data. Data thinking involves considering various modeling techniques and selecting the most suitable one based on the data characteristics and research question. The report does not justify why linear and logistic regression were chosen over other alternatives.

Limited evaluation of results: The report briefly presents the results of the linear regression and logistic regression models but does not provide a comprehensive evaluation or interpretation of these results. It is crucial to critically analyze and interpret the model outputs, assess their performance, and discuss any limitations or uncertainties. This would enhance the validity and reliability of the analysis.

Inadequate discussion of visualization choices: The report mentions attempts to visualize the results using different techniques but fails to provide a thorough discussion of the visualization choices made. Data thinking involves selecting appropriate visualizations to effectively communicate insights and patterns in the data. It is important to explain the rationale behind the chosen visualizations and discuss their effectiveness in conveying the intended message.

Limited reflection on the data thinking process: While the report briefly touches upon the value and principle of data thinking, it lacks a comprehensive reflection on the entire data thinking process. Data thinking involves iterative exploration, questioning, and critical analysis of the data. The report could benefit from a more in-depth reflection on the challenges faced, lessons learned, and areas for improvement in the data thinking process.

Overall, the report demonstrates some efforts in analyzing the Zulip chat data but falls short in certain aspects of data thinking, such as the formulation of research questions, comprehensive data cleaning explanation, exploration of alternative models, evaluation of results, discussion of visualization choices, and reflection on the data thinking process. Improvements in these areas would strengthen the analysis and make it more aligned with the principles of data thinking.