kvarada / DSI-ML-workshop-2024

Introductory machine learning workshop for non-stem audience.
https://kvarada.github.io/DSI-ML-workshop-2024
Creative Commons Zero v1.0 Universal

Add activity reference material v1 #4

Closed tonyshumlh closed 3 weeks ago

kvarada commented 3 weeks ago

@yuliaUU, Just so you know, this is for the afternoon lab session. From the survey, it seems there’s a lot of interest in framing machine learning problems. Some participants will bring their own datasets and research problems. These materials are for those who haven’t brought their own. If you have some time, it would be great if you could contribute to this.

yuliaUU commented 3 weeks ago

@kvarada thank you! Are we still doing 2 sections in the afternoon session, one for those who know how to code and one for those who do not? Or are we merging everyone together?

kvarada commented 3 weeks ago

@yuliaUU Based on the survey results, I think it makes sense to have just one section focused on ML problem framing. That seems like the most useful approach for this audience. Some people will bring their own problems, while others will work on the datasets we provide. We’ll offer guiding questions (like the ones mentioned above) to help with ML problem framing. What do you think?

kvarada commented 3 weeks ago

@tonyshumlh and @yuliaUU FYI: The above write-up for the lab session will go here in the GitHub repo and here on our website.

yuliaUU commented 3 weeks ago

@kvarada yes, better to keep things simple! I like the idea that students can bring their own datasets and get help discussing what would work best for their data.

tonyshumlh commented 3 weeks ago

ML Problem Framing Example

Data

Framing

  1. Is the provided dataset appropriate for the specified objective? What type of data would ideally solve your problem or research question? Are there better-suited datasets available for this objective?

    • The ideal dataset should contain an indicator of whether each transaction is fraudulent, which is necessary for applying supervised machine learning methods. It should also contain data about the transactions (e.g. transaction details, customer details) that could be used to predict fraud.
    • The dataset contains a target column 'Class' that indicates whether a transaction is fraudulent. It also contains other columns ('Time', 'Amount', 'V1' to 'V28') that can be used as factors to predict the target column 'Class'.
    • There are some caveats to using the dataset:
      • Columns V1 - V28 are PCA-transformed. Without the original data, we do not know how the PCA transformation was performed, so we cannot guarantee that the same transformation can be applied to new data (e.g. unseen real-world data).
      • From an interpretation perspective, we do not know what the columns V1 - V28 mean, so we cannot draw meaningful conclusions about the relationship between the factors and the target. Moreover, we might not detect bias or unfairness in the ML model if the original dataset contains sensitive information, e.g. age.
    • {better dataset's URL}
  2. Clearly define the expected input and the 'ideal' output. Determine if machine learning is the appropriate method for addressing this problem.

    • Data input: data about the transactions, e.g.
      • transaction details: time, amount, merchant information, etc.
      • customer details: account balance, cardholder demographics, etc.
    • Data output: a soft prediction (probability) of whether a transaction is fraudulent.
    • Because we can clearly define the expected input and output and can collect the data, machine learning is an appropriate method for addressing the problem.
  3. If machine learning is deemed suitable, what should the model aim to achieve? How would you measure the model's performance?

    • Objective: detect fraudulent transactions by producing a soft prediction (probability) and applying a threshold to decide whether a transaction is flagged as fraud.
    • We can expect false positives (clean transactions predicted as fraud) and false negatives (fraudulent transactions predicted as clean), and the model should balance the two according to our objectives. For example, more false negatives might create bad debts, while more false positives might affect revenue.
    • Moreover, the dataset is imbalanced: only a small share of the records are fraud. A model that predicts "clean" for every transaction would score high accuracy while catching no fraud, so accuracy alone is misleading here.
    • We can use evaluation metrics such as precision, recall, and area under the precision-recall curve to measure model performance (e.g. how reliable the fraud flags are, how many frauds are caught, and how well false positives and false negatives are balanced).
  4. How would a human tackle this issue? Can you propose any heuristic methods to solve this problem?

  5. What are the major steps required to resolve this problem?
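
To make the threshold trade-off from question 3 concrete, here is a minimal sketch in plain Python. The labels and probabilities are made up for illustration (they are not taken from the Kaggle dataset), and `precision_recall` is a hypothetical helper name, not a library function.

```python
def precision_recall(y_true, y_prob, threshold):
    """Return (precision, recall) when probabilities >= threshold are flagged as fraud."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # fraud, flagged
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # fraud, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 10 transactions, 2 of them fraud (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.10, 0.20, 0.30, 0.35, 0.45, 0.70, 0.40, 0.90]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, lowering the threshold catches both frauds but flags more clean transactions, while raising it gives fewer false alarms and misses one fraud. Which side to favour depends on the relative cost of bad debts and blocked legitimate transactions.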

  1. Draw a diagram that illustrates the input, output, and key stages of the problem-solving process.

    • {TBC}
  2. Which type of machine learning would be best suited for your problem?

  3. What specific machine learning technique would be most effective for this problem?

    • Given a clear target with labelled data, supervised machine learning (specifically binary classification) is best suited.
    • A simple linear model such as logistic regression can be applied first to check for a linear relationship between the target and the factors and to obtain a baseline performance. (Linear regression itself predicts a continuous value, so it is not a natural fit for a fraud / not-fraud target.)
    • Given the possibility of a non-linear relationship, a non-linear algorithm, e.g. an SVM with a non-linear kernel, can be applied to see whether any improvement is observed.
    • Given the variety of data types (textual, numerical, etc.) in an ideal dataset, a decision-tree-based algorithm can be applied to see whether any improvement is observed.
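
As a rough illustration of the "baseline first, then try alternatives" idea, the sketch below compares a logistic regression baseline with a decision tree using scikit-learn. It runs on synthetic imbalanced data from `make_classification` as a stand-in; the sample size, class ratio, and tree depth are arbitrary assumptions, not values from the workshop or the Kaggle dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a fraud dataset: roughly 2% positives (assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
# Stratify so the rare class appears in both the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logistic regression (linear baseline)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    fraud_prob = model.predict_proba(X_test)[:, 1]  # soft prediction for class 1
    # Average precision summarises the precision-recall curve in one number.
    scores[name] = average_precision_score(y_test, fraud_prob)
    print(f"{name}: average precision = {scores[name]:.3f}")
```

Other candidates (for example `sklearn.svm.SVC` with `probability=True`) could be added to the `models` dictionary in the same way. Which model scores best on synthetic data says nothing about the real fraud data; the point is only the comparison workflow.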

kvarada commented 3 weeks ago

@tonyshumlh

Thanks for putting this together. I'm not sure whether it's ready for my review yet, but I would also describe a scenario and your objective, something like this:

Imagine you work at a bank, and the current fraud detection model isn't performing well. Your boss asks you to explore machine learning approaches to improve the detection and flagging of fraudulent credit card transactions. While researching online, you find the Credit Card Fraud Detection dataset on Kaggle, which could be useful for creating a prototype.

And then ask them to brainstorm the provided questions.

tonyshumlh commented 3 weeks ago

@kvarada I agree. I have refined the problem framing example in workshop-12.qmd and added your questions in workshop-10.qmd. I will continue to consolidate the example topics and datasets in workshop-11.qmd.

tonyshumlh commented 3 weeks ago

@kvarada @p-bajpai Added the example topics and datasets. It would be great if you could review the content and the layout (there is some issue with my Quarto preview). Thank you.