henryliangt commented 1 year ago

Goal:

Solve a classification problem using techniques from the course.

2 stages

Plan & EDA ( 10%, week 7 )

Dataset
EDA
Initial data analysis (IDA) describing the dataset

Presentation (10%)

Detailed report: dataset, problem, statistics method, conclusion.
Video Presentation: key findings

Final Report (15%)

Sunday evening

henryliangt commented 1 year ago

Types of learning problems

Unsupervised

classes are unknown

Supervised

classes are predefined

labels are discrete = classification problem

Two class classification

Tumour / no tumour
Winning party out of two in an election
Credit card fraud

multi-class prediction

Identify species based on images

Diagnosis of a patient based on symptoms

labels are continuous, regression problem.

henryliangt commented 1 year ago

Can select your own dataset

Can be found in a repository
UCI Machine learning repository
Kaggle datasets
Your own workplace
Others, can be creative.

Not recommended: data that requires a lot of feature engineering and pre-processing, e.g. images or time series.

10,000+ rows. Messy: lots of missing values. Complex: 100+ features, with a mix of numeric and categorical features.

Combining different data sources.

Multi-class classification

henryliangt commented 1 year ago

D1 IDA

Overview of the problem

Describe how this is a classification problem
Provide some context about why this problem is interesting

Dataset description

Describe the data (how many samples are there, are the columns of the data numeric, categorical)
What challenges do you envision for your dataset (lots of missing values, high-dimensional data, etc.)

Are there features that are highly correlated
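One way to check for highly correlated features is to compute the Pearson correlation for every pair of numeric columns and list the pairs above a cut-off. The following is a minimal Python sketch; the column names, the toy values and the 0.9 threshold are illustrative assumptions, not taken from any dataset in these notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical toy columns (assumption, for illustration only).
toy = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "price_with_tax": [11.0, 22.0, 33.0, 44.0],
    "age": [30.0, 25.0, 41.0, 28.0],
}
print(correlated_pairs(toy))
```

Pairs flagged this way are candidates for dropping one column or for dimensionality reduction; the cut-off itself is a judgment call.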

Evaluation metrics that are planned to be used

Describe how you plan to evaluate your classification model
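For a two-class problem, the usual candidate metrics can all be derived from the confusion matrix. The sketch below is a minimal Python illustration with made-up labels; it is not a prescribed evaluation plan, only a way to see why accuracy alone can mislead when one class is rare.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up imbalanced labels: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(binary_metrics(y_true, y_pred))
```

Here the always-negative model scores 0.9 accuracy but 0.0 recall, which is the argument for reporting more than one metric.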

henryliangt commented 1 year ago

Visualize

outliers
missing values,
data cleaning
high dimension: PCA
Histogram/density estimation.
3 pages in PDF format (submit in HTML?)
Compile to PDF to check the length; submit in HTML (to be confirmed).
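Two of the checks above, missing values and outliers, can be sketched in a few lines of Python before any plotting. This is a rough illustration only: the 1.5 x IQR rule is one common convention among several, and the rows and values are invented.

```python
import statistics

def missing_counts(rows):
    """Count missing values (None or empty string) per column in a list of dicts."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts.setdefault(col, 0)
            if value is None or value == "":
                counts[col] += 1
    return counts

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented rows and values (assumption, for illustration only).
rows = [
    {"age": 30, "city": ""},
    {"age": None, "city": "Sydney"},
]
print(missing_counts(rows))
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 100]))
```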

henryliangt commented 1 year ago

Deliverable One: Marking rubric

A 10/10 report will have:

An interesting, well-defined problem
A complex dataset
A well-written report
A well-thought-out, achievable project plan
An appropriate choice of evaluation metric(s)
A description of any data cleaning and data wrangling
Visualisation of all the relevant features in the dataset
The appropriate type of plot for the data being explored
Excellent explanations of what the plots mean

henryliangt commented 1 year ago

Deliverable 2: Final presentation and report Week 12, 12 mins

Summarise your key findings succinctly

Communicate your ideas effectively

Broad audience with no statistics/ML background

Elevator pitch key findings

20 Slides

20 seconds/slide (auto play) = 6m 40s

ppt auto play

R Markdown: autoplay is set as an output option in the YAML header; units are milliseconds.

output:
  xaringan::moon_reader:
    nature:
      autoplay: 20000
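The timing arithmetic can be double-checked with a small Python helper. It is only a sanity check on the numbers in these notes (20 slides at 20 seconds each); the function name is made up.

```python
def autoplay_plan(n_slides, seconds_per_slide):
    """Return the autoplay interval in milliseconds and the total running time."""
    total_seconds = n_slides * seconds_per_slide
    minutes, seconds = divmod(total_seconds, 60)
    return {
        "autoplay_ms": seconds_per_slide * 1000,  # autoplay is given in milliseconds
        "total": f"{minutes}m {seconds:02d}s",
    }

print(autoplay_plan(20, 20))  # {'autoplay_ms': 20000, 'total': '6m 40s'}
```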

Pre-recorded video + live talk + Q&A session + transition = 12 minutes

henryliangt commented 1 year ago

Final Report R markdown -> pdf/html 25%

6-10 pages; submit in HTML

Overview of the problem
Dataset description
Initial data analysis/visualization of the data
Feature engineering
Classification algorithms used
Classification performance evaluation
Conclusion

henryliangt commented 1 year ago

rubric

Used all the data in the prediction or argued a very good case on why some data should be removed.
Performed the appropriate feature engineering and/or dimensionality reduction.
Tried at least 4 classification algorithms.
For classification algorithms requiring parameters, performed parameter tuning.
Report clearly described methods used and the results obtained.
Correctly evaluated the different classification models and considered more than one performance metric.
Report clearly described the model comparison.
Presentation, and any references to the report, were clear and appropriate.
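Parameter tuning is commonly done with k-fold cross-validation: each candidate parameter value is scored on held-out folds and the best average wins. The sketch below illustrates the idea in plain Python with a toy 1-D k-nearest-neighbour classifier; the data, the candidate values and the contiguous (unshuffled, unstratified) folds are simplifying assumptions, and real data should be shuffled or stratified first.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train_idx, test_idx) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def knn_predict(train_x, train_y, x, k):
    """Majority vote (labels 0/1) among the k nearest 1-D training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(xs, ys, k_neighbors, folds=5):
    """Cross-validated accuracy of the toy k-NN classifier."""
    correct = 0
    for train, test in k_fold_indices(len(xs), folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in test:
            correct += knn_predict(tx, ty, xs[i], k_neighbors) == ys[i]
    return correct / len(xs)

def tune(xs, ys, candidates, folds=5):
    """Score every candidate k and return (best_k, scores)."""
    scores = {k: cv_accuracy(xs, ys, k, folds) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Toy, well-separated data (assumption, for illustration only).
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_k, scores = tune(xs, ys, candidates=[1, 3, 5])
print(best_k, scores)
```

The same loop applies to any classifier with a tunable parameter; only `knn_predict` would be swapped out.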

henryliangt commented 1 year ago

Context Ad Clicks Dataset: train.csv, view_log.csv, item_data.csv

view_log: 1 million rows x 6 columns (server_time | device_type | session_id | user_id | item_id)

Combine the 3 tables; 27 citations.
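Combining the tables amounts to joining them on a shared key. The following Python sketch shows a left join of view_log rows onto item_data by item_id. That item_data carries an item_id column, and the `category` column used in the example, are assumptions; the notes only list the view_log columns.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`.

    Assumes `key` is unique in right_rows; left rows without a match are kept unchanged.
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        joined.append({**match, **row})  # left values win on name clashes
    return joined

# Hypothetical rows; "category" is an invented column name.
view_log = [
    {"user_id": 1, "item_id": 10},
    {"user_id": 2, "item_id": 99},
]
item_data = [{"item_id": 10, "category": "shoes"}]
print(left_join(view_log, item_data, "item_id"))
```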

Will users click the ads?

What is the association between users and items?
What is the browsing pattern across user types (phone, time, ...)?
Are there time-preference patterns for certain items?
Which users only generate impressions and never click?

henryliangt commented 1 year ago

Walmart return dataset

What drives the return reason? Multi-class classification task: what will the next user's return reason be?

Are returns randomly distributed over time or space? If not, when or where are they concentrated? Are returns associated with items, discount, price, shipping time, or combinations of these?

One-hot encoding.
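One-hot encoding turns a categorical column into one 0/1 column per category. A minimal Python sketch follows; the return-reason values are invented for illustration and do not come from the Walmart data.

```python
def one_hot(values):
    """Return (categories, rows), where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Invented return reasons (assumption, for illustration only).
reasons = ["damaged", "wrong size", "damaged"]
categories, rows = one_hot(reasons)
print(categories)  # ['damaged', 'wrong size']
print(rows)        # [[1, 0], [0, 1], [1, 0]]
```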

henryliangt commented 1 year ago

CTR In Advertisement

test: (128858, 14). Features are clear and the target is "is_click": 0 (no), 1 (yes).

Has missing values.

Advertisers want to know whether spending money on digital advertising is worthwhile. A higher CTR (click-through rate) represents more interest in that specific campaign, whereas a lower CTR can show that the ad may not be as relevant.
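CTR is conventionally the number of clicks divided by the number of impressions. With a 0/1 `is_click` column, that is simply the mean of the column, as in this small Python sketch (the values are made up).

```python
def click_through_rate(is_click):
    """CTR = clicks / impressions for a list of 0/1 click indicators."""
    if not is_click:
        raise ValueError("need at least one impression")
    return sum(is_click) / len(is_click)

# Made-up impressions: one click out of four.
print(click_through_rate([0, 0, 1, 0]))  # 0.25
```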

Binary classification task: click or not.
For what type of audience should we pay for click ads?
What type of audience suits impression ads?

henryliangt commented 1 year ago

Ad Display/Click Data on Taobao.com

raw_sample: 1.1 million rows x 5 columns
(1) user: user ID (int)
(2) time_stamp: timestamp (bigint; 1494032110 stands for 2017-05-06 08:55:10)
(3) adgroup_id: ad group ID (int)
(4) pid: scenario
(5) noclk: 1 for no click, 0 for click
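The example timestamp only maps to 08:55:10 when read as UTC+8; in UTC the same value is 00:55:10. A Python sketch of parsing one raw_sample row under that assumption is shown below. It also flips `noclk` into a more conventional `clicked` flag; the function name and the sample `pid` value are invented.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are meant to be read in UTC+8, which matches the example in the notes.
UTC_PLUS_8 = timezone(timedelta(hours=8))

def parse_raw_sample_row(user, time_stamp, adgroup_id, pid, noclk):
    """Convert one raw_sample row of strings into typed values."""
    return {
        "user": int(user),
        "time": datetime.fromtimestamp(int(time_stamp), tz=UTC_PLUS_8),
        "adgroup_id": int(adgroup_id),
        "pid": pid,
        "clicked": 1 - int(noclk),  # noclk: 1 = not clicked, 0 = clicked
    }

row = parse_raw_sample_row("1", "1494032110", "2", "scenario_a", "1")
print(row["time"].strftime("%Y-%m-%d %H:%M:%S"))  # 2017-05-06 08:55:10
print(row["clicked"])  # 0
```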

ad_feature
(1) adgroup_id: ad ID (int)
(2) cate_id: category ID
(3) campaign_id: campaign ID
(4) brand: brand ID
(5) customer_id: advertiser ID
Each ad ID corresponds to one item; an item belongs to one category and one brand.

user_profile: 1 million user profiles
(1) userid: user ID
(2) cms_segid: micro group ID
(3) cms_group_id: cms_group_id
(4) final_gender_code: gender, 1 for male, 2 for female
(5) age_level: age_level
(6) pvalue_level: consumption grade, 1: low, 2: mid, 3: high
(7) shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: depth user
(8) occupation: whether the user is a college student, 1: yes, 0: no
(9) new_user_class_level: city level

(1) nick: user ID (int)
(2) time_stamp: timestamp (bigint; 1494032110 stands for 2017-05-06 08:55:10)
(3) cate: category ID (int)

Will a user click this ad or not?
What are the top factors for targeting high-potential users?
How many user-profile groups can be formed?
How many brand clusters can be formed?
Are there any two brands that should share their user data to lower costs?

henryliangt commented 1 year ago

ads click

henryliangt commented 1 year ago

Replication data for: What Makes Them Click: Empirical Analysis of Consumer Demand for Search Advertising

henryliangt commented 1 year ago

[Shopping Mall Paid Search Campaign Dataset]

paid click ads price, cost data

henryliangt commented 1 year ago

CTR Prediction - 2022 DIGIX Global AI Challenge

922Mb, nice

henryliangt commented 1 year ago

AD CLICK PREDICTION 40kb

henryliangt commented 1 year ago

Ad Click Prediction - Classification Problem Use this dataset to predict whether customer will click Ad and make a purchase

henryliangt commented 1 year ago

ADVERTISEMENT CLICK PREDICTION ADVERTISEMENT CLICK PREDICTION USING VARIOUS CLASSIFICATION MODELS

329k

henryliangt commented 1 year ago

Effective Targetting of Advertisments Your brand is a story unfolding across all customer touch points. 39k

henryliangt commented 1 year ago

Effective Targeting of Advertisements Maximizing ROI through precision audience targeting 39k

henryliangt commented 1 year ago

Sales Conversion Optimization How to Cluster Customer data for campaign marketing 18k, 60k

henryliangt commented 1 year ago

A Right Media Mix Can Make the Difference

henryliangt commented 1 year ago

John Wanamaker (1838-1922), department-store magnate, once said, “Half the money I spend on advertising is wasted; the trouble is, I don't know which half.”

50% of Your Advertising Budget is Wasted (And What You Can Do About It!) Over 100 years ago, John Wanamaker, a pioneer in the field of marketing, opined that half of his advertising budget was most likely wasted

henryliangt commented 1 year ago

Organization:

6 people, 6 datasets: each person does the writing and R code and answers the guideline questions for one dataset, then slides.

6 people, 3 datasets: one person breaks down the guideline questions and writes; one person does the R code.

I will take care of the R code and writing. I will provide the literature review and citations.

Overall writing and formatting.

henryliangt commented 1 year ago

Two-person cooperation:

A: guideline -> question steps -> B: code -> A: writing and questions -> writing -> submit

henryliangt commented 1 year ago

Back-to-back double-check system, to prevent misunderstanding of the task and to catch possible errors.

henryliangt commented 1 year ago

speed dating datasets

combine datasets

https://journals-sagepub-com.libezproxy.must.edu.mo/doi/pdf/10.1002/per.768?casa_token=zQj8APInS7IAAAAA:SKBLwPuV9k2a-ykZFWfJrolJslG2XsY15AvSUEYONSi3ypSsghvJKlD8x0efZ-LQcpTpC29A4tGe4Q

https://direct.mit.edu/rest/article-abstract/96/3/444/58173/Contrast-Effects-in-Sequential-Decisions-Evidence

https://journals.lww.com/co-pediatrics/Abstract/2013/06000/Respiratory_syncytial_virus_and_asthma_.11.aspx

henryliangt / usyd

5003 Group task 35% of total A compare study of advertisment performance #56
