Open henryliangt opened 1 year ago
Clustering: classes are unknown.
Classification: classes are predefined.
Examples: tumour / no tumour; winning party out of two in an election; credit card fraud.
Can select your own dataset
Can be found in a repository
UCI Machine learning repository
Kaggle datasets
Your own workplace
Others, can be creative.
Not recommended: data that requires a lot of feature engineering and pre-processing, e.g. images, time series.
10,000+ rows. Messy: lots of missing values. Complex: 100+ features, a mix of numeric and categorical features.
Combining different data sources.
Multi-class classification
Describe how this is a classification problem
Provide some context about why this problem is interesting
Describe the data (how many samples are there, are the columns of the data numeric, categorical)
What challenges do you envision for your dataset (lots of missing values, high-dimensional data, etc.)
Describe how you plan to evaluate your classification model
outliers
missing values,
data cleaning
high dimension: PCA
Histogram/density estimation.
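A quick way to quantify the "missing values" item before any cleaning is the share of missing entries per column. A minimal sketch in plain Python; the rows, the column names `price` and `city`, and the set of markers treated as missing are all made up for illustration:

```python
def missing_share(rows, columns, missing=(None, "", "NA")):
    """Return the fraction of rows in which each column is missing."""
    total = len(rows)
    shares = {}
    for column in columns:
        # A value counts as missing if the key is absent or holds a marker.
        n_missing = sum(1 for row in rows if row.get(column) in missing)
        shares[column] = n_missing / total if total else 0.0
    return shares


# Toy rows (hypothetical data, not from any of the candidate datasets).
rows = [
    {"price": 120, "city": "Sydney"},
    {"price": None, "city": "Perth"},
    {"price": 80, "city": ""},
    {"price": None, "city": "Hobart"},
]

shares = missing_share(rows, ["price", "city"])
print(shares)
```

Columns with a high share are candidates for imputation or removal, which then has to be argued in the report.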
3 pages. PDF format, or submit in HTML?
Compile to PDF to check the length, then submit in HTML (to be confirmed).
A 10/10 report will have:
An interesting, well-defined problem
A complex dataset
A well-written report
A well-thought-out, achievable project plan
Appropriate choice of evaluation metric(s)
Description of any data cleaning and data wrangling
Visualisation of all the relevant features in the dataset
The appropriate type of plot for the data being explored
Excellent explanations of what the plots mean
20 seconds/slide (auto play) = 6m 40s
pre-recorded + live talk + Q&A session + transition = 12 minutes
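The timing can be sanity-checked with a little arithmetic. A minimal sketch, assuming the 6m 40s figure is the whole auto-play deck; the slide count is inferred from that figure and is not stated in the notes:

```python
SECONDS_PER_SLIDE = 20            # from the notes: 20 seconds/slide, auto play
RECORDED_SECONDS = 6 * 60 + 40    # from the notes: 6m 40s
TOTAL_SLOT_SECONDS = 12 * 60      # from the notes: 12 minutes in total


def fmt(seconds):
    """Format a number of seconds as 'Xm YYs'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m {secs:02d}s"


# Inferred, not given: how many slides fit into the recorded part.
slide_count = RECORDED_SECONDS // SECONDS_PER_SLIDE
# Time left for the live talk, Q&A session and transition.
remaining = TOTAL_SLOT_SECONDS - RECORDED_SECONDS

print(slide_count)      # 20
print(fmt(remaining))   # 5m 20s
```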
6-10 pages, submit in HTML
Overview of the problem
Dataset description
Initial data analysis/visualization of the data
Feature engineering
Classification algorithms used
Classification performance evaluation
Conclusion
Used all the data in the prediction or argued a very good case on why some data should be removed.
Performed the appropriate feature engineering and/or dimensional reduction.
Tried at least 4 classification algorithms.
For classification algorithms requiring parameters, performed parameter tuning.
Report clearly described methods used and the results obtained.
Correctly evaluated the different classification models and considered more than one performance metric.
Report clearly described the model comparison.
Presentation, and any references to the report, were clear and appropriate.
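The requirement to consider more than one performance metric matters most when classes are imbalanced, as click data usually is. A minimal sketch in plain Python that derives accuracy, precision, recall and F1 from the confusion counts; the labels below are toy values, not results from any dataset:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 2 clicks in 10 impressions, and a model that never predicts a click.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy looks respectable while recall and F1 are zero, which is why accuracy alone would be a misleading basis for comparing models.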
Context Ad Clicks Dataset: train.csv, view_log.csv, item_data.csv
item = 13.2k rows X 6 col (item_id | item_price | category_1 | category_2 | category_3 | product_type)
train = 23.7k rows X 7 col (impression_id | impression_time | user_id | app_code | os_version | is_4G | is_click)
view_log = 1 million rows X 5 col (server_time | device_type | session_id | user_id | item_id)
3 sheets to combine; 27 citations.
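Combining the three sheets comes down to two joins: view_log to item_data on `item_id`, then the per-user result to train on `user_id`. A minimal sketch in plain Python; the column names are the ones listed in the notes, while the rows and the derived features `n_priced_views` and `mean_viewed_price` are made up for illustration:

```python
from collections import defaultdict

# Toy rows (hypothetical values; only the column names come from the notes).
item_data = [
    {"item_id": 1, "item_price": 100},
    {"item_id": 2, "item_price": 300},
]
view_log = [
    {"user_id": "u1", "item_id": 1},
    {"user_id": "u1", "item_id": 2},
    {"user_id": "u2", "item_id": 2},
]
train = [
    {"impression_id": "a", "user_id": "u1", "is_click": 0},
    {"impression_id": "b", "user_id": "u2", "is_click": 1},
    {"impression_id": "c", "user_id": "u3", "is_click": 0},
]

# Join 1: look up the price of every viewed item (view_log -> item_data).
price_by_item = {row["item_id"]: row["item_price"] for row in item_data}
viewed_prices = defaultdict(list)
for view in view_log:
    price = price_by_item.get(view["item_id"])
    if price is not None:
        viewed_prices[view["user_id"]].append(price)

# Join 2: attach per-user aggregates to each impression (train -> view_log).
combined = []
for row in train:
    prices = viewed_prices.get(row["user_id"], [])
    combined.append({
        **row,
        "n_priced_views": len(prices),  # views whose item has a known price
        "mean_viewed_price": sum(prices) / len(prices) if prices else None,
    })

for row in combined:
    print(row)
```

Users with no browsing history (`u3` here) end up with missing features, so the combined table inherits a missing-value problem. A real pipeline would probably also restrict each user's views to those before `impression_time` so that later behaviour does not leak into the features.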
Will users click the ads?
What is the association between users and items?
What is the browsing pattern among user types (phone, time ...)?
Is there a time-preference pattern for certain items?
Which users only generate impressions and never click?
Walmart return dataset
What drives the return reasons? Multi-class classification task: what will the next user's return reason be?
Are returns randomly distributed over time or space? If not, when or where are they concentrated? Are returns associated with items, discounts, price, shipping time, or combinations of these?
One-hot encoding.
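One-hot encoding turns a categorical column into one 0/1 column per category. A minimal sketch in plain Python, with made-up category values:

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical product categories, for illustration only.
product_category = ["shoes", "toys", "shoes", "garden"]
categories, encoded = one_hot(product_category)

print(categories)  # ['garden', 'shoes', 'toys']
for row in encoded:
    print(row)
```

Each row contains exactly one 1. With many distinct categories the number of columns grows quickly, which feeds back into the high-dimension concern.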
46.3k rows X 15 col ( session_id | DateTime | user_id | product | campaign_id | webpage_id | product_category_1 | product_category_2 | user_group_id | gender | age_level | user_depth | city_development_index | var_1 | is_click)
test: (128858, 14). Features are clear and the target is "is_click": 0 (No), 1 (Yes).
missing value.
Advertisers want to know whether spending their money on digital advertising is worthwhile. A higher click-through rate (CTR) represents more interest in that specific campaign, whereas a lower CTR can show that the ad may not be as relevant.
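CTR is clicks divided by impressions, so it can be read directly off a 0/1 `is_click` column. A minimal sketch in plain Python; the campaign names and click flags are made up:

```python
def click_through_rate(is_click):
    """Share of impressions that were clicked; is_click holds 0/1 flags."""
    if not is_click:
        return 0.0
    return sum(is_click) / len(is_click)


# Hypothetical click flags per campaign, one flag per impression.
clicks_by_campaign = {
    "campaign_a": [0, 0, 1, 0],
    "campaign_b": [0, 0, 0, 0, 0, 0, 0, 1],
}

ctr = {
    name: click_through_rate(flags)
    for name, flags in clicks_by_campaign.items()
}
print(ctr)  # {'campaign_a': 0.25, 'campaign_b': 0.125}
```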
binary classification task, click or not.
For what type of audience should we pay for click ads?
For what type of audience should we use impression ads?
Ad Display/Click Data on Taobao.com
raw_sample: 1.1 million rows X 5 columns
(1) user: User ID(int);
(2) time_stamp: time stamp(Bigint, 1494032110 stands for 2017-05-06 08:55:10);
(3) adgroup_id: adgroup ID(int);
(4) pid: scenario;
(5) noclk: 1 for not click, 0 for click;
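The `time_stamp` column holds Unix seconds. The example value matches 2017-05-06 08:55:10 only at a UTC+8 offset (in UTC it is 00:55:10), so the sketch below assumes China Standard Time. That offset is an inference from the example, not something the dataset description states:

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are meant to be read in UTC+8.
CHINA_STANDARD_TIME = timezone(timedelta(hours=8))


def to_local(time_stamp, tz=CHINA_STANDARD_TIME):
    """Convert Unix seconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(time_stamp, tz=tz)


local = to_local(1494032110)
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-05-06 08:55:10

# Simple time features for looking at browsing or click patterns.
hour_of_day = local.hour
day_of_week = local.weekday()  # Monday is 0
print(hour_of_day, day_of_week)
```

Using the wrong offset would shift every hour-of-day feature by eight hours, so it is worth confirming before any time-preference analysis.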
ad_feature
(1) adgroup_id:Ad ID(int) ;
(2) cate_id:category ID;
(3) campaign_id:campaign ID;
(4) brand:brand ID;
(5) customer_id: Advertiser ID;
Each ad ID corresponds to an item; an item belongs to a category and to a brand.
user_profile: 1 million user profiles
(1) userid: user ID;
(2) cms_segid: Micro group ID;
(3) cms_group_id: cms_group_id;
(4) final_gender_code: gender, 1 for male, 2 for female;
(5) age_level: age_level;
(6) pvalue_level: Consumption grade, 1: low, 2: mid, 3: high;
(7) shopping_level: Shopping depth, 1: shallow user, 2: moderate user, 3: deep user;
(8) occupation: whether the user is a college student, 1: yes, 0: no;
(9) new_user_class_level: City level
(1) nick: User ID(int);
(2) time_stamp: time stamp(Bigint, 1494032110 stands for 2017-05-06 08:55:10);
(3) cate: category ID(int);
Will a user click this ad or not?
What are the top factors for targeting high-potential users?
Into how many groups can the user profiles be clustered?
Into how many clusters can the brands be grouped?
Are there any two brands that should share their user data to lower costs?
[Shopping Mall Paid Search Campaign Dataset]
paid click ads price, cost data
CTR Prediction - 2022 DIGIX Global AI Challenge
922Mb, nice
AD CLICK PREDICTION 40kb
John Wanamaker (1838-1922), department-store magnate, once said, “Half the money I spend on advertising is wasted; the trouble is, I don't know which half.”
50% of Your Advertising Budget is Wasted (And What You Can Do About It!) Over 100 years ago, John Wanamaker, a pioneer in the field of marketing, opined that half of his advertising budget was most likely wasted
Organization:
6 people - 6 datasets: 1 person does the writing and R and answers the guideline questions, then slides.
6 people - 3 datasets: one person breaks down the guideline questions and writes + one person does the R.
I will take care of the R + writing. I will provide the literature review and citations.
Overall writing and format.
2-person cooperation:
A: guideline -> question steps -> B: code -> A: writing and questions -> writing -> submit
Back-to-back double-check system, to prevent misunderstanding of the task and to remove possible errors.
combine datasets
Goal:
Solve a classification problem using techniques from the course.
2 stages
Plan & EDA (10%, week 7)
Presentation (10%)
Final Report (15%)
Sunday evening