Status: Open. Opened by kgoel59 2 months ago.
Big Data Lifecycle
1. Discovery
   - Learn Business Domain: Understand the industry and context.
   - Interview Sponsor & Identify Stakeholders: Engage with key individuals to gather insights.
   - Define Resources & Goals: Establish objectives and available resources.
   - Identify Potential Data Sources: Locate relevant data sources.
   - Frame the Problem & Develop Initial Hypotheses: Formulate hypotheses, including the Null Hypothesis (H0) and the Alternative Hypothesis (HA or H1).
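To make the H0/HA framing concrete, here is a minimal sketch of a two-sided permutation test on a difference in means. The feature (posts per day), the sample values, and the 0.05 significance level are all assumptions made up for illustration; they are not part of the assignment data.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    H0: both groups come from the same distribution (no difference in means).
    HA: the group means differ.
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical posts-per-day figures, invented for this example only.
human_posts = [3, 5, 4, 6, 2, 5, 4, 3]
bot_posts = [40, 55, 38, 60, 47, 52, 45, 58]

p_value = permutation_test(human_posts, bot_posts)
alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

A small p-value is evidence against H0, not proof of HA; the choice of test and threshold should follow from the hypotheses framed in this phase.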
2. Data Preparation
   - Prepare Sandbox: Set up an environment for data preparation.
   - Perform ETLT (Extract, Transform, Load, Transform): Process data for analysis.
   - Understand Data Details: Examine the data’s structure and quality.
   - Data Conditioning: Address issues like missing values and outliers.
   - Format Data: Prepare data for analysis.
   - Visualize Data: Use plots to explore data patterns.
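The data conditioning step could look roughly like the sketch below, which fills missing values with the median and flags outliers with Tukey's IQR rule. Both techniques are one possible choice, not something the notes prescribe, and the follower counts are hypothetical.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical follower counts with one missing entry and one extreme value.
followers = [120, 95, None, 130, 110, 105, 9000, 125]

filled = impute_median(followers)
flags = iqr_outlier_flags(filled)
outliers = [v for v, is_outlier in zip(filled, flags) if is_outlier]
print(filled)
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept is a domain decision; an extreme follower count may be exactly the signal that separates a non-human profile from a human one.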
3. Model Planning
   - Select Variables: Based on relationships (e.g., correlation matrix) and domain knowledge.
   - Identify Candidate Models: Refer to hypotheses, translate into machine learning models, review literature, and document assumptions.
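As a rough illustration of correlation-based variable selection, the sketch below builds a Pearson correlation matrix and keeps the variables whose absolute correlation with the target meets a threshold. The column names, the values, and the 0.5 threshold are assumptions for the example, and correlation alone should not replace domain knowledge.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a with b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def select_variables(columns, target, threshold=0.5):
    """Keep variables whose |correlation| with the target meets the threshold."""
    return [
        name
        for name in columns
        if name != target
        and abs(pearson(columns[name], columns[target])) >= threshold
    ]


# Hypothetical profile features, invented for this example only.
data = {
    "posts_per_day": [2, 4, 6, 8, 10, 12],
    "account_age_days": [400, 90, 310, 150, 500, 220],
    "is_bot": [0, 0, 0, 1, 1, 1],
}

matrix = correlation_matrix(data)
selected = select_variables(data, target="is_bot")
print(selected)
```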
4. Model Building
   - Create Datasets: Prepare training, validation, and testing datasets.
   - Train and Test Models: Evaluate model performance.
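One simple way to create the three datasets is a seeded shuffle followed by slicing, as sketched below. The 60/20/20 proportions and the seed are assumptions; a stratified split may be more appropriate if human and non-human profiles are imbalanced.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train, validation, test."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Integers stand in for profile records in this example.
profiles = list(range(100))
train, val, test = train_val_test_split(profiles)
print(len(train), len(val), len(test))
```

The validation set is used for model selection and tuning, while the test set is held back until the end so the final performance estimate is not biased by those choices.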
5. Communicate Results
   - Compare Results: Assess against criteria.
   - Articulate Findings: Clearly present results.
   - Discuss Limitations & Recommendations: Provide insights on limitations and suggest improvements.
6. Operationalize
   - Deliverables: Finalize and deliver the project.
   - Pilot Project: Test the model in a real-world scenario.
   - Performance & Constraints: Monitor and address any constraints.
   - Training: Educate new users as needed.
Assignment 2 aims to find misinformation on a social network, i.e., to identify profiles that are mistakenly recorded as human or non-human.