michelleg06 / Educational_Outcomes_Ecuador

Educational and Labour Networks

Networks in Ecuador


Data

The dataset used can be downloaded at: https://drive.google.com/drive/folders/1V55ahzgc2SWl3GkebYUg814jbh7UQ-7p?usp=sharing. The "data" folder contains two files:

  1. data_ecuador_annual.zip: the raw and unmerged panel data.
  2. ecuador_data.rds: the merged panel data, with an added year column.

Some information on the panel that we have:

We have identified the variables below as most important to our discussion:

Define cluster convention

Clusters:

  • Socioeconomic
  • Psychosocial (preferences, subjective experience)
  • Environmental (access to physical and technological infrastructure)

† All cluster values can be relevant, and part of the task at hand is to figure out which ones are best to include, and how to do so.

‡ These are the variables used in our first conceptual model, and which we think are the most relevant ones to start our analysis with.

The code -- in Python

The Python code should follow a set-up similar to the existing R code. Note that, for now, our variables of interest are those marked with a ‡ above. This means that questions of imputation, modelling, and cleaning apply chiefly to these variables. As time goes on, the list of relevant variables will likely change, partly through input from the literature review team.

  1. SaveAndMergePanel.py: This file loads the yearly panel, merges it into a single table, and stores it in a useful format.
    • What would be the best format to use here, for importing in step 2?
  2. TranslationAndCleaning.py: Translates the Spanish column names into English, adds some necessary columns, and fixes some typos. This is where cleaning and defining of new variables takes place.
  3. Imputation.py: Currently empty, but this is where we will impute the required variables.
    • A very important step, and we are unsure how to proceed here. What is the most statistically rigorous way to impute these values? Also requires some input from step 4.
  4. GraphingAndDescriptiveStatistics.py: For the generation of graphs and descriptive statistics. The results from this step are likely to influence choices made in steps 3 and 5.
    • Work has to be done to understand what data is missing, and why.
    • Some work also has to be done to understand which levels of clustering are most relevant, and for which variables (e.g., is someone most influenced by the average level of education in their household, neighborhood, or city?). This, too, is a question that descriptive statistics can help answer.
  5. Modeling.py: Contains the hierarchical mixed model and some tests associated with its creation. Also contains the simpler fixed-effects model based on a small selection of variables. We are aiming for a model that incorporates time effects, clustering effects, and some core explanatory variables.
    • Some technical work can already be done to try various mixed models, and see how they respond to the data.
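
The merge in step 1 could be sketched as below. The data here is a synthetic stand-in (the real script would read the yearly files extracted from data_ecuador_annual.zip, and the column name is hypothetical). Parquet is one candidate answer to the storage-format question: it is compact, typed, and fast to re-import with pandas.

```python
import pandas as pd

# Stand-ins for two yearly survey extracts (hypothetical columns).
df_2019 = pd.DataFrame({"id": [1, 2], "anios_escolaridad": [6, 9]})
df_2020 = pd.DataFrame({"id": [1, 2], "anios_escolaridad": [7, 9]})

# Add the year column before stacking, mirroring ecuador_data.rds.
yearly = {2019: df_2019, 2020: df_2020}
panel = pd.concat(
    [df.assign(year=year) for year, df in yearly.items()],
    ignore_index=True,
)
print(panel)
# panel.to_parquet("data/ecuador_panel.parquet")  # fast to re-import in step 2
```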
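
For Imputation.py, one minimal baseline (not a final answer to the rigour question) is filling missing values with the cluster median, falling back to the overall median. A more statistically rigorous option would be multiple imputation (e.g. MICE), which propagates imputation uncertainty into the downstream model. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "household_id": [1, 1, 2, 2],
    "income": [100.0, np.nan, 80.0, np.nan],  # hypothetical variable
})

# Fill within-household first, then fall back to the overall median
# for households with no observed value at all.
df["income"] = (
    df.groupby("household_id")["income"]
    .transform(lambda s: s.fillna(s.median()))
    .fillna(df["income"].median())
)
print(df["income"].tolist())  # [100.0, 100.0, 80.0, 80.0]
```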
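
As a starting point for the experimentation mentioned above, the hierarchical model in step 5 could be tried with statsmodels' MixedLM: a random intercept per cluster (here "household") and year as a fixed effect. The data and variable names are synthetic and illustrative, not the real panel.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_households, n_years = 30, 3
df = pd.DataFrame({
    "household": np.repeat(np.arange(n_households), n_years),
    "year": np.tile(np.arange(n_years), n_households),
})
# Outcome = fixed time trend + household-level random intercept + noise.
intercepts = rng.normal(0.0, 1.0, n_households)
df["years_schooling"] = (
    6.0
    + 0.5 * df["year"]
    + intercepts[df["household"].to_numpy()]
    + rng.normal(0.0, 0.3, len(df))
)

# Random intercept per household, year as a fixed effect.
fit = smf.mixedlm("years_schooling ~ year", df, groups=df["household"]).fit()
print(fit.params)  # the "year" coefficient should recover the 0.5 trend
```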

The code -- in R

Some notes:

Possible next steps

  • Visualization of distributions of relevant variables
  • Intracluster correlation
  • Run a simple hierarchical analysis (multilevel analysis)
  • Define an interaction matrix
  • Make an edgelist where 1 indicates a similar activity between two agents and 0 means no similar activity
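
The intracluster correlation step could start from the classic one-way ANOVA decomposition, ICC(1), which measures how much of the outcome's variance sits at the cluster level. This sketch uses synthetic balanced data with an illustrative "household" cluster.

```python
import numpy as np
import pandas as pd

def icc1(df, group_col, value_col):
    """ANOVA-based ICC(1); assumes balanced groups."""
    groups = [g[value_col].to_numpy() for _, g in df.groupby(group_col)]
    n = len(groups)             # number of clusters
    k = len(groups[0])          # observations per cluster
    grand = df[value_col].mean()
    msb = k * sum((g.mean() - grand) ** 2 for g in groups) / (n - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Synthetic check: between-cluster sd 2, within-cluster sd 1,
# so the true ICC is 4 / (4 + 1) = 0.8.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "household": np.repeat(np.arange(50), 4),
    "y": np.repeat(rng.normal(0.0, 2.0, 50), 4) + rng.normal(0.0, 1.0, 200),
})
print(round(icc1(df, "household", "y"), 2))
```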
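
The proposed edgelist could be built as below: an edge weight of 1 when two agents report the same activity, 0 otherwise. The "activity" column is a hypothetical stand-in for the real labour variable.

```python
from itertools import combinations
import pandas as pd

agents = pd.DataFrame({
    "agent": ["a", "b", "c"],
    "activity": ["farming", "farming", "retail"],
})

# One row per unordered pair of agents; weight 1 iff activities match.
edges = pd.DataFrame(
    [
        (i, j, int(ai == aj))
        for (i, ai), (j, aj) in combinations(
            agents[["agent", "activity"]].itertuples(index=False), 2
        )
    ],
    columns=["source", "target", "weight"],
)
print(edges)  # (a, b) share "farming", the other pairs do not
```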