Open LillianYKim opened 3 years ago
Comment any ideas/resources below!
Idea 1: Further exploring covid stuff
County-level Socioeconomic Data for Predictive Modeling of Epidemiological Effects https://github.com/JieYingWu/COVID-19_US_County-level_Summaries
Covid Severity Forecasting https://github.com/Yu-Group/covid19-severity-prediction
Real time covid dataset https://www.nature.com/articles/s41597-020-0448-0
Johns Hopkins dataset https://github.com/QFL2020/COVID_DataHub
Idea 2: Reproductive rights
Reproductive health/rights data https://www.cdc.gov/reproductivehealth/data_stats/index.htm
Abortion statistics in UK during pandemic
Can create datasets with all sorts of reproductive right indicators on state-/county-level https://data.guttmacher.org/counties
Information on reproductive rights and accessibility to abortion in US states https://statusofwomendata.org/explore-the-data/reproductive-rights/
Idea 3: Crime Rate Changes During Pandemic
-caveat: might have to specifically look up states' crime rate info
ncbi article on environmental impacts in 2020 vs. previous years https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7459942/
Great topic idea, and motivation! You've identified some excellent questions to explore, and lots of potential data sources -- I fear this may be too ambitious to try to do it all within the time we have, so I would recommend narrowing your focus. Your planned visualizations make sense, but if they all require different data sources, it may be more appropriate to try to focus on one or two of the data sources you mention.
Love the checklist as a way to plan your schedule, which you can also update and re-adjust as needed going forward!
Excellent planning, team! I really look forward to your blog post.
Update 1: 10/10
REVISED PLAN
I. Direction
We will focus on 3 research questions that are listed below along with their associated datasets:
Data on Air pollution (Bella) https://stats.oecd.org/index.aspx?queryid=72722
Data on Welfare and Public Health effects of environmental pollution (Lillian) https://stats.oecd.org/Index.aspx?DatasetCode=HEALTH_STAT
choropleth (by Bella): leaflet, shiny interactivity to allow the user to drag along the time frame
Infant mortality https://stats.oecd.org/Index.aspx?DatasetCode=HEALTH_STAT
kmeans clustering: interactivity with shiny to allow the user to choose countries/years (should we average infant mortality over time??)
Other Resources
**Not a dataset, but images we could include via a link on the first page of blog to show climate change effects! https://climate.nasa.gov/images-of-change?id=739#739-spring-in-the-kulunda-steppe-russia-kazakhstan-border
Info for blog text later
how climate change disproportionately affects certain areas agriculturally and how that relates to infant mortality https://academic.oup.com/reep/article/12/1/26/4835833
II. Describe your intended final products (this is a rough list and is subject to change)
shiny interactive app (Together) For this product, we plan to allow the user to choose certain environmental variables, public health variables, and regions.
bar/line graphs will further home in on how developing vs. developed countries are disproportionately affected by climate-related disasters depending on whether they are agriculture-based societies. Less favored areas (more effects from climate change) have more prevalent poverty and differences in infant mortality, so we could more starkly compare a chosen group of, say, 5 countries (versus the 50 displayed on the choropleth) to note these differences.
scatterplot (Bella) Something I think I could work on is CO2 emission vs GDP by country and time, as an introductory visualization to our blogpost, because other environmental and social/public health variables are being integrated in the choropleths and the shiny app. This might reveal a more interesting insight at the end of the blogpost project, as we will be able to see how relatively richer countries are affected by those environmental factors compared to how much they contribute to climate change.
III. Schedule
We plan to have a Zoom meeting every Tuesday, and to communicate synchronously/asynchronously via FB message or GitHub issue every Thursday to update one another on the progress each of us has made. Of course, given the workload and the time given for this project, we expect to communicate outside of the regular check-points.
[x] 10/28 Wednesday: decide on topic and/or dataset (Zoom)
[x] 10/29 Thursday: decide on a dataset, discuss final product in detail, write plan (FB message)
[x] Work on data wrangling
[x] 10/30 Friday Plan Update 1
[x] Work on data wrangling / start making viz
[x] 11/3 Tuesday: check if data wrangling is complete, work on data viz/shiny app (Zoom)
[x] 11/5 Thursday: work on data viz/shiny app (FB message / github issue)
[x] 11/6 Friday Status Update 2 : Ideally, should have already started on data viz/shiny app
[x] Work on data viz/shiny app
[x] 11/10 Tuesday: plan to have data viz all prepared (Zoom)
[x] 11/10 Tuesday Status Update 3: Ideally, should have finished creating data viz/shiny app
[x] Write the "report" part of blog
[x] 11/12 Thursday: plan to have main text-portion and data viz all prepared (FB message / github issue)
[x] Write script/practice presentation
[x] 11/17 Tuesday: work on minor revisions and polishing, practice presentation (Zoom)
[x] 11/17 or 19 Presentation
[x] Depending on the date of our presentation, work on last-minute revision
[ ] **11/20 Friday FINAL DUE DATE
Lillian: Annual natural disaster by country https://public.emdat.be/data
UPDATE 2: NOV 6TH 9PM
Bella
Lillian
I was tasked with creating a data visualization for the 2nd question: how has the frequency of natural disasters changed over time, and what are the consequences of natural disasters in each country? I am creating a shiny interactive app that has leaflet in it. Through shiny interactivity, the user will be able to choose the timeframe (year: 1950-2020) and the type of natural disaster (e.g. flood, storm, etc.), which will be reflected in the leaflet map. I finished creating a basic ui-server structure for shiny and wrote code for the appropriate input objects (e.g. sliderInput, selectInput). I also successfully created a separate leaflet (without shiny interactivity) showing how severe the impacts of natural disasters are in each country in each year. However, where I am stuck right now is making leaflet work in the shiny-interactive environment. I am currently trying to integrate leaflet into shiny by consulting several resources (e.g. http://rstudio.github.io/leaflet/shiny.html). Therefore, my next plan is to troubleshoot the leaflet map problem that I am having right now and to make my app more aesthetically pleasing. As long as I figure out how to solve this problem over the weekend, I don't think I will fall behind schedule.
Mythili
I'm exploring the 3rd question: How do climate change ripple effects show differently in developing (i.e. more agriculture-based) versus developed countries? Specifically, how are infant mortality and maternal mortality related to climate changes/disasters (e.g. extreme temperatures)? The topic came up from an article I read that said climate change can impact certain agriculturally-based societies more than others (production of crops, soil health, etc. can be affected by temperature changes and more). Initially, my plan was to do k-means clustering to see the spread of countries and consider which ones have higher infant/maternal mortality compared to others. Is there something that distinguishes these clusters (e.g. whether they're classified as developing vs. developed countries)? Currently, I have finished the clustering graphs for two different decades, the 1980s and the 2010s. Data was provided for each year from 1980-2018; however, it would be cumbersome and unnecessary to cluster all of those points (as it would be hard to distinguish multiple variables like year and country). So I averaged values within each decade, and looked at the first and last decade recorded to see the overall change. After viewing the visualizations, I found they didn't really answer the question as I had envisioned it.
1) The clustering doesn't exactly connect how the infant/maternal mortality could be associated with climate change (since no climate change factor is a variable). This makes it seem kind of isolated from the rest of the tabs.
2) There are only around 40 countries included in the clustering (from the dataset we used), and none of these countries are from Africa, nor from many parts of Asia, where agricultural societies are common. The article itself was a case study on countries in Africa and how climate-change-induced agricultural damages affected countries and communities in poverty.
So I decided to tack on 2 other visualizations to supplement the clustering and connect my tab more to the general topic of our blog (info not included here because this update is getting very long - see next comment for more information on that!).
I achieved the work I expected, in that I started/finished the visualizations I first planned (k-means clustering graphs). I also made progress on refocusing my question and clarifying my visualizations. I still need to recheck the clustering and see if normalization of the values is necessary. I have also now added 2 visualizations (reactive line graphs) to my plate which I will work on this weekend. I'm still on track, as we wanted to finish the visualizations by Tuesday, but now there's slightly more time pressure to get the visualizations done (especially since they will be in Shiny). I would say the same checkpoint is still applicable to me although I'm more "behind" than I expected. I haven't worked as much on Data Science as I could have in the last few days, although I did continue to do data wrangling. If I had spent more time earlier in the week doing the visualizations I would have realized ahead of time what extra visualizations I would need to do and considered how my tab will answer the question posed. Regardless, I reached that point eventually and I will continue to work on my visualizations so that they're hopefully finished by Tuesday (as planned).
I wrote this for the update but then realized the prompt doesn't ask about future plans so i'm just gonna save it here for my own records--feel free to do the same by editing this if any of you'd like :D
Bella
Mythili
-Next Steps:
Visualization A) Just looking at the first and last decade of infant/maternal mortality for each country takes out a lot of information. The infant/maternal mortality changes yearly, and sometimes by large amounts. Thus, on top of the clustering (which will show how countries compare in terms of change in mortality rate), I will do a line graph in Shiny where the user can choose the country and whether they want to view the infant or maternal mortality rate. The graph will then display how the rate has changed over the years between 1980 and 2018. To see why this graph enhances my exploration of the question, consider the 2nd additional visualization I have planned (described below).
Visualization B) I will use natural disaster data from the same site as my infant/maternal mortality data. Specifically, I will look at natural disasters that could directly affect agriculture. Some choices are extreme temperature (which I'm leaning toward), floods, wildfires, etc. I will put this graph on the same tab of the Shiny app as my infant/maternal mortality line graph. That way, when the user chooses a subset of countries, they can compare how, say, extreme temperature changed in certain countries between 1980 and 2018 alongside how infant or maternal mortality changed for those same countries over the same time span.
Why these visualizations? What do they add? First I will note the problems I noticed that necessitated more visualizations. What I noticed from the clustering is that all countries had a drop in infant/maternal mortality rates (expected, as newer technology and medicine and somewhat improved infrastructure would contribute to healthier societies). This made it hard to see if climate change would have any impact on the infant/maternal mortality rates. The comparison between the line graphs will allow us to see more clearly whether extreme temperatures (and their changing behaviour) in certain countries could be associated with the change in infant/maternal mortality of those countries. Did it prevent further improvement in the rates; was there not as much of an effect?
Very thorough update! Glad to hear you're each on track more or less, and are making great progress!
Mythili -- to confirm, so you only have one observation per country contributing to the k-means clustering analysis, right? (e.g., you noted "So I averaged values within each decade, and looked at the first and last decade recorded to see the overall change.", meaning each country is one row and the variables included in the algorithm represent overall change in the various factors from the first decade to the second?)
Update 2: 5/5
UPDATE 3: NOV 10TH 9PM EST
Group-wide initial plan by Tuesday: "ideally must have finished creating all data viz". Next group-wide check-point: Thursday ("should have the major text portion of the blog and the data viz all completed").
Bella (Q1): I was able to create some more leaflets since the last update, and they are all done except for some minor aesthetic modifications I need to make. However, I am having a more difficult time trying to interpret what the data is showing in a broader context. My initial hypothesis was that air quality would have gotten worse over the years, which would then lead to an increase in the DALY scores directly attributable to air quality. This is not what I observed after making the leaflets: according to the data, air quality improved over time, and the DALY scores in general decreased as a logical result. So there are possibly some factors that have positively contributed to air quality over time; still, it is good news that the air quality data itself seems to be consistent with the DALY scores. For further analysis, I plan to wrangle the datasets that I have right now so that one dataset contains the following information: air quality measure, DALY scores, year, and whether a country is relatively developed or underdeveloped (developing). Then, I plan to select a few random countries from each group (developed vs developing) and see how DALY scores and air quality measures change over time (i.e. the rate) through a set of faceted line graphs. This would possibly show whether, for certain countries, the effect of air pollution is more visible in the mortality data. I already have some wrangling done, so I think I should be able to finish by Thursday at the latest, which is still for the most part aligned with our initial plan. I have already noted a couple of remarks about the results, so I think I should be able to have a rough draft of the blog post written up by then as well.
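The wrangling step described above (one table with air quality, DALY, year, and development group, then a rate of change per country) can be sketched as follows. This is only an illustrative Python sketch, not the project's R code: the records, the column names (`pm25`, `daly`), the group labels, and the helper names `merge_by_country_year` and `yearly_rate` are all made up for the example.

```python
# Hypothetical toy records; the real OECD tables have many more rows and columns.
air_quality = [
    {"country": "A", "year": 1990, "pm25": 30.0},
    {"country": "A", "year": 2017, "pm25": 20.0},
    {"country": "B", "year": 1990, "pm25": 50.0},
    {"country": "B", "year": 2017, "pm25": 45.0},
]
daly = [
    {"country": "A", "year": 1990, "daly": 900.0},
    {"country": "A", "year": 2017, "daly": 600.0},
    {"country": "B", "year": 1990, "daly": 1500.0},
    {"country": "B", "year": 2017, "daly": 1400.0},
]
# Assumed labels; how each country is classified is a project decision.
group = {"A": "developed", "B": "developing"}


def merge_by_country_year(air_rows, daly_rows, groups):
    """Inner-join the two tables on (country, year) and attach the group label."""
    daly_lookup = {(r["country"], r["year"]): r["daly"] for r in daly_rows}
    merged = []
    for r in air_rows:
        key = (r["country"], r["year"])
        if key in daly_lookup and r["country"] in groups:
            merged.append({
                "country": r["country"],
                "year": r["year"],
                "pm25": r["pm25"],
                "daly": daly_lookup[key],
                "group": groups[r["country"]],
            })
    return merged


def yearly_rate(rows, country, column):
    """Average change per year in `column` between a country's first and last year."""
    pts = sorted((r["year"], r[column]) for r in rows if r["country"] == country)
    (y0, v0), (y1, v1) = pts[0], pts[-1]
    return (v1 - v0) / (y1 - y0)


merged = merge_by_country_year(air_quality, daly, group)
```

A negative `yearly_rate` for both `pm25` and `daly` would correspond to the pattern described above, where air quality improves and DALY scores fall together. An inner join is used here so that country-years missing from either table are dropped rather than filled in.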
Lillian (Q2): Previously, I had been having difficulty translating the interactive leaflet map into the shiny context. Thanks to Professor Correia, I was able to solve the problem and successfully display the leaflet map in the shiny app. I also made the app more informative, user-friendly and aesthetically pleasing by: a) changing selectInput() to checkboxInput() so that the user can select multiple types of disasters at once; b) moving the ui area to the bottom to allow more space for the leaflet map; c) adding a shiny theme. However, because of the technical problem I had earlier, I was not able to completely finish my shiny app. Right now the only thing left in my shiny app is to add some background information and instructions that may be helpful to the user (e.g. the data source, etc.). I do not think that this delay was substantial enough to adjust the checkpoint, so I would like to keep the schedule as it is. Instead, I will work on the PUG project intensively until our next group-wide checkpoint, which is this Thursday. First, I will finish my shiny app by adding the last touches, as mentioned above. Then, specifically on Wednesday and Thursday, I will focus on writing the text portion of the blog. This may be challenging because there is more missing data on my choropleth than I expected. Therefore, I will spend sufficient time exploring different combinations of variables on my choropleth to see if there are any interesting relationships that stand out.
Mythili (Q3): I started creating the new visualizations I had planned in the last update. I aimed to finish the visualizations by today, and I believe I will be able to (I just need to do the last one). Specifically, for Visualization 1 (Clustering), I standardized the cluster variables so that the clustering is done more accurately and weights each variable equally. For Visualization 2, I completed the Shiny code (an interactive bar graph - note this is a change from the line graph I had initially planned). I coded user interactivity in choosing the countries and choosing the variable displayed (either net change in infant mortality or net change in maternal mortality). For Visualization 3, I planned out more specifically what I was going to do. Initially I thought I would be able to graph extreme temperatures over time; however, once I looked at the data set, I realized it only had the frequency of various natural disasters rather than the actual temperatures. Thus, I decided to switch gears and graph the frequency of different natural disasters between 1980 and 2018 to compare in tandem with the net change in mortality. This way, my visualizations will more clearly answer the question of whether climate change-induced natural disasters are affecting the change in mortality in recent years.
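To illustrate why standardizing matters for the clustering in Visualization 1: infant and maternal mortality can sit on very different numeric scales, so without rescaling the larger-valued variable dominates the distance calculation. The following is a minimal Python sketch with made-up numbers, not the project's R code. The helpers `standardize`, `nearest`, and `kmeans` are hypothetical, and the k-means shown is plain Lloyd's algorithm with hand-picked starting centroids so the result is reproducible.

```python
import math


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its sample SD."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
    )


def kmeans(points, start, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in start]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return [nearest(p, centroids) for p in points]


# Hypothetical decade averages: [infant mortality, maternal mortality] per country.
raw = [[5.0, 10.0], [6.0, 12.0], [30.0, 200.0], [35.0, 250.0]]
z = standardize(raw)
labels = kmeans(z, start=[z[0], z[2]])
```

After standardizing, each column has mean 0 and sample standard deviation 1, so both variables contribute comparably to the distances used for cluster assignment.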
I'm still on schedule as long as I get Visualization 3 done today. However, there might be some spillover in terms of aesthetic edits I might want to make/making the visualizations clearer. The reason I haven't done this yet is that I wasn't able to work on the PUG Project as much this weekend (I was catching up on the reading, doing the lab, and doing homework for other classes). Thus, the questions and considerations of what would be the best visualization are coming up later than initially planned. This time around, I've found that I'm taking a more dynamic approach in doing the visualizations and being open to changing them radically/adding new ones to best answer the question.
In terms of adjusted checkpoints for myself, I propose the following:
Tuesday 11/10: -Finish Shiny code for Visualization 3 -Restrict years shown on bar graphs so visualizations are easier to digest -Consider whether to graph overall change from 1980-2018 in cluster graph -Consider whether to switch to gapminder dataset in R
Wednesday 11/11: -Transfer code for visualizations into Blog index.Rmd file and make sure there are no errors -Review blog post requirements -Start planning out text for Q3 tab (framing the question, how we answered the question, calculations/evidence for claims?) -Meet with Lillian and Bella to split up work on cover page of Blog + review Blog post requirements
Thursday 11/12: -Finish bare bones of text with citations for papers referenced/data packages used -Figure out which code I want visible on tab -Do necessary work on Cover page (add hyperlinks, intro, images)
Friday 11/13 + Weekend: -Finish Blog text, finalize citations -Add aesthetics in -Troubleshoot -Practice presentation alone + with Lillian and Bella (Sunday/Monday)
I made two different cluster graphs - One is for just the 1980s decade, in which each country has their average infant mortality (x-axis) and average maternal mortality (y-axis) graphed. The other graph is for the 2010s, in which, again, each country has their average infant mortality (x-axis) and average maternal mortality (y-axis) graphed. The average was calculated from the yearly values for that specific decade, so the average values in the 1980s graph are an average of the values from 1980, 1981, 1982, ..., 1989. So I used two separate tables, one per cluster graph. Now that I think about it, doing a cluster showing the overall change, with the table you mentioned, might be more useful and comprehensive, so I will consider whether to change what I have currently!
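The two approaches discussed here (a separate table of decade averages per cluster graph, versus a single table with one row of overall change per country) can be sketched as follows. This is an illustrative Python sketch with made-up numbers rather than the project's R code; the record layout and the helper name `decade_means` are assumptions.

```python
from collections import defaultdict

# Hypothetical yearly records: (country, year, infant_mortality, maternal_mortality).
records = [
    ("A", 1980, 12.0, 20.0), ("A", 1989, 10.0, 16.0),
    ("A", 2010, 4.0, 6.0), ("A", 2018, 3.0, 4.0),
    ("B", 1980, 40.0, 90.0), ("B", 1989, 36.0, 80.0),
    ("B", 2010, 15.0, 30.0), ("B", 2018, 13.0, 26.0),
]


def decade_means(rows, start_year):
    """Average each country's yearly values over start_year .. start_year + 9."""
    buckets = defaultdict(list)
    for country, year, infant, maternal in rows:
        if start_year <= year <= start_year + 9:
            buckets[country].append((infant, maternal))
    return {
        country: tuple(sum(col) / len(col) for col in zip(*values))
        for country, values in buckets.items()
    }


# Approach 1: two separate tables, one per cluster graph.
means_1980s = decade_means(records, 1980)
means_2010s = decade_means(records, 2010)

# Approach 2: one row per country, holding the change from the 1980s average
# to the 2010s average for each variable.
net_change = {
    country: tuple(
        late - early
        for early, late in zip(means_1980s[country], means_2010s[country])
    )
    for country in means_1980s
    if country in means_2010s
}
```

With the first approach each graph clusters countries by their levels within one decade; with the second, the clustering input is the change itself, so countries are grouped by how much their rates moved.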
Excellent progress, team! Thorough update.
@goni99 -- I hadn't understood yesterday when we were talking that it was actually that air quality has improved over time. Can you remind me what years you're looking at?
@mysubb -- thanks for the clarification about the clustering. What you did makes sense too, so feel free to leave it as is!
Update 3: 5/5
@katcorr To answer your question itself, I was looking at 1990 to 2005 and 2017. To explain a bit of what was bothering me, my major concern was essentially that the data was generally showing something opposite to what I had hypothesized, and I wasn't quite sure whether it'd be okay to simply explain what I saw and not conduct a further investigation into why that might be the case by, say, analyzing other data and potential factors. Specifically, I had expected that the air quality would have gotten worse over the years (an increase in PM2.5) given that we are experiencing drastic climate change, but it had actually become better (a decrease in PM2.5) in general. As far as I remember, I still do see a logical relationship between the air quality measure and other variables that I think I should be able to explain, so based on what you advised us to do last class, I think I will stick to explaining what the data showed, try out a different additional visualization as explained, and perhaps do some research and propose possible explanations for such results to include in the written part of the post. Thank you!
-Bella
Sorry if that was too long D:
@goni99
Yeah, okay, that does seem opposite of what I would have expected as well given the time frame. And, you're right, if this were not a class project, you probably would want to investigate further with additional data (does the same data from different sources match up?) and factors (e.g. other measures of pollution). But, given the limited timeframe we have, that could be outside the scope of this class project. It would be appropriate to note in your conclusions all of these thoughts you have (e.g., your surprise at the results, future research directions to investigate further, etc.)
SEE THE REVISED PLAN BELOW INITIAL PLAN
I. Direction
We want to switch gears from our mid-semester shiny project and now focus on environmental issues. Environmental pollution and climate disasters are closely related to both public health and social justice. Pollution is one of the risk factors for diseases (e.g. respiratory diseases) that disproportionately affect populations of lower socioeconomic classes. Climate disasters can cause ripple effects such as the displacement of peoples, economic loss, homelessness, etc. We will focus on 4 research questions listed below:
Does climate change affect the global South disproportionately, if at all, compared with the global North?
Are the effects of climate change (specifically air pollution) evident in epidemiological prevalence data or mortality data?
How is the occurrence of climate-related events (e.g. natural disasters) related to the public health of communities?
-More specifically, how has the frequency of natural disasters changed throughout the past couple of decades? What do the numbers for homelessness and displacement due to natural disasters look like and how have those numbers changed over the years?
**Not a dataset, but images we could include via a link on the first page of blog to show climate change effects! https://climate.nasa.gov/images-of-change?id=739#739-spring-in-the-kulunda-steppe-russia-kazakhstan-border
The following are datasets we found on the internet. This is only a tentative list of datasets we plan to potentially use, and is subject to change.
Data on infant mortality as relevant to socioeconomic class/region of the world https://academic.oup.com/reep/article/12/1/26/4835833
Many graphs/data available on climate disasters and associated effects https://ourworldindata.org/natural-disasters
Internal Displacement due to Climate Disasters Worldwide https://www.internal-displacement.org/database/displacement-data
Data on Economic Costs and Deaths due to Specific Climate Disasters (1980 - 2020) https://www.ncdc.noaa.gov/billions/summary-stats
Global and "regional" sea level data measured by multiple satellite altimeter oceanography mission systems (1992 - 2020)
1-1. Some interesting datasets on CO2 emission by country, global temperature change, etc. -- there are datasets that we could potentially use (perhaps a third "variable" since we only got 2 from our initial discussion); we could also write some introductory material based on the information provided here: https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
PM10 and PM2.5 air quality data (2010-2016) https://www.who.int/airpollution/ambient/en/ available as .xlsx on the website
Comparing pre and post covid https://aqicn.org/data-platform/covid19/ -provides 2019 Q1, Q2, Q3, Q4 and 2018, 2017, 2016, 2015 H1 data
II. Describe your intended final products (this is a rough list and is subject to change)
line graph comparing air quality between pre-covid and 2020 (Lillian) For this product, we may use a line graph showing how the air quality of a certain region (or air quality globally) has changed over time from, say, 2015 to June 2020. We may create a grouped bar chart that allows us to compare the air quality of the same period (e.g. March) across different years.
bar/line graphs, or could also be done as a choropleth (Mythili) This will mainly be used to examine the amount of internal displacement, economic loss, and deaths due to climate disasters. We will look at these numbers over the past 3-4 decades (1980-2020) and compare them to separate data for sea levels, climate disaster frequency, etc. The aim is to show the ripple effects of climate disasters and indicate how the extent of the effects might change as the frequency of climate disasters changes.
scatterplot (Bella) Something I think I could work on is CO2 emission vs GDP by country and time, as an introductory visualization to our blogpost, because other environmental and social/public health variables are being integrated in the choropleths and the shiny app. This might reveal a more interesting insight at the end of the blogpost project, as we will be able to see how relatively richer countries are affected by those environmental factors compared to how much they contribute to climate change.
III. Schedule
We plan to have a Zoom meeting every Tuesday, and to communicate synchronously/asynchronously via FB message or GitHub issue every Thursday to update one another on the progress each of us has made. Of course, given the workload and the time given for this project, we expect to communicate outside of the regular check-points.
[x] 10/28 Wednesday: decide on topic and/or dataset (Zoom)
[x] 10/29 Thursday: decide on a dataset, discuss final product in detail, write plan (FB message)
[x] Work on data wrangling
[x] 10/30 Friday Plan Update 1
[x] Work on data wrangling / start making viz
[x] 11/3 Tuesday: check if data wrangling is complete, work on data viz/shiny app (Zoom)
[x] 11/5 Thursday: work on data viz/shiny app (FB message / github issue)
[x] 11/6 Friday Status Update 2 : Ideally, should have already started on data viz/shiny app
[x] Work on data viz/shiny app
[x] 11/10 Tuesday: plan to have data viz all prepared (Zoom)
[x] 11/10 Tuesday Status Update 3: Ideally, should have finished creating data viz/shiny app
[x] Write the "report" part of blog
[x] 11/12 Thursday: plan to have main text-portion and data viz all prepared (FB message / github issue)
[x] Write script/practice presentation
[x] 11/17 Tuesday: work on minor revisions and polishing, practice presentation (Zoom)
[ ] 11/17 or 19 Presentation
[ ] Depending on the date of our presentation, work on last-minute revision
[ ] **11/20 Friday FINAL DUE DATE