sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 4: Assignment 1 #4

Closed sunecasp closed 9 years ago

sunecasp commented 9 years ago
# Assignment 1 ----

# Load libraries
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(ggmap)
library(mapproj)
library(countrycode)

#
# Question 1 ----
#

# Load data
df = read_csv("D:/NewDrive/Backup/Polit/Social Data Science/Artworks.csv")

# Fix BOM corruption of Title column name
names(df)[1] = "Title"

# Fix character encoding
df$Title = str_conv(df$Title, "UTF-8")
df$Artist = str_conv(df$Artist, "UTF-8")
df$ArtistBio = str_conv(df$ArtistBio, "UTF-8")
df$Dimensions = str_conv(df$Dimensions, "UTF-8")

#### IMPORTANT NOTE ####
# If by "paintings" in the assignment text is meant actual paintings, the following filter
#  should be applied to the data set. If it more generally refers to artworks (that is, the
#  entirety of the data set), then commenting out the line will result in the assignments
#  being solved for all artworks in the data set.
#### END OF NOTE    ####

# Filter only paintings
df = filter(df, Classification == "Painting")

# Create year/month variables for grouping
df$AcquiredYear = year(df$DateAcquired)
df$AcquiredMonth = month(df$DateAcquired)

# Create data frame of stock by month in year
df.stock.month = df %>%
   group_by(AcquiredYear, AcquiredMonth) %>% # group by year/month
   na.omit() %>% # omit missing observations
   summarise(
      flow = n() # count flow of paintings
   ) %>%
   mutate(AcquiredDate = as.Date(sprintf("%04d-%02d-%02d", AcquiredYear, AcquiredMonth, 1))) %>% # create date variable for year/month
   ungroup() %>%
   mutate(stock = cumsum(flow)) # calculate stock of paintings

# Create data frame of stock by year - same as above, but only by year
df.stock.year = df %>%
   group_by(AcquiredYear) %>%
   na.omit() %>%
   summarise(
      flow = n()
   ) %>%
   mutate(AcquiredDate = as.Date(sprintf("%04d-%02d-%02d", AcquiredYear, 1, 1))) %>%
   ungroup() %>%
   mutate(stock = cumsum(flow))

# Create data frame of stock by curator approval and year - again, same as above
df.stock.year.curator = df %>%
   group_by(CuratorApproved, AcquiredYear) %>%
   na.omit() %>%
   summarise(
      flow = n()
   ) %>%
   mutate(AcquiredDate = as.Date(sprintf("%04d-%02d-%02d", AcquiredYear, 1, 1))) %>%
   ungroup() %>%
   mutate(stock = cumsum(flow)) %>%
   ungroup()

# To fix the problem of missing observations in some years, we loop over the data set
# and create "dummy" observations with the last observed stock in order to produce
# a sensible graph.

# We first do the process for non-curator approved paintings
last.stock = 0

for (year in min(df.stock.year.curator$AcquiredYear):max(df.stock.year.curator$AcquiredYear)) {
   obs = filter(df.stock.year.curator, AcquiredYear == year, CuratorApproved == "N")
   if (nrow(obs) == 0) { # check if observation exists within the year
      # if not, create a new observation with the last observed stock
      df.stock.year.curator = rbind(df.stock.year.curator,
                            data.frame(CuratorApproved = "N",
                                       AcquiredYear = year,
                                       flow = 0,
                                       AcquiredDate = as.Date(sprintf("%04d-%02d-%02d", year, 1, 1)),
                                       stock = last.stock))
   }
   else { # in case it exists, update the last observed stock variable
      last.stock = obs$stock
   }
}

# We repeat the process for curator approved paintings
last.stock = 0

for (year in min(df.stock.year.curator$AcquiredYear):max(df.stock.year.curator$AcquiredYear)) {
   obs = filter(df.stock.year.curator, AcquiredYear == year, CuratorApproved == "Y")
   if (nrow(obs) == 0) {
      df.stock.year.curator = rbind(df.stock.year.curator,
                                    data.frame(CuratorApproved = "Y",
                                               AcquiredYear = year,
                                               flow = 0,
                                               AcquiredDate = as.Date(sprintf("%04d-%02d-%02d", year, 1, 1)),
                                               stock = last.stock))
   }
   else {
      last.stock = obs$stock
   }
}

# Finally, the data set must be re-arranged in order for the graph to display correctly after
# adding the dummy observations in years with missing data
df.stock.year.curator = df.stock.year.curator %>%
   group_by(CuratorApproved, AcquiredYear) %>%
   arrange(CuratorApproved, AcquiredYear)

#
# Question 2 ----
#

# Plotting the stock of paintings by year with bars
p = ggplot(df.stock.year, aes(x = AcquiredDate, y = stock)) + geom_bar(fill = "red", stat = "identity")
p = p + labs(x = "Date", y = "Stock of Paintings", title = "Acquired Paintings") + theme_minimal()
plot(p)

# Reasoning for geom
#  To represent the increasing stock, we choose a geom with a coloured area to clearly indicate that
#  the stock gets larger and larger as time goes by.
#  Choosing a bar diagram over an area (line with filled area below) adds some "air" to the graph.
#  As the time range is quite large, the bar diagram by year becomes almost continous like a
#  line/area geom.

#
# Question 3 ----
#

# Plot of the stock of paintings by year and curator approval status with bars
p = ggplot(df.stock.year.curator, aes(x = AcquiredDate, y = stock, fill = CuratorApproved)) + geom_bar(stat = "identity")
p = p + labs(x = "Date", y = "Stock of Paintings", title = "Acquired Paintings") + theme_minimal()
plot(p)

#
# Question 4 ----
#

df.department = df %>%
   group_by(Department) %>% # group by department
   summarise(
      stock = n() # count stock of paintings
   )

# Merge the architecture & design departments
df.department$Department[df.department$Department == "Architecture & Design - Image Archive"] = "Architecture & Design"

df.department = df.department %>%
   group_by(Department) %>%
   summarise(
      stock = sum(stock)
   )

#
# Question 5 ----
#

p = ggplot(data = df.department, aes(x = Department, y = stock))
p = p + geom_bar(stat = "identity", position = "identity", fill = "steelblue") + coord_flip() + theme_minimal()
# Add a neat label with the specific value of each stock next to the respective bar
p = p + geom_text(aes(x = Department, y = stock, label = format(df.department$stock)), hjust = -0.2, size = 4, color = "#666666")
p = p + labs(x = "Department", y = "Stock of Paintings", title = "Stock of Paintings by Department")
p

# Highest increase:
#  Painting & Sculpture with 2193 paintings

#
# Question 6 ----
#

df.artist = df %>%
   filter(
      !Artist %in% c("", "Unknown photographer") # filter out missing artist specification
   ) %>%
   group_by(Artist) %>% # group by artist
   summarise(
      stock = n() # count number of paintings
   ) %>%
   arrange(-stock) %>% # order descending
   head(10) # grab first 10 observations

df.artist

#
# Question 7 ----
#

df$Origin = sub( # extract nationality/birthplace from bio (far from perfect but at least only uses 1 group expression)
   ".*?(?:born)*(?!To|to)(?<!To |to )([A-Z][\\w.-]+(\\s[A-Z][\\w.]+)?).*",
   "\\1", df$ArtistBio, perl = TRUE
)

df.origin = filter(df, # filter out the missing/invalid observations
                   !grepl("(born|est|active|founded|unknown|Unknown)", Origin),
                   !grepl("\\d", Origin),
                   Origin != "",
                   Origin != "November",
                   Origin != "Nationality",
                   Origin != "Various",
                   Origin != "Active",
                   Origin != "Jackson Center")

replacement = list( # huge replacement list converting nationality into country
   "Austrian" = "Austria",
   "French" = "France",
   "American" = "USA",
   "German" = "Germany",
   "Dutch" = "The Netherlands",
   "Italian" = "Italy",
   "Swedish" = "Sweden",
   "British" = "UK",
   "Japanese" = "Japan",
   "Argentine" = "Argentina",
   "Brazilian" = "Brazil",
   "Swiss" = "Switzerland",
   "Luxembourgish" = "Luxembourg",
   "Spanish" = "Spain",
   "Russian" = "Russia",
   "Iranian" = "Iran",
   "Finnish" = "Finland",
   "Canadian" = "Canada",
   "Danish" = "Denmark",
   "Belgian" = "Belgium",
   "English" = "UK",
   "Morrocan" = "Morocco",
   "Columbian" = "Colombia",
   "Australian" = "Australia",
   "Mexican" = "Mexico",
   "Yugoslav" = "Yugoslavia",
   "Scottish" = "UK",
   "Czech" = "Czech Republic",
   "Osaka" = "Japan",
   "Hungarian" = "Hungary",
   "Polish" = "Poland",
   "Slovenian" = "Slovenia",
   "Chilean" = "Chile",
   "Anglo-Irish" = "Ireland",
   "Latvian" = "Latvia",
   "Catalan" = "Spain",
   "Greek" = "Greece",
   "Israeli" = "Israel",
   "Icelandic" = "Iceland",
   "Croatian" = "Croatia",
   "Norwegian" = "Norway",
   "Ukranian" = "Ukraine",
   "Cuban" = "Cuba",
   "Finish" = "Finland",
   "Romanian" = "Romania",
   "Venezuelan" = "Venezuela",
   "Uruguayan" = "Uruguay",
   "Ukrainian" = "Ukraine",
   "Georgian" = "Georgia",
   "Thai" = "Thailand",
   "Algerian" = "Algeria",
   "Colombian" = "Colombia",
   "Guatemalan" = "Guatemala",
   "Indian" = "India",
   "Chinese" = "China",
   "Irish" = "Ireland",
   "Costa Rican" = "Costa Rica",
   "Korean" = "Korea",
   "Ethiopian" = "Ethiopia",
   "Kuwaiti" = "Kuwait",
   "Haitian" = "Haiti",
   "South African" = "South Africa",
   "Zimbabwean" = "Zimbabwe",
   "Portuguese" = "Portugal",
   "Panamanian" = "Panama",
   "Ecuadorian" = "Ecuador",
   "Peruvian" = "Peru",
   "Congolese" = "Congo",
   "Malian" = "Mali",
   "Turkish" = "Turkey",
   "Cambodian" = "Cambodia",
   "Bosnian" = "Bosnia",
   "Canadian Inuit" = "Canada",
   "Slovak" = "Slovakia",
   "Estonian" = "Estonia",
   "Pakistani" = "Pakistan",
   "Bulgarian" = "Bulgaria",
   "Bolivian" = "Bolivia",
   "Taiwanese" = "Taiwan",
   "Paraguayan" = "Paraguay",
   "Nicaraguan" = "Nicaragua",
   "Tunisian" = "Tunisia",
   "Sudanese" = "Sudan",
   "Tanzanian" = "Tanzania",
   "Guyanese" = "Guyana",
   "Great Britain" = "UK",
   "Senegalese" = "Senegal",
   "Bahamian" = "Bahamas",
   "South Korean" = "South Korea",
   "Vietnamese American" = "USA",
   "New Zealander" = "New Zealand",
   "Barcelona" = "Spain",
   "Azerbaijani" = "Azerbaijan",
   "Egyptian" = "Egypt",
   "Ivorian" = "Ivory Coast",
   "Malaysian" = "Malaysia",
   "Singaporean" = "Singapore",
   "Serbian" = "Serbia",
   "Lithuanian" = "Lithuania",
   "Nowegian" = "Norway",
   "Tajik" = "Tajikistan",
   "Slovakian" = "Slovakia",
   "Namibian" = "Namibia",
   "Native American" = "USA",
   "Ghanaian" = "Ghana",
   "Afghan" = "Afghanistan",
   "Fred Chichin" = "France",
   "Filipino American" = "USA",
   "Yugoslavian" = "Yugoslavia",
   "Kyrgyzstani" = "Kyrgyzstan",
   "Czechoslovakian" = "Czech Republic",
   "Ugandan" = "Uganda",
   "Cameroonian" = "Cameroon",
   "Welsh" = "UK",
   "Zagreb" = "Croatia",
   "Mauritanian" = "Mauritania",
   "Syrian" = "Syria",
   "Saudi Arabian" = "Saudi Arabia",
   "Kazakhstani" = "Kazakhstan",
   "Rwandan" = "Rwanda",
   "Iraqi" = "Iraq",
   "Indonesian" = "Indonesia",
   "Vietnamese" = "Vietnam",
   "Burkinabe" = "Burkina Faso",
   "Macedonian" = "Macedonia",
   "Bosnian-Croatian" = "Bosnia",
   "Filipino" = "Philippines",
   "Mozambican" = "Mozambique",
   "Angolan" = "Angola",
   "Jordanian" = "Jordan",
   "French-Moroccan" = "Morocco",
   "Moroccan" = "Morocco",
   "Lebanese" = "Lebanon",
   "Nigerian" = "Nigeria",
   "Albanian" = "Albania",
   "Kenyan" = "Kenya")

df.origin$Country = replacement[df.origin$Origin] # apply replacement list
df.origin$Country = ifelse(df.origin$Country == "NULL", df.origin$Origin, df.origin$Country) # if not replaced, origin is already country

df.origin$iso3c = countrycode(df.origin$Country, "country.name", "iso3c") # fetch iso-3 country code

df.origin.country = df.origin %>%
   group_by(iso3c) %>% # group by country code
   summarise(
      stock = n() # and count number of paintings
   )

# Load relevant map-related packages
library(maptools)
library(rgeos)
library(rgdal)
library(jsonlite)
library(RCurl)
library(grid)

devtools::source_gist("33baa3a79c5cfef0f6df") # load theme_map() from github/gist

# World map from naturalearthdata.com - Admin 0 Countries, Cultural 1:50m scale
world = readOGR("D:/NewDrive/Backup/Polit/Social Data Science/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
                layer="ne_50m_admin_0_countries") # "medium-res" world map

world = world[!world$iso_a3 %in% c("ATA"),] # exclude antarctica
world = spTransform(world, CRS("+proj=wintri")) # apply Winkel Tripel projection
map = fortify(world, region = "iso_a3") # convert to data frame

p = ggplot() + geom_map(data = map, map = map, aes(x = long, y = lat, map_id = id, group = group), fill = "#f7fcb9", color = NA)
p = p + geom_map(data = df.origin.country, map = map, color = "white", size = 0.15, aes(fill = log(stock), group = iso3c, map_id = iso3c))
p = p + coord_equal(ratio = 1) # set 1:1 ratio between coordinates
p = p + scale_fill_gradient(low = "#f7fcb9", high = "#31a354", name = "Paintings by Country\n(log scale)")
p = p + theme_map(base_size = 14)
p = p + labs(title = "MoMA Paintings")
p = p + theme(legend.position = "bottom", legend.key = element_blank(), plot.title = element_text(size = 24))
plot(p)

#
# Question 8 ----
#

df.dimensions = df

# Extract dimensions in cm from the parenthesis using a 2-group expression
# Approach one
df.dimensions$DimensionsExtract = sub(".*?\\(([\\d]+\\.?[\\d]*)\\s[xX\xD7]\\s([\\d]+\\.?[\\d]*).*", "\\1 \\2", df.dimensions$Dimensions, perl = TRUE)

# Approach two (grabs last two dimensions instead of first two - unable to account for 3-dimensional specifications)
#df.dimensions$DimensionsExtract = sub(".*[\\s\\(](\\d+\\.?\\d*)\\s[xX\xD7]\\s(\\d+\\.?\\d*)\\scm\\).*", "\\1 \\2", df.dimensions$Dimensions, perl = TRUE)

# Discard faulty observations that require too much work to use
df.dimensions = filter(df.dimensions, grepl("[\\d]+\\.?[\\d]*\\s[\\d]+\\.?[\\d]*$", df.dimensions$DimensionsExtract, perl = TRUE))
df.dimensions = filter(df.dimensions, grepl("^[\\d.]*\\s[\\d.]+$", df.dimensions$DimensionsExtract, perl = TRUE))

# Split up the dimensions into height and width respectively
df.dimensions$Height = as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,1])
df.dimensions$Width = as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,2])

df.dimensions = mutate(df.dimensions, Area = Height * Width) # compute the area

df.dimensions.largest = df.dimensions %>%
   arrange(-Area) %>% # order by area descending
   head(5) # grab first 5

df.dimensions.smallest = df.dimensions %>%
   arrange(Area) %>% # order by area ascending
   head(5) # grab first 5

select(df.dimensions.largest, Title, Artist, Height, Width, Area, Dimensions)
select(df.dimensions.smallest, Title, Artist, Height, Width, Area, Dimensions)

# Comments:
#  We realise that our extraction expression is not perfect. Our expression extracts the first two available dimensions
#  within the parenthesis and treats these as width and height. Some paintings consist of smaller parts and have an
#  overall larger size, but we cannot account for this in a simple fashion.
sebastianbarfort commented 9 years ago

Hi, this looks like a very nice assignment, but I can't read your files as the path refers to something locally on your computer. Can you email me the data so I can run the code?

sebastianbarfort commented 9 years ago

Beautiful assignment!

I really like almost everything you do. The only thing I find supoptimal is your approach to question 7. I would recommend using ggmap and related functions and perhaps attempt another approach to manually fixing the country-name problem (it won't scale to larger data sets). See my solution.

Also, your regex could be done in a more simple way. Again, see my solution.

But generally, really great assignment!

Keep up the good work,

APPROVED