Closed ThorsteinTN closed 9 years ago
Generally a good assignment.
You might want to consider using the dplyr
package more systematically throughout your R code. I find that it makes doing data manipulation much easier.
Not the most efficient way you go about question 7. Check my solution.
APPROVED
SOCIAL DATA SCIENCE: Assignment 1
Reading packages
library("readr") library("lubridate") library("dplyr") library("ggplot2") library("scales") library("zoo") library("stringr") library("maps") library("knitr")
Read data
df <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")
View data
View(df)
Assumption: We assume that what is meant in allp questions is all pieces at MOMA and not only paintings, as question 4-5 otherwise will be impossible to solve, as there is only one department, which has paintings.
QUESTION 1
We have two solutions as the question could be understood both as: 1) Creating a dataframe of the stock of painting at MOMA for each month, each year AND 2) Creating a dataframe of the stock of paintings at MOMA for each month in the year. We will create both dataframes and will use both of these dataframes as it makes sense. You will therefore find two solutions for some of the questions
Solution 1
Arranging the data into intuitive time range (for later plotting)
df1 <-df df1 <- arrange(df1, DateAcquired)
Generate variable containing all months from November 1929 to June, 2015
months<-seq(as.Date("1929/11/01"), as.Date("2015/06/01"), "month") dat<-data.frame(months) dat$month <- month(dat$months) dat$year <- year(dat$months) dat$date <- as.yearmon(paste(dat$year, dat$month, sep = "-"))
The month/year is extracted from the date variable using the "month" fct in the package "lubridate" and put together in a month/year variable
df1$month <- month(df1$DateAcquired) df1$year <- year(df1$DateAcquired) df1$date <- as.yearmon(paste(df1$year, df1$month, sep = "-"))
Create variable for the stock within each month: creating individual weight to each painting
df1$weight <- rep(1, nrow(df1))
using "tapply" function to sum all paintings within same month/year, saved in "stock.month" variable
stock <- tapply(df1$weight, df1$date, sum) stock <- cumsum(stock) date <-unique(df1$date) date <- na.omit(date) stock <- data.frame(date, stock)
Merge the "all months from 1929-2015" variable with information on the stock expansion in individual months
class(stock$date) class(dat$date) stock$date <- as.character(stock$date) dat$date <- as.character(dat$date) new.df1<- left_join(dat, stock)
Finally, before being able to plot all months since 1929 against the expansion in stock of painting for MOMA, the months were no expansion takes place need to get the value of the current stock (as of now all months where no expansion takes place have the value NA, but the stock is in fact in place)
new.df1<-na.locf(new.df1, fromLast = TRUE)
Solution 2
Splitting 'DateAcquired' into two variables - one for month, one for year
df2 <- df df2$Month <- format(df2$DateAcquired, "%m") df2$Year <- format(df2$DateAcquired, "%Y")
Further preparation of data for later use
Making 'CuratorApproved' into a numeric variable
df2$CuratorApproved2 = ifelse(df$CuratorApproved == "Y", 2, 1)
Removing observations with missing values for relevant variables
df3 <- na.omit(df2[,c("Month", "CuratorApproved2", "CuratorApproved", "Department", "Year")])
QUESTION 2
Solution 1
new.df1$date<-as.yearmon(new.df1$date) new.df1$stock <- as.numeric(new.df1$stock)
Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red
p <- ggplot() + geom_point(aes(x = as.Date(new.df1$date), y =new.df1$stock), alpha=0.5, size=0.2, col="red") p <- p + geom_smooth(method = "lm", aes(x = as.Date(new.df1$date), y =new.df1$stock)) p <- p + theme_minimal() p <- p + labs(title = "Development in MOMA's stock over time \n") + xlab("\n Development over time (monthly)") + ylab("Stock of\n paintings \n") p <- p + theme(legend.position = "none") p <- p + theme(axis.title.y = element_text(angle = 360)) p <- p + theme(axis.text.x = element_text(size=11, angle=70)) p <- p + scale_x_date(breaks = pretty_breaks(15)) p
Solution 2
Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red
p <- ggplot(df3, aes(x = df3$Month)) p <- p + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings") p + geom_histogram(fill = "red") + theme_minimal()
QUESTION 3
Solution 2
Creating histogram showing stock of painting per month for curator-approved and non-curator approved painting, respectively
p1 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p1 + layer(geom = "histogram", position = "stack") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()
Showing share indstead of amount
p2 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p2 + layer(geom = "histogram", position = "fill") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()
QUESTION 4
Solution 1
Calculating the total number of paintings for each department
df_department1 <-tapply(df1$weight, df1$Department, sum)
turning the output vector into a dataframe for plotting
df_department1 <- data.frame(template=names(df_department1),sum=df_department1) View(df_department1)
Solution 2
Grouping data by department, creating new dataframe
df_department2 = df3 %>% group_by(Department) View(df_department2)
QUESTION 5
Solution 1
Plotting stock of paintings per year for each department
p5 <- ggplot(df_department1, aes(x = reorder(template, sum), y = sum)) p5 <- p5 + geom_bar(stat="identity", alpha=.4, fill="red") + coord_flip() p5 <- p5 + labs(title = "Increase in stock from 1929-2015 by MOMA's departments", x = "Department", y = "Stock") p5 + theme_minimal()
Solution 2
Plotting stock of painting per year for each department
p = ggplot(df_department2, aes(x = df_department2$Year)) p = p + labs(title = "Stock of paintings per Year", x = "Year", y = "Number of paintings") p + geom_bar() + facet_wrap(~ Department, scale = "free") + theme_minimal()
QUESTION 6
Setting blanks and "Unknown photographer" as missing
df$Artist[df$Artist==""] <- NA df$Artist[df$Artist=="Unknown photographer"] <- NA
using pipes to manipulate the data
df %>% filter(!is.na(Artist)) %>% group_by(Artist) %>% summarise(n = n()) %>% arrange(desc(n)) %>% View()
QUESTION 7
Creating variable containing information on birth place of each painter, using the variable ArtistBio
df$country <- strextract(df$ArtistBio, "(.[a-z]") df$country1 <- strextract(df$country, "born .[a-z]")
df$country <- sub(" .*", "", df$country) df$country <- sub(",", "", df$country) df$country <- sub("(", "", df$country) df$country <- sub(")", "", df$country) df$country <- sub(" ", "", df$country)
df$country1 <- sub(".* ", "", df$country1) df$country1 <- sub("(", "", df$country1) df$country1 <- sub(")", "", df$country1) df$country1 <- sub(" ", "", df$country1)
help.code <- unique(df$country) print(help.code)
Recoding country variable
df$country[df$country== "American"] <- "USA" df$country[df$country== "German"] <- "Germany" df$country[df$country== "Austrian"] <- "Austria" df$country[df$country== "Vienna"] <- "Austria" df$country[df$country== "French"] <- "France" df$country[df$country== "Bolivian"] <- "Bolivia" df$country[df$country== "Swiss"] <- "Switzerland" df$country[df$country== "Spanish"] <- "Spain" df$country[df$country== "Catalan"] <- "Spain" df$country[df$country== "Dutch"] <- "Netherlands" df$country[df$country== "Russian"] <- "Russia" df$country[df$country== "Italian"] <- "Italy" df$country[df$country== "British"] <- "UK" df$country[df$country== "Swedish"] <- "Sweden" df$country[df$country== "Mexican"] <- "Mexico" df$country[df$country== "Polish"] <- "Poland" df$country[df$country== "Ukrainian"] <- "Ukraine" df$country[df$country== "Uruguayan"] <- "Uruguay" df$country[df$country== "Venezuelan"] <- "Venezuela" df$country[df$country== "Indian"] <- "India" df$country[df$country== "Vietnamese"] <- "Vietnam" df$country[df$country== "Zimbabwean"] <- "Zimbabwe" df$country[df$country== "Tunisian"] <- "Tunisia" df$country[df$country== "Turkish"] <- "Turkey" df$country[df$country== "Brazilian"] <- "Brazil" df$country[df$country== "Japanese"] <- "Japan" df$country[df$country== "Chinese"] <- "China" df$country[df$country== "Belgian"] <- "Belgium" df$country[df$country== "Argentine"] <- "Argentine" df$country[df$country== "Alsace"] <- "France" df$country[df$country== "Chilean"] <- "Chile" df$country[df$country== "Cuban"] <- "Cuba" df$country[df$country== "Danish"] <- "Denmark" df$country[df$country== "Guatemalan"] <- "Guatemala" df$country[df$country== "Colombian"] <- "Colombia" df$country[df$country== "Croatian"] <- "Croatia" df$country[df$country== "Egyptian"] <- "Egypt" df$country[df$country== "Finnish"] <- "Finland" df$country[df$country== "Greek"] <- "Grecce" df$country[df$country== "Haitian"] <- "Haiti" df$country[df$country== "Hungarian"] <- "Hungary" df$country[df$country== "Iranian"] <- "Iran" df$country[df$country== "Latvian"] <- "Latvia" df$country[df$country== "Pakistani"] <- "Pakistan" df$country[df$country== "Peruvian"] <- "Peru" df$country[df$country== "Portuguese"] <- "Portugal" df$country[df$country== "Romanian"] <- "Romania" df$country[df$country== "Scottish"] <- "Scotland" df$country[df$country== "Yugoslav"] <- "Yugoslavia" df$country[df$country== "Russia"] <- "USSR" df$country[df$country== "Canadian"] <- "Canada" df$country[df$country== "Norwegian"] <- "Norway" df$country[df$country== "Paraguayan"] <- "Paraguay" df$country[df$country== "Ecuadorian"] <- "Ecuador" df$country[df$country== "Slovenian"] <- "Slovenia" df$country[df$country== "Israeli"] <- "Israel" df$country[df$country== "Thai"] <- "Thailand" df$country[df$country== "Korean"] <- "Korea" df$country[df$country== "Osaka"] <- "Japan" df$country[df$country== "Taiwanese"] <- "Taiwan" df$country[df$country== "Anglo-Irish"] <- "Ireland" df$country[df$country== "English"] <- "UK" df$country[df$country== "Irish"] <- "Ireland" df$country[df$country== "Ethiopian"] <- "Ethiopia" df$country[df$country== "Panamanian"] <- "Panama" df$country[df$country== "Cambodian"] <- "Cambodia" df$country[df$country== "Slovak"] <- "Slovakia" df$country[df$country== "Bulgarian"] <- "Bulgaria" df$country[df$country== "Guyanese"] <- "Guyana" df$country[df$country== "Costa"] <- "Costa Rica" df$country[df$country== "Sudanese"] <- "Sudan" df$country[df$country== "Senegalese"] <- "Senegal" df$country[df$country== "Malian"] <- "Mali" df$country[df$country== "Georgian"] <- "Georgia" df$country[df$country== "Kenyan"] <- "Kenya" df$country[df$country== "Nigerian"] <- "Nigeria" df$country[df$country== "Barcelona"] <- "Spain" df$country[df$country== "Azerbaijani"] <- "Azerbaijan" df$country[df$country== "Malaysian"] <- "Malaysia" df$country[df$country== "Lithuaniann"] <- "Lithuania" df$country[df$country== "Namibian"] <- "Namibia" df$country[df$country== "Ghanaian"] <- "Ghana" df$country[df$country== "Filipino"] <- "Philippines" df$country[df$country== "Kyrgyzstani"] <- "Kyrgyzstan" df$country[df$country== "Ugandan"] <- "Uganda" df$country[df$country== "Zagreb"] <- "Croatia" df$country[df$country== "Kazakhstani"] <- "Kazakhstan" df$country[df$country== "Lebanese"] <- "Lebanon" df$country[df$country== "Singaporean"] <- "Singapore" df$country[df$country== "Nowegian"] <- "Norway" df$country[df$country== "Afgan"] <- "Afghanistan" df$country[df$country== "Cameroonian"] <- "Cameroon" df$country[df$country== "Czechoslovakian"] <- "Czechoslovakia" df$country[df$country== "Mauritanian"] <- "Mauritania" df$country[df$country== "Syrian"] <- "Syria" df$country[df$country== "Rwandan"] <- "Rwanda" df$country[df$country== "Albanian"] <- "Albania" df$country[df$country== "Ivorian"] <- "Côte d'Ivoire" df$country[df$country== "Serbian"] <- "Serbia" df$country[df$country== "Tajik"] <- "Tajikistan" df$country[df$country== "Welsh"] <- "Wales" df$country[df$country== "Saudi"] <- "Saudi Arabia" df$country[df$country== "Iraqi"] <- "Iraq" df$country[df$country== "Palestinian"] <- "Palestine"
Making sure that country of origin - not nationality - is mapped ("born in" has higher priority than stated nationality)
df <- within(df, country1 <- ifelse(is.na(country1), country, country1))
df$weight <- rep(1, nrow(df)) map.df <- tapply(df$weight, df$country1, sum) # now, the amount of paintings per country is to be calculated map.df <- data.frame(template=names(map.df),sum=map.df) colnames(map.df) <- c("country", "Paintings") View(map.df)
map.data <- map_data("world") # now, the two datasets (1: long/lat, 2:number of paintings) can be merged colnames(map.data) <- c("long", "lat", "group", "order", "country", "subregion")
map.df$country <- as.character(map.df$country) map.data$country <- as.character(map.data$country) merge.map <- merge(map.data, map.df, all.x =T) # all.x=TRUE keeps unmatched rows
merge.map <- arrange(merge.map, group, order) merge.map$logpainting<-log(merge.map$Paintings) # As the USA is a major outlier
mapping the countries where the artists in MOMAs collection were born
p7 <- ggplot(merge.map, aes(x = long, y = lat, group=group)) + geom_polygon(aes(fill = logpainting), colour = "black") + expand_limits() + scale_fill_gradient("Paintings (log)") + theme_minimal() + labs(x = NULL, y = NULL, title = "Total amount of paintings per country at MOMA")
p7
QUESTION 8
Extracting the width and the length of the paintings (if only measurement = ", => NA)
df$width <- strextract(df$Dimensions, "(.[0-9] x") df$width <- sub("(", "", df$width) df$width <- sub(" ._", "", df$width) df$width <- as.numeric(df$width) head(df$width)
df$hight <- strextract(df$Dimensions, "(.[0-9].)") df$hight <- sub(").", "", df$hight) df$hight <- strextract(df$hight, " .[0-9] cm") df$hight <- sub("x ", "", df$hight) df$hight <- sub(" cm", "", df$hight) df$hight <- sub(" x.*", "", df$hight) df$hight <- as.numeric(df$hight)
calculating the area (cm)
df$area <- df$hight * df$width
plot.area <- data.frame(df$
Title
, df$Artist, df$area) colnames(plot.area) <- c("title", "artist", "area") plot.area <- arrange(plot.area, desc(area)) plot.area[1:10,]plotting the results
plot.area1 <- arrange(plot.area, desc(area)) plot.area1 <- plot.area[1:10,] plot.area1$ranking <- c(10:1) plot.area1$help <- rep(-0.25, nrow(plot.area1)) plot.area1$nameloc <- plot.area1$ranking - 0.3 ranking.lab <- c("Largest Painting in \n MOMAs Collection", 2, 3, 4, 5, 6, 7, 8, 9, 10) ranking.pos <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1) y1 <- c(-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15)
attach(plot.area1)
p8 <- ggplot() + geom_point(aes(x = ranking, y = help), alpha=0.5, size = area/40000, col="red") + theme_minimal() + coord_flip() p8 <- p8 + xlim(0,10.3) + ylim(-0.50,0.50) p8 <- p8 + annotate("text", label = title, x = ranking, y = y1, size = 6, angle=0, col="red", fontface="bold", alpha=.5, hjust = 0) p8 <- p8 + theme(axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank())
p8 <- p8 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p8 <- p8 + annotate("text", label = ranking.lab, x = ranking.pos, y = -0.45, size = 6, fontface="bold", alpha=.5) p8 <- p8 + annotate("text", label = plot.area1$artist, x = plot.area1$nameloc, y = y1, size=3, hjust = 0) p8