sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 18: Assignment 1 #17

Closed ThorsteinTN closed 9 years ago

ThorsteinTN commented 9 years ago

SOCIAL DATA SCIENCE: Assignment 1

Reading packages

library("readr") library("lubridate") library("dplyr") library("ggplot2") library("scales") library("zoo") library("stringr") library("maps") library("knitr")

Read data

df <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

View data

View(df)

Assumption: We assume that what is meant in allp questions is all pieces at MOMA and not only paintings, as question 4-5 otherwise will be impossible to solve, as there is only one department, which has paintings.

QUESTION 1

We have two solutions as the question could be understood both as: 1) Creating a dataframe of the stock of painting at MOMA for each month, each year AND 2) Creating a dataframe of the stock of paintings at MOMA for each month in the year. We will create both dataframes and will use both of these dataframes as it makes sense. You will therefore find two solutions for some of the questions

Solution 1

Arranging the data into intuitive time range (for later plotting)

df1 <-df df1 <- arrange(df1, DateAcquired)

Generate variable containing all months from November 1929 to June, 2015

months<-seq(as.Date("1929/11/01"), as.Date("2015/06/01"), "month") dat<-data.frame(months) dat$month <- month(dat$months) dat$year <- year(dat$months) dat$date <- as.yearmon(paste(dat$year, dat$month, sep = "-"))

The month/year is extracted from the date variable using the "month" fct in the package "lubridate" and put together in a month/year variable

df1$month <- month(df1$DateAcquired) df1$year <- year(df1$DateAcquired) df1$date <- as.yearmon(paste(df1$year, df1$month, sep = "-"))

Create variable for the stock within each month: creating individual weight to each painting

df1$weight <- rep(1, nrow(df1))

using "tapply" function to sum all paintings within same month/year, saved in "stock.month" variable

stock <- tapply(df1$weight, df1$date, sum) stock <- cumsum(stock) date <-unique(df1$date) date <- na.omit(date) stock <- data.frame(date, stock)

Merge the "all months from 1929-2015" variable with information on the stock expansion in individual months

class(stock$date) class(dat$date) stock$date <- as.character(stock$date) dat$date <- as.character(dat$date) new.df1<- left_join(dat, stock)

Finally, before being able to plot all months since 1929 against the expansion in stock of painting for MOMA, the months were no expansion takes place need to get the value of the current stock (as of now all months where no expansion takes place have the value NA, but the stock is in fact in place)

new.df1<-na.locf(new.df1, fromLast = TRUE)

Solution 2

Splitting 'DateAcquired' into two variables - one for month, one for year

df2 <- df df2$Month <- format(df2$DateAcquired, "%m") df2$Year <- format(df2$DateAcquired, "%Y")

Further preparation of data for later use

Making 'CuratorApproved' into a numeric variable

df2$CuratorApproved2 = ifelse(df$CuratorApproved == "Y", 2, 1)

Removing observations with missing values for relevant variables

df3 <- na.omit(df2[,c("Month", "CuratorApproved2", "CuratorApproved", "Department", "Year")])

QUESTION 2

Solution 1

new.df1$date<-as.yearmon(new.df1$date) new.df1$stock <- as.numeric(new.df1$stock)

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

p <- ggplot() + geom_point(aes(x = as.Date(new.df1$date), y =new.df1$stock), alpha=0.5, size=0.2, col="red") p <- p + geom_smooth(method = "lm", aes(x = as.Date(new.df1$date), y =new.df1$stock)) p <- p + theme_minimal() p <- p + labs(title = "Development in MOMA's stock over time \n") + xlab("\n Development over time (monthly)") + ylab("Stock of\n paintings \n") p <- p + theme(legend.position = "none") p <- p + theme(axis.title.y = element_text(angle = 360)) p <- p + theme(axis.text.x = element_text(size=11, angle=70)) p <- p + scale_x_date(breaks = pretty_breaks(15)) p

Solution 2

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

p <- ggplot(df3, aes(x = df3$Month)) p <- p + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings") p + geom_histogram(fill = "red") + theme_minimal()

QUESTION 3

Solution 2

Creating histogram showing stock of painting per month for curator-approved and non-curator approved painting, respectively

p1 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p1 + layer(geom = "histogram", position = "stack") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()

Showing share indstead of amount

p2 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p2 + layer(geom = "histogram", position = "fill") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()

QUESTION 4

Solution 1

Calculating the total number of paintings for each department

df_department1 <-tapply(df1$weight, df1$Department, sum)

turning the output vector into a dataframe for plotting

df_department1 <- data.frame(template=names(df_department1),sum=df_department1) View(df_department1)

Solution 2

Grouping data by department, creating new dataframe

df_department2 = df3 %>% group_by(Department) View(df_department2)

QUESTION 5

Solution 1

Plotting stock of paintings per year for each department

p5 <- ggplot(df_department1, aes(x = reorder(template, sum), y = sum)) p5 <- p5 + geom_bar(stat="identity", alpha=.4, fill="red") + coord_flip() p5 <- p5 + labs(title = "Increase in stock from 1929-2015 by MOMA's departments", x = "Department", y = "Stock") p5 + theme_minimal()

Solution 2

Plotting stock of painting per year for each department

p = ggplot(df_department2, aes(x = df_department2$Year)) p = p + labs(title = "Stock of paintings per Year", x = "Year", y = "Number of paintings") p + geom_bar() + facet_wrap(~ Department, scale = "free") + theme_minimal()

QUESTION 6

Setting blanks and "Unknown photographer" as missing

df$Artist[df$Artist==""] <- NA df$Artist[df$Artist=="Unknown photographer"] <- NA

using pipes to manipulate the data

df %>% filter(!is.na(Artist)) %>% group_by(Artist) %>% summarise(n = n()) %>% arrange(desc(n)) %>% View()

QUESTION 7

Creating variable containing information on birth place of each painter, using the variable ArtistBio

df$country <- strextract(df$ArtistBio, "(.[a-z]") df$country1 <- strextract(df$country, "born .[a-z]")

df$country <- sub(" .*", "", df$country) df$country <- sub(",", "", df$country) df$country <- sub("(", "", df$country) df$country <- sub(")", "", df$country) df$country <- sub(" ", "", df$country)

df$country1 <- sub(".* ", "", df$country1) df$country1 <- sub("(", "", df$country1) df$country1 <- sub(")", "", df$country1) df$country1 <- sub(" ", "", df$country1)

help.code <- unique(df$country) print(help.code)

Recoding country variable

df$country[df$country== "American"] <- "USA" df$country[df$country== "German"] <- "Germany" df$country[df$country== "Austrian"] <- "Austria" df$country[df$country== "Vienna"] <- "Austria" df$country[df$country== "French"] <- "France" df$country[df$country== "Bolivian"] <- "Bolivia" df$country[df$country== "Swiss"] <- "Switzerland" df$country[df$country== "Spanish"] <- "Spain" df$country[df$country== "Catalan"] <- "Spain" df$country[df$country== "Dutch"] <- "Netherlands" df$country[df$country== "Russian"] <- "Russia" df$country[df$country== "Italian"] <- "Italy" df$country[df$country== "British"] <- "UK" df$country[df$country== "Swedish"] <- "Sweden" df$country[df$country== "Mexican"] <- "Mexico" df$country[df$country== "Polish"] <- "Poland" df$country[df$country== "Ukrainian"] <- "Ukraine" df$country[df$country== "Uruguayan"] <- "Uruguay" df$country[df$country== "Venezuelan"] <- "Venezuela" df$country[df$country== "Indian"] <- "India" df$country[df$country== "Vietnamese"] <- "Vietnam" df$country[df$country== "Zimbabwean"] <- "Zimbabwe" df$country[df$country== "Tunisian"] <- "Tunisia" df$country[df$country== "Turkish"] <- "Turkey" df$country[df$country== "Brazilian"] <- "Brazil" df$country[df$country== "Japanese"] <- "Japan" df$country[df$country== "Chinese"] <- "China" df$country[df$country== "Belgian"] <- "Belgium" df$country[df$country== "Argentine"] <- "Argentine" df$country[df$country== "Alsace"] <- "France" df$country[df$country== "Chilean"] <- "Chile" df$country[df$country== "Cuban"] <- "Cuba" df$country[df$country== "Danish"] <- "Denmark" df$country[df$country== "Guatemalan"] <- "Guatemala" df$country[df$country== "Colombian"] <- "Colombia" df$country[df$country== "Croatian"] <- "Croatia" df$country[df$country== "Egyptian"] <- "Egypt" df$country[df$country== "Finnish"] <- "Finland" df$country[df$country== "Greek"] <- "Grecce" df$country[df$country== "Haitian"] <- "Haiti" df$country[df$country== "Hungarian"] <- "Hungary" df$country[df$country== "Iranian"] <- "Iran" df$country[df$country== "Latvian"] <- "Latvia" df$country[df$country== "Pakistani"] <- "Pakistan" df$country[df$country== "Peruvian"] <- "Peru" df$country[df$country== "Portuguese"] <- "Portugal" df$country[df$country== "Romanian"] <- "Romania" df$country[df$country== "Scottish"] <- "Scotland" df$country[df$country== "Yugoslav"] <- "Yugoslavia" df$country[df$country== "Russia"] <- "USSR" df$country[df$country== "Canadian"] <- "Canada" df$country[df$country== "Norwegian"] <- "Norway" df$country[df$country== "Paraguayan"] <- "Paraguay" df$country[df$country== "Ecuadorian"] <- "Ecuador" df$country[df$country== "Slovenian"] <- "Slovenia" df$country[df$country== "Israeli"] <- "Israel" df$country[df$country== "Thai"] <- "Thailand" df$country[df$country== "Korean"] <- "Korea" df$country[df$country== "Osaka"] <- "Japan" df$country[df$country== "Taiwanese"] <- "Taiwan" df$country[df$country== "Anglo-Irish"] <- "Ireland" df$country[df$country== "English"] <- "UK" df$country[df$country== "Irish"] <- "Ireland" df$country[df$country== "Ethiopian"] <- "Ethiopia" df$country[df$country== "Panamanian"] <- "Panama" df$country[df$country== "Cambodian"] <- "Cambodia" df$country[df$country== "Slovak"] <- "Slovakia" df$country[df$country== "Bulgarian"] <- "Bulgaria" df$country[df$country== "Guyanese"] <- "Guyana" df$country[df$country== "Costa"] <- "Costa Rica" df$country[df$country== "Sudanese"] <- "Sudan" df$country[df$country== "Senegalese"] <- "Senegal" df$country[df$country== "Malian"] <- "Mali" df$country[df$country== "Georgian"] <- "Georgia" df$country[df$country== "Kenyan"] <- "Kenya" df$country[df$country== "Nigerian"] <- "Nigeria" df$country[df$country== "Barcelona"] <- "Spain" df$country[df$country== "Azerbaijani"] <- "Azerbaijan" df$country[df$country== "Malaysian"] <- "Malaysia" df$country[df$country== "Lithuaniann"] <- "Lithuania" df$country[df$country== "Namibian"] <- "Namibia" df$country[df$country== "Ghanaian"] <- "Ghana" df$country[df$country== "Filipino"] <- "Philippines" df$country[df$country== "Kyrgyzstani"] <- "Kyrgyzstan" df$country[df$country== "Ugandan"] <- "Uganda" df$country[df$country== "Zagreb"] <- "Croatia" df$country[df$country== "Kazakhstani"] <- "Kazakhstan" df$country[df$country== "Lebanese"] <- "Lebanon" df$country[df$country== "Singaporean"] <- "Singapore" df$country[df$country== "Nowegian"] <- "Norway" df$country[df$country== "Afgan"] <- "Afghanistan" df$country[df$country== "Cameroonian"] <- "Cameroon" df$country[df$country== "Czechoslovakian"] <- "Czechoslovakia" df$country[df$country== "Mauritanian"] <- "Mauritania" df$country[df$country== "Syrian"] <- "Syria" df$country[df$country== "Rwandan"] <- "Rwanda" df$country[df$country== "Albanian"] <- "Albania" df$country[df$country== "Ivorian"] <- "Côte d'Ivoire" df$country[df$country== "Serbian"] <- "Serbia" df$country[df$country== "Tajik"] <- "Tajikistan" df$country[df$country== "Welsh"] <- "Wales" df$country[df$country== "Saudi"] <- "Saudi Arabia" df$country[df$country== "Iraqi"] <- "Iraq" df$country[df$country== "Palestinian"] <- "Palestine"

Making sure that country of origin - not nationality - is mapped ("born in" has higher priority than stated nationality)

df <- within(df, country1 <- ifelse(is.na(country1), country, country1))

df$weight <- rep(1, nrow(df)) map.df <- tapply(df$weight, df$country1, sum) # now, the amount of paintings per country is to be calculated map.df <- data.frame(template=names(map.df),sum=map.df) colnames(map.df) <- c("country", "Paintings") View(map.df)

map.data <- map_data("world") # now, the two datasets (1: long/lat, 2:number of paintings) can be merged colnames(map.data) <- c("long", "lat", "group", "order", "country", "subregion")

map.df$country <- as.character(map.df$country) map.data$country <- as.character(map.data$country) merge.map <- merge(map.data, map.df, all.x =T) # all.x=TRUE keeps unmatched rows

merge.map <- arrange(merge.map, group, order) merge.map$logpainting<-log(merge.map$Paintings) # As the USA is a major outlier

mapping the countries where the artists in MOMAs collection were born

p7 <- ggplot(merge.map, aes(x = long, y = lat, group=group)) + geom_polygon(aes(fill = logpainting), colour = "black") + expand_limits() + scale_fill_gradient("Paintings (log)") + theme_minimal() + labs(x = NULL, y = NULL, title = "Total amount of paintings per country at MOMA")

p7

QUESTION 8

Extracting the width and the length of the paintings (if only measurement = ", => NA)

df$width <- strextract(df$Dimensions, "(.[0-9] x") df$width <- sub("(", "", df$width) df$width <- sub(" ._", "", df$width) df$width <- as.numeric(df$width) head(df$width)

df$hight <- strextract(df$Dimensions, "(.[0-9].)") df$hight <- sub(").", "", df$hight) df$hight <- strextract(df$hight, " .[0-9] cm") df$hight <- sub("x ", "", df$hight) df$hight <- sub(" cm", "", df$hight) df$hight <- sub(" x.*", "", df$hight) df$hight <- as.numeric(df$hight)

calculating the area (cm)

df$area <- df$hight * df$width

plot.area <- data.frame(df$Title, df$Artist, df$area) colnames(plot.area) <- c("title", "artist", "area") plot.area <- arrange(plot.area, desc(area)) plot.area[1:10,]

plotting the results

plot.area1 <- arrange(plot.area, desc(area)) plot.area1 <- plot.area[1:10,] plot.area1$ranking <- c(10:1) plot.area1$help <- rep(-0.25, nrow(plot.area1)) plot.area1$nameloc <- plot.area1$ranking - 0.3 ranking.lab <- c("Largest Painting in \n MOMAs Collection", 2, 3, 4, 5, 6, 7, 8, 9, 10) ranking.pos <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1) y1 <- c(-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15)

attach(plot.area1)

p8 <- ggplot() + geom_point(aes(x = ranking, y = help), alpha=0.5, size = area/40000, col="red") + theme_minimal() + coord_flip() p8 <- p8 + xlim(0,10.3) + ylim(-0.50,0.50) p8 <- p8 + annotate("text", label = title, x = ranking, y = y1, size = 6, angle=0, col="red", fontface="bold", alpha=.5, hjust = 0) p8 <- p8 + theme(axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank())

p8 <- p8 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p8 <- p8 + annotate("text", label = ranking.lab, x = ranking.pos, y = -0.45, size = 6, fontface="bold", alpha=.5) p8 <- p8 + annotate("text", label = plot.area1$artist, x = plot.area1$nameloc, y = y1, size=3, hjust = 0) p8

sebastianbarfort commented 9 years ago

Generally a good assignment.

You might want to consider using the dplyr package more systematically throughout your R code. I find that it makes doing data manipulation much easier.

Not the most efficient way you go about question 7. Check my solution.

APPROVED