SOCIAL DATA SCIENCE: Assignment 1

Reading packages

library("readr") library("lubridate") library("dplyr") library("ggplot2") library("scales") library("zoo") library("stringr") library("maps") library("knitr")

Read data

df <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

View data

View(df)

Assumption: We assume that what is meant in allp questions is all pieces at MOMA and not only paintings, as question 4-5 otherwise will be impossible to solve, as there is only one department, which has paintings.

QUESTION 1

We have two solutions as the question could be understood both as: 1) Creating a dataframe of the stock of painting at MOMA for each month, each year AND 2) Creating a dataframe of the stock of paintings at MOMA for each month in the year. We will create both dataframes and will use both of these dataframes as it makes sense. You will therefore find two solutions for some of the questions

Solution 1

Arranging the data into intuitive time range (for later plotting)

df1 <-df df1 <- arrange(df1, DateAcquired)

Generate variable containing all months from November 1929 to June, 2015

months<-seq(as.Date("1929/11/01"), as.Date("2015/06/01"), "month") dat<-data.frame(months) dat$month <- month(dat$months) dat$year <- year(dat$months) dat$date <- as.yearmon(paste(dat$year, dat$month, sep = "-"))

The month/year is extracted from the date variable using the "month" fct in the package "lubridate" and put together in a month/year variable

df1$month <- month(df1$DateAcquired) df1$year <- year(df1$DateAcquired) df1$date <- as.yearmon(paste(df1$year, df1$month, sep = "-"))

Create variable for the stock within each month: creating individual weight to each painting

df1$weight <- rep(1, nrow(df1))

using "tapply" function to sum all paintings within same month/year, saved in "stock.month" variable

stock <- tapply(df1$weight, df1$date, sum) stock <- cumsum(stock) date <-unique(df1$date) date <- na.omit(date) stock <- data.frame(date, stock)

Merge the "all months from 1929-2015" variable with information on the stock expansion in individual months

class(stock$date) class(dat$date) stock$date <- as.character(stock$date) dat$date <- as.character(dat$date) new.df1<- left_join(dat, stock)

Finally, before being able to plot all months since 1929 against the expansion in stock of painting for MOMA, the months were no expansion takes place need to get the value of the current stock (as of now all months where no expansion takes place have the value NA, but the stock is in fact in place)

new.df1<-na.locf(new.df1, fromLast = TRUE)

Solution 2

Splitting 'DateAcquired' into two variables - one for month, one for year

df2 <- df df2$Month <- format(df2$DateAcquired, "%m") df2$Year <- format(df2$DateAcquired, "%Y")

Further preparation of data for later use

Making 'CuratorApproved' into a numeric variable

df2$CuratorApproved2 = ifelse(df$CuratorApproved == "Y", 2, 1)

Removing observations with missing values for relevant variables

df3 <- na.omit(df2[,c("Month", "CuratorApproved2", "CuratorApproved", "Department", "Year")])

QUESTION 2

Solution 1

new.df1$date<-as.yearmon(new.df1$date) new.df1$stock <- as.numeric(new.df1$stock)

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

p <- ggplot() + geom_point(aes(x = as.Date(new.df1$date), y =new.df1$stock), alpha=0.5, size=0.2, col="red") p <- p + geom_smooth(method = "lm", aes(x = as.Date(new.df1$date), y =new.df1$stock)) p <- p + theme_minimal() p <- p + labs(title = "Development in MOMA's stock over time \n") + xlab("\n Development over time (monthly)") + ylab("Stock of\n paintings \n") p <- p + theme(legend.position = "none") p <- p + theme(axis.title.y = element_text(angle = 360)) p <- p + theme(axis.text.x = element_text(size=11, angle=70)) p <- p + scale_x_date(breaks = pretty_breaks(15)) p

Solution 2

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

p <- ggplot(df3, aes(x = df3$Month)) p <- p + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings") p + geom_histogram(fill = "red") + theme_minimal()

QUESTION 3

Solution 2

Creating histogram showing stock of painting per month for curator-approved and non-curator approved painting, respectively

p1 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p1 + layer(geom = "histogram", position = "stack") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()

Showing share indstead of amount

p2 <- ggplot(df3, mapping = aes(x = df3$Month,fill=df3$CuratorApproved)) p2 + layer(geom = "histogram", position = "fill") + labs(title = "Stock of paintings per month", x = "Month", y = "Number of paintings", fill = "Approved by curator?") + theme_minimal()

QUESTION 4

Solution 1

Calculating the total number of paintings for each department

df_department1 <-tapply(df1$weight, df1$Department, sum)

turning the output vector into a dataframe for plotting

df_department1 <- data.frame(template=names(df_department1),sum=df_department1) View(df_department1)

Solution 2

Grouping data by department, creating new dataframe

df_department2 = df3 %>% group_by(Department) View(df_department2)

QUESTION 5

Solution 1

Plotting stock of paintings per year for each department

p5 <- ggplot(df_department1, aes(x = reorder(template, sum), y = sum)) p5 <- p5 + geom_bar(stat="identity", alpha=.4, fill="red") + coord_flip() p5 <- p5 + labs(title = "Increase in stock from 1929-2015 by MOMA's departments", x = "Department", y = "Stock") p5 + theme_minimal()

Solution 2

Plotting stock of painting per year for each department

p = ggplot(df_department2, aes(x = df_department2$Year)) p = p + labs(title = "Stock of paintings per Year", x = "Year", y = "Number of paintings") p + geom_bar() + facet_wrap(~ Department, scale = "free") + theme_minimal()

QUESTION 6

Setting blanks and "Unknown photographer" as missing

df$Artist[df$Artist==""] <- NA df$Artist[df$Artist=="Unknown photographer"] <- NA

using pipes to manipulate the data

df %>% filter(!is.na(Artist)) %>% group_by(Artist) %>% summarise(n = n()) %>% arrange(desc(n)) %>% View()

QUESTION 7

Creating variable containing information on birth place of each painter, using the variable ArtistBio

df$country <- strextract(df$ArtistBio, "(.[a-z]") df$country1 <- strextract(df$country, "born .[a-z]")

df$country <- sub(" .*", "", df$country) df$country <- sub(",", "", df$country) df$country <- sub("(", "", df$country) df$country <- sub(")", "", df$country) df$country <- sub(" ", "", df$country)

df$country1 <- sub(".* ", "", df$country1) df$country1 <- sub("(", "", df$country1) df$country1 <- sub(")", "", df$country1) df$country1 <- sub(" ", "", df$country1)

help.code <- unique(df$country) print(help.code)

Recoding country variable

df$country[df$country== "American"] <- "USA" df$country[df$country== "German"] <- "Germany" df$country[df$country== "Austrian"] <- "Austria" df$country[df$country== "Vienna"] <- "Austria" df$country[df$country== "French"] <- "France" df$country[df$country== "Bolivian"] <- "Bolivia" df$country[df$country== "Swiss"] <- "Switzerland" df$country[df$country== "Spanish"] <- "Spain" df$country[df$country== "Catalan"] <- "Spain" df$country[df$country== "Dutch"] <- "Netherlands" df$country[df$country== "Russian"] <- "Russia" df$country[df$country== "Italian"] <- "Italy" df$country[df$country== "British"] <- "UK" df$country[df$country== "Swedish"] <- "Sweden" df$country[df$country== "Mexican"] <- "Mexico" df$country[df$country== "Polish"] <- "Poland" df$country[df$country== "Ukrainian"] <- "Ukraine" df$country[df$country== "Uruguayan"] <- "Uruguay" df$country[df$country== "Venezuelan"] <- "Venezuela" df$country[df$country== "Indian"] <- "India" df$country[df$country== "Vietnamese"] <- "Vietnam" df$country[df$country== "Zimbabwean"] <- "Zimbabwe" df$country[df$country== "Tunisian"] <- "Tunisia" df$country[df$country== "Turkish"] <- "Turkey" df$country[df$country== "Brazilian"] <- "Brazil" df$country[df$country== "Japanese"] <- "Japan" df$country[df$country== "Chinese"] <- "China" df$country[df$country== "Belgian"] <- "Belgium" df$country[df$country== "Argentine"] <- "Argentine" df$country[df$country== "Alsace"] <- "France" df$country[df$country== "Chilean"] <- "Chile" df$country[df$country== "Cuban"] <- "Cuba" df$country[df$country== "Danish"] <- "Denmark" df$country[df$country== "Guatemalan"] <- "Guatemala" df$country[df$country== "Colombian"] <- "Colombia" df$country[df$country== "Croatian"] <- "Croatia" df$country[df$country== "Egyptian"] <- "Egypt" df$country[df$country== "Finnish"] <- "Finland" df$country[df$country== "Greek"] <- "Grecce" df$country[df$country== "Haitian"] <- "Haiti" df$country[df$country== "Hungarian"] <- "Hungary" df$country[df$country== "Iranian"] <- "Iran" df$country[df$country== "Latvian"] <- "Latvia" df$country[df$country== "Pakistani"] <- "Pakistan" df$country[df$country== "Peruvian"] <- "Peru" df$country[df$country== "Portuguese"] <- "Portugal" df$country[df$country== "Romanian"] <- "Romania" df$country[df$country== "Scottish"] <- "Scotland" df$country[df$country== "Yugoslav"] <- "Yugoslavia" df$country[df$country== "Russia"] <- "USSR" df$country[df$country== "Canadian"] <- "Canada" df$country[df$country== "Norwegian"] <- "Norway" df$country[df$country== "Paraguayan"] <- "Paraguay" df$country[df$country== "Ecuadorian"] <- "Ecuador" df$country[df$country== "Slovenian"] <- "Slovenia" df$country[df$country== "Israeli"] <- "Israel" df$country[df$country== "Thai"] <- "Thailand" df$country[df$country== "Korean"] <- "Korea" df$country[df$country== "Osaka"] <- "Japan" df$country[df$country== "Taiwanese"] <- "Taiwan" df$country[df$country== "Anglo-Irish"] <- "Ireland" df$country[df$country== "English"] <- "UK" df$country[df$country== "Irish"] <- "Ireland" df$country[df$country== "Ethiopian"] <- "Ethiopia" df$country[df$country== "Panamanian"] <- "Panama" df$country[df$country== "Cambodian"] <- "Cambodia" df$country[df$country== "Slovak"] <- "Slovakia" df$country[df$country== "Bulgarian"] <- "Bulgaria" df$country[df$country== "Guyanese"] <- "Guyana" df$country[df$country== "Costa"] <- "Costa Rica" df$country[df$country== "Sudanese"] <- "Sudan" df$country[df$country== "Senegalese"] <- "Senegal" df$country[df$country== "Malian"] <- "Mali" df$country[df$country== "Georgian"] <- "Georgia" df$country[df$country== "Kenyan"] <- "Kenya" df$country[df$country== "Nigerian"] <- "Nigeria" df$country[df$country== "Barcelona"] <- "Spain" df$country[df$country== "Azerbaijani"] <- "Azerbaijan" df$country[df$country== "Malaysian"] <- "Malaysia" df$country[df$country== "Lithuaniann"] <- "Lithuania" df$country[df$country== "Namibian"] <- "Namibia" df$country[df$country== "Ghanaian"] <- "Ghana" df$country[df$country== "Filipino"] <- "Philippines" df$country[df$country== "Kyrgyzstani"] <- "Kyrgyzstan" df$country[df$country== "Ugandan"] <- "Uganda" df$country[df$country== "Zagreb"] <- "Croatia" df$country[df$country== "Kazakhstani"] <- "Kazakhstan" df$country[df$country== "Lebanese"] <- "Lebanon" df$country[df$country== "Singaporean"] <- "Singapore" df$country[df$country== "Nowegian"] <- "Norway" df$country[df$country== "Afgan"] <- "Afghanistan" df$country[df$country== "Cameroonian"] <- "Cameroon" df$country[df$country== "Czechoslovakian"] <- "Czechoslovakia" df$country[df$country== "Mauritanian"] <- "Mauritania" df$country[df$country== "Syrian"] <- "Syria" df$country[df$country== "Rwandan"] <- "Rwanda" df$country[df$country== "Albanian"] <- "Albania" df$country[df$country== "Ivorian"] <- "Côte d'Ivoire" df$country[df$country== "Serbian"] <- "Serbia" df$country[df$country== "Tajik"] <- "Tajikistan" df$country[df$country== "Welsh"] <- "Wales" df$country[df$country== "Saudi"] <- "Saudi Arabia" df$country[df$country== "Iraqi"] <- "Iraq" df$country[df$country== "Palestinian"] <- "Palestine"

Making sure that country of origin - not nationality - is mapped ("born in" has higher priority than stated nationality)

df <- within(df, country1 <- ifelse(is.na(country1), country, country1))

df$weight <- rep(1, nrow(df)) map.df <- tapply(df$weight, df$country1, sum) # now, the amount of paintings per country is to be calculated map.df <- data.frame(template=names(map.df),sum=map.df) colnames(map.df) <- c("country", "Paintings") View(map.df)

map.data <- map_data("world") # now, the two datasets (1: long/lat, 2:number of paintings) can be merged colnames(map.data) <- c("long", "lat", "group", "order", "country", "subregion")

map.df$country <- as.character(map.df$country) map.data$country <- as.character(map.data$country) merge.map <- merge(map.data, map.df, all.x =T) # all.x=TRUE keeps unmatched rows

merge.map <- arrange(merge.map, group, order) merge.map$logpainting<-log(merge.map$Paintings) # As the USA is a major outlier

mapping the countries where the artists in MOMAs collection were born

p7 <- ggplot(merge.map, aes(x = long, y = lat, group=group)) + geom_polygon(aes(fill = logpainting), colour = "black") + expand_limits() + scale_fill_gradient("Paintings (log)") + theme_minimal() + labs(x = NULL, y = NULL, title = "Total amount of paintings per country at MOMA")

QUESTION 8

Extracting the width and the length of the paintings (if only measurement = ", => NA)

df$width <- strextract(df$Dimensions, "(.[0-9] x") df$width <- sub("(", "", df$width) df$width <- sub(" ._", "", df$width) df$width <- as.numeric(df$width) head(df$width)

df$hight <- strextract(df$Dimensions, "(.[0-9].)") df$hight <- sub(").", "", df$hight) df$hight <- strextract(df$hight, " .[0-9] cm") df$hight <- sub("x ", "", df$hight) df$hight <- sub(" cm", "", df$hight) df$hight <- sub(" x.*", "", df$hight) df$hight <- as.numeric(df$hight)

calculating the area (cm)

df$area <- df$hight * df$width

plot.area <- data.frame(df$ï»¿Title, df$Artist, df$area) colnames(plot.area) <- c("title", "artist", "area") plot.area <- arrange(plot.area, desc(area)) plot.area[1:10,]

plotting the results

plot.area1 <- arrange(plot.area, desc(area)) plot.area1 <- plot.area[1:10,] plot.area1$ranking <- c(10:1) plot.area1$help <- rep(-0.25, nrow(plot.area1)) plot.area1$nameloc <- plot.area1$ranking - 0.3 ranking.lab <- c("Largest Painting in \n MOMAs Collection", 2, 3, 4, 5, 6, 7, 8, 9, 10) ranking.pos <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1) y1 <- c(-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15,-0.15)

attach(plot.area1)

p8 <- ggplot() + geom_point(aes(x = ranking, y = help), alpha=0.5, size = area/40000, col="red") + theme_minimal() + coord_flip() p8 <- p8 + xlim(0,10.3) + ylim(-0.50,0.50) p8 <- p8 + annotate("text", label = title, x = ranking, y = y1, size = 6, angle=0, col="red", fontface="bold", alpha=.5, hjust = 0) p8 <- p8 + theme(axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank())

p8 <- p8 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p8 <- p8 + annotate("text", label = ranking.lab, x = ranking.pos, y = -0.45, size = 6, fontface="bold", alpha=.5) p8 <- p8 + annotate("text", label = plot.area1$artist, x = plot.area1$nameloc, y = y1, size=3, hjust = 0) p8

sebastianbarfort / sds

Group 18: Assignment 1 #17

SOCIAL DATA SCIENCE: Assignment 1

Reading packages

Read data

View data

Assumption: We assume that what is meant in allp questions is all pieces at MOMA and not only paintings, as question 4-5 otherwise will be impossible to solve, as there is only one department, which has paintings.

QUESTION 1

Solution 1

Arranging the data into intuitive time range (for later plotting)

Generate variable containing all months from November 1929 to June, 2015

The month/year is extracted from the date variable using the "month" fct in the package "lubridate" and put together in a month/year variable

Create variable for the stock within each month: creating individual weight to each painting

using "tapply" function to sum all paintings within same month/year, saved in "stock.month" variable

Merge the "all months from 1929-2015" variable with information on the stock expansion in individual months

Solution 2

Splitting 'DateAcquired' into two variables - one for month, one for year

Further preparation of data for later use

Making 'CuratorApproved' into a numeric variable

Removing observations with missing values for relevant variables

QUESTION 2

Solution 1

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

Solution 2

Creating histogram of stock of painting per month, inserting title, renaming x- and y-axis and changing color into red

QUESTION 3

Solution 2

Creating histogram showing stock of painting per month for curator-approved and non-curator approved painting, respectively

Showing share indstead of amount

QUESTION 4

Solution 1

Calculating the total number of paintings for each department

turning the output vector into a dataframe for plotting

Solution 2

Grouping data by department, creating new dataframe

QUESTION 5

Solution 1

Plotting stock of paintings per year for each department

Solution 2

Plotting stock of painting per year for each department

QUESTION 6

Setting blanks and "Unknown photographer" as missing

using pipes to manipulate the data

QUESTION 7

Creating variable containing information on birth place of each painter, using the variable ArtistBio

Recoding country variable

Making sure that country of origin - not nationality - is mapped ("born in" has higher priority than stated nationality)

mapping the countries where the artists in MOMAs collection were born

QUESTION 8

Extracting the width and the length of the paintings (if only measurement = ", => NA)

calculating the area (cm)

plotting the results