sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 3: Assignment 1 #15

Closed emilbrodersen closed 9 years ago

emilbrodersen commented 9 years ago

library("readr")
library("lubridate")
library("zoo")
library("dplyr")
library("ggplot2")
library("stringr")
library("countrycode")
library("maps")

rm(list=ls())

Read the data

moma = read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

Question 1

We use as.yearmon from the "zoo" package to represent "DateAcquired"

as monthly data and convert it to the "Date" class using as.Date.

Then we create a variable "stock" that counts the number of paintings using the

count function from the dplyr package.

moma$month = as.Date(as.yearmon(moma$DateAcquired))
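A minimal sketch of what this conversion does, using three made-up acquisition dates (not taken from the MoMA file): as.yearmon drops the day of the month, and as.Date then returns the first day of that month.

```r
library(zoo)

# Toy acquisition dates (hypothetical values, not from the MoMA file)
acquired <- c("1996-04-09", "1996-04-23", "1997-01-15")

# as.yearmon() keeps only year and month; as.Date() maps that to the 1st of the month
month <- as.Date(as.yearmon(acquired))
print(month)
```

Both April dates end up in the same month, which is what lets us group acquisitions by month later on.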

Now we define a new dataframe called moma.paintings. We use the dplyr package

to filter the data such that we only have the "Painting" Classification

and observations with month acquired data available. By first using the

group_by function to group the data in months, we can summarise the number of

observations in each month. Finally, we use the mutate function to create

a variable named "Total" that accumulates the number of paintings.

moma.paintings = moma %>%
  filter(!is.na(month), Classification == "Painting") %>%
  group_by(month) %>%
  summarise(PaintingsAcquired = n()) %>%
  mutate(Total = cumsum(PaintingsAcquired))
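The same steps can be sketched on a small toy data frame (the rows below are invented for illustration) to check that the cumulative total behaves as described.

```r
library(dplyr)

# Toy data (hypothetical): one row per artwork
toy <- data.frame(
  month = as.Date(c("1930-01-01", "1930-01-01", "1930-03-01", NA, "1930-03-01")),
  Classification = c("Painting", "Painting", "Painting", "Painting", "Print")
)

toy_stock <- toy %>%
  filter(!is.na(month), Classification == "Painting") %>%  # drop missing months and non-paintings
  group_by(month) %>%
  summarise(PaintingsAcquired = n()) %>%                    # paintings acquired per month
  mutate(Total = cumsum(PaintingsAcquired))                 # running stock of paintings

print(toy_stock)
```

The running total relies on the rows being in month order, which summarise provides here because the grouped output is sorted by the grouping variable.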

Question 2

We plot the data using a geom_line since it is appropriate for

illustrating the evolution of the stock of paintings.

p_Q1 = ggplot(data = moma.paintings, aes(x = month, y = Total))
p_Q1 + geom_line(color = "Red") +
  xlab("Date") +
  ylab("Number of Paintings") +
  ggtitle("MoMa Stock of Paintings Since 1929") +
  theme_minimal()

Question 3

We define a new dataframe moma.curator which contains all the paintings

that have a date acquired observation and group them by month and curator

approval. We count the number of paintings acquired in each of these groups every month.

moma.curator <- moma %>%
  filter(!is.na(DateAcquired), Classification == "Painting") %>%
  group_by(month, CuratorApproved) %>%
  summarise(PaintingsAcquired = n())

We then group the data by curator approval and cumulatively sum the number of

paintings in each category ("Approved" or not).

moma.curator1 <- moma.curator %>%
  group_by(CuratorApproved) %>%
  mutate(Total = cumsum(PaintingsAcquired))
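To illustrate how the grouped cumsum works, here is a small sketch on invented monthly counts (the "Y"/"N" labels are placeholder values for this example only). Inside group_by, cumsum restarts for each category, so each category gets its own running total.

```r
library(dplyr)

# Toy monthly counts (hypothetical values)
toy <- data.frame(
  month = as.Date(c("1930-01-01", "1930-01-01", "1930-02-01", "1930-02-01")),
  CuratorApproved = c("N", "Y", "N", "Y"),
  PaintingsAcquired = c(1, 4, 2, 3)
)

toy_totals <- toy %>%
  group_by(CuratorApproved) %>%
  mutate(Total = cumsum(PaintingsAcquired)) %>%  # running total within each category
  ungroup()

print(toy_totals)
```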

We now plot the data

p_Q3 = ggplot(data=moma.curator1, aes(x=as.numeric(month), y=Total, color=CuratorApproved))
p_Q3 = p_Q3 + geom_line()
p_Q3 = p_Q3 + labs(title="Number of paintings in MOMA since 1929", x="Date", y="Number of Paintings")
p_Q3

Question 4

We create a new dataframe named "moma_department" by filtering the original

data such that we only have paintings that are registered to a department.

We again use the group_by function to first summarise the number of paintings

acquired for each month by department. Then we group the data by department

and sum the observations using cumsum.

moma_department <- moma %>%
  filter(!is.na(Department), !is.na(month), Classification == "Painting") %>%
  group_by(month, Department) %>%
  summarise(n = n()) %>%
  group_by(Department) %>%
  mutate(stock = cumsum(n))

Question 5

The plot shows that Department "Painting and Sculpture" has almost all paintings.

p_Q5 <- ggplot(data=moma_department, aes(x=month, y=stock, color=Department))
p_Q5 <- p_Q5 + geom_line() + scale_y_continuous("Stock of paintings") + scale_x_date("")
p_Q5 <- p_Q5 + theme_minimal() + ggtitle("MoMA's paintings since 1929 : By department")
p_Q5

Question 6

We create a dataframe named moma_painters that counts the number of paintings in

the moma stock by each artist. First we filter the NA observations out and

make sure we are only dealing with paintings. Then we group the observations

by "Artist" and summarise the number of works by each artist.

We print the ten artists with the most works by using the "head" function.

moma_painters <- moma %>%
  filter(!is.na(Artist), Classification == "Painting") %>%
  group_by(Artist) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

head(moma_painters, n=10)

Question 7

First we create a dataset named moma_birthplace in which we remove

the observations with no "Artist" or "ArtistBio" and remove all non-paintings.

We select only the variables "Artist" and "ArtistBio"

moma_birthplace = moma %>%
  filter(Artist != "", ArtistBio != "", Classification == "Painting") %>%
  count(Artist, ArtistBio)

We define a new variable in which we remove the parentheses using gsub.

moma_birthplace$Bio = gsub("\\(|\\)", "", moma_birthplace$ArtistBio)

Then we extract only the part of the new variable "Bio" that is the first

word beginning with a capital letter followed by lowercase letters,

since this is how the nationalities are stated in "ArtistBio".

Then we filter out the observations that are not nationalities.

moma_birthplace$Bio = str_extract(moma_birthplace$Bio, "[A-Z].[a-z]+")
moma_birthplace = moma_birthplace %>% filter(Bio != "Nationality" & Bio != "Various")
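A small sketch of the extraction step on three invented "ArtistBio" strings. It uses a slightly simpler pattern, "[A-Z][a-z]+", as an assumption for this example; it is not the exact pattern used above.

```r
library(stringr)

# Toy biography strings (hypothetical, written in the style of "ArtistBio")
bio <- c("(American, 1912-1956)", "(French, born 1938)", "(Nationality unknown)")

# Drop the parentheses, then keep the first capitalised word
clean <- gsub("[()]", "", bio)
nationality <- str_extract(clean, "[A-Z][a-z]+")

print(nationality)
```

Because only the first word is kept, multi-word nationalities are cut short, which is why values such as "South" and "Great" show up among the extracted nationalities.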

We create a dataset that summarises all the nationalities to get an overview.

countries <- moma_birthplace %>% group_by(Bio) %>% summarise(n =n())

We then create a character vector with all 53 nationalities in our data...

Nationality <- c( "American" , "Argentine", "Australian", "Austrian", "Belgian", "Bolivian", "Brazilian", "British", "Canadian", "Chilean", "Colombian", "Congolese", "Croatian", "Cuban",
"Czech", "Danish", "Dutch", "French", "German", "Ghanaian", "Great",
"Guatemalan", "Guyanese", "Haitian", "Hungarian", "Icelandic", "Indian", "Iranian",
"Irish", "Israeli", "Italian", "Japanese", "Korean", "Mexican", "Moroccan",
"Nicaraguan", "Norwegian", "Peruvian", "Polish", "Romanian", "Russian", "South",
"Spanish", "Sudanese", "Swedish", "Swiss", "Tanzanian", "Turkish", "Ukrainian", "Uruguayan", "Venezuelan", "Yugoslav", "Zimbabwean")

...as well as a character vector consisting of all the matching country names.

Nation = c( "US" , "Argentina", "Australia", "Austria", "Belgium", "Bolivia", "Brazil", "Britain", "Canada", "Chile", "Colombia", "Congo", "Croatia", "Cuba",
"Czech Republic", "Denmark", "Netherlands", "France", "Germany", "Ghana", "Great Britain",
"Guatemala", "Guyana", "Haiti", "Hungary", "Iceland", "India", "Iran",
"Ireland", "Israel", "Italy", "Japan", "Korea", "Mexico", "Morocco",
"Nicaragua", "Norway", "Peru", "Poland", "Romania", "Russia", "South Africa",
"Spain", "Sudan", "Sweden", "Switzerland", "Tanzania", "Turkey", "Ukraine", "Uruguay", "Venezuela", "Yugoslavia", "Zimbabwe")

We transform the "Bio" variable in moma_birthplace into the right Country name

using a loop.

for (i in seq_along(Nationality)) {
  moma_birthplace$Bio[moma_birthplace$Bio == Nationality[i]] = Nation[i]
}
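As an alternative sketch to the loop, the same recoding can be written as a lookup in a named vector. The three-entry mapping and the input values below are a made-up excerpt for illustration only.

```r
# Hypothetical three-entry excerpt of the nationality-to-country mapping
lookup <- c(American = "US", Danish = "Denmark", Dutch = "Netherlands")

# Toy input: one value ("Martian") has no entry in the mapping
bio <- c("Danish", "American", "Martian", "Dutch")

# Replace only the values that have an entry; leave the rest unchanged
matched <- bio %in% names(lookup)
bio[matched] <- unname(lookup[bio[matched]])

print(bio)
```

Like the loop, this leaves values without a match untouched, so they can still be inspected afterwards.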

We then add an iso2c variable to moma_birthplace using the "countrycode" package.

moma_birthplace$iso2c = countrycode(moma_birthplace$Bio, origin= "country.name", destination = "iso2c")

moma_birthplace = moma_birthplace %>% count(iso2c)
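A brief sketch of the conversion on a few example names (chosen for illustration, not taken from our data). Names that countrycode cannot match come back as NA with a warning, so it is worth checking for NA values after this step.

```r
library(countrycode)

# Hypothetical inputs: three real country names and one that cannot be matched
country <- c("Denmark", "France", "Japan", "Atlantis")

# Unmatched names are returned as NA (countrycode also warns about them)
iso2c <- countrycode(country, origin = "country.name", destination = "iso2c")

print(iso2c)
```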

We then prepare to plot the data on a world map by creating a dataframe "map"

which contains world coordinates and Country names, which we convert into

iso2c to merge with our data.

map = map_data("world")
map$iso2c = countrycode(map$region, origin= "country.name", destination = "iso2c")

We join the data by iso2c using the left_join function

moma_map = left_join(moma_birthplace,map)
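The direction of the join matters for the map. The sketch below uses two tiny invented tables to show the difference: with the counts on the left, countries without paintings disappear from the result, while with the map outline on the left they are kept with n = NA.

```r
library(dplyr)

# Toy tables (hypothetical values)
counts <- data.frame(iso2c = c("DK", "US"), n = c(3, 120))
outline <- data.frame(
  iso2c = c("DK", "DK", "US", "SE"),
  long  = c(10, 12, -100, 15),
  lat   = c(55, 56, 40, 60)
)

counts_left  <- left_join(counts, outline, by = "iso2c")  # "SE" is dropped
outline_left <- left_join(outline, counts, by = "iso2c")  # "SE" is kept, n is NA

print(counts_left)
print(outline_left)
```

If countries without paintings should still be drawn (for example in grey), putting the map outline on the left is one way to keep them.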

We plot the data coloured by the log of the sum of paintings from each country.

p = ggplot(moma_map, aes(x = long, y = lat, group = group, fill = log(n)))
p + geom_polygon() +
  scale_fill_continuous(low="thistle", high="blue", guide="colorbar", na.value="grey") +
  expand_limits(x = moma_map$long, y = moma_map$lat) +
  labs(title = "MOMA's Stock of Paintings: Nationality of Author - World Map (log)")

Question 8

We first make the width variable in our dataset

moma$width <- str_extract(moma$Dimensions, " \\([0-9].+ *x")
moma$width <- gsub("x", "", moma$width)
moma$width <- as.numeric(gsub("\\(", "", moma$width))

Then we extract the height of each painting

moma$height <- str_extract(moma$Dimensions, " \\([0-9].+ cm")
moma$height <- gsub("\\([0-9].+ x", "", moma$height)
moma$height <- as.numeric(gsub("cm", "", moma$height))

Finally we calculate the surface area of each painting by using the mutate

function to take the product of height and width.

moma <- mutate(moma, area= height*width)
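As an alternative sketch, both centimetre values can be captured in one step with str_match. The strings below are invented examples in the style of the "Dimensions" column, and the pattern assumes exactly two values inside the parentheses; anything else is returned as NA.

```r
library(stringr)

# Toy dimension strings (hypothetical, in the style of the "Dimensions" column)
dims <- c('48 x 66 1/8" (121.9 x 168.9 cm)', '10 x 8" (25.4 x 20.3 cm)', "Variable")

# Capture the two centimetre values inside the parentheses
m <- str_match(dims, "\\(([0-9.]+) x ([0-9.]+) cm\\)")
first_cm  <- as.numeric(m[, 2])
second_cm <- as.numeric(m[, 3])

# Surface area in square centimetres; NA where the pattern did not match
area <- first_cm * second_cm
print(area)
```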

sebastianbarfort commented 9 years ago

Generally a very good assignment.

Very good comments and nice use of dplyr verbs.

There's a small error in Q3. Make sure you know why.

APPROVED