Assignment 1
This assignment has been completed by group 22:
Kaspar Pugesgaard, Kasper Wetterslev,
Line Rasmussen and Louise Poulsen.
The dataset contains over 120,000 records of the works that
have been accessioned into MoMA's collection.
Loading relevant libraries.
library("readr")
library("dplyr")
library("ggplot2")
library("zoo")
library("lubridate")
library("stringr")
library("tidyr")
library("rvest")
library("XML")
library("countrycode")
library("maptools")
library("ggmap")
library("mapproj")
Loads the data.
df = read_csv("")
We assume that "paintings" refers to any type of art, so we do not
filter the data on the Classification variable.
Question 1
Create a new dataframe of the stock of paintings at MOMA for each
month in the year.
We assume that the question means the stock of paintings in every
month of every year.
Uses the zoo package to create a date consisting only of month and
year (stored as the 1st of each month).
df$year_month <- as.Date(as.yearmon(df$DateAcquired))
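To make the month truncation concrete, here is a minimal sketch of the same idea in base R, without zoo. The helper name to_month_start and the sample dates are made up for illustration and are not part of the assignment data.

```r
# Hypothetical helper: map each date to the first day of its month.
to_month_start <- function(x) {
  as.Date(format(as.Date(x), "%Y-%m-01"))
}

# Made-up acquisition dates (not taken from the MoMA data).
dates <- c("1996-04-09", "1996-04-30", "1997-01-15")
to_month_start(dates)
```

Both April dates end up in the same month bucket, which is what makes the later group_by(year_month) count acquisitions per month.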
Removes  from column names.
names(df) <- gsub("", "", names(df))
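Removes observations with no date. drop_na() from tidyr only looks at
the named column, whereas na.omit() would drop rows with a missing
value in any column.
df_date <- drop_na(df, DateAcquired)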
Counts the number of art pieces acquired at MoMA per month and year.
df_count_date <- df_date %>% group_by(year_month) %>% summarise(number=n())
Finds the cumulative number of art pieces.
summarise() returns the rows ordered by year_month, so we do not need
to sort before accumulating.
df_count_date$csum <- ave(df_count_date$number,FUN=cumsum)
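The cumulative sum only gives the stock if the rows are in date order. A small sketch with made-up counts, using base R only, shows the sort-then-accumulate step explicitly:

```r
# Made-up monthly acquisition counts, deliberately out of date order.
counts <- data.frame(
  year_month = as.Date(c("1930-03-01", "1930-01-01", "1930-02-01")),
  number     = c(5, 2, 3)
)

# Sort by date first, then accumulate to get the stock at each month.
counts <- counts[order(counts$year_month), ]
counts$csum <- cumsum(counts$number)
counts
```

Without the order() call the running total would start from the March count and no longer describe the stock over time.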
Joins the (cumulative) number of art pieces acquired at MoMA per month
and year onto the full dataset.
df_count_date_joined <- left_join(df_date, df_count_date, by="year_month")
df_count_date_joined <- NA
Question 2
Use ggplot2 and your new data frame to plot the stock of paintings
on the y-axis and the date on the x-axis. What kind of geom do you
think is appropriate? Why? Color the geom you have chosen red. Add a
title and custom axis labels.
Plots the data using ggplot
p = ggplot(df_count_date, aes(x=year_month, y=csum))
p = p + geom_line(color="red")
p = p + labs(title="Stock of Art Pieces in MoMA by date",
             x="Date", y="Stock of Art Pieces")
p = p + theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), color = "red"))
p
We've chosen geom_line because we're displaying the development
of an almost continuous variable over a long period of time.
Another possibility is a bar plot, which underlines that this
is a stock value for every month.
p = ggplot(df_count_date, aes(x=year_month, y=csum))
p = p + geom_bar(color="red", stat = "identity")
p = p + labs(title="Stock of Art Pieces in MoMA by date",
             x="Date", y="Stock of Art Pieces")
p = p + theme_minimal() +
  theme(plot.title = element_text(size = rel(1.5), color = "red"))
p
Question 3
Create the same plot but this time the color should reflect the stock
of paintings for curator approved and non-curator approved paintings,
Counts the number of art pieces acquired at MoMA per month, year,
and curator approval, and finds the cumulative number of art pieces
for each curator approval status.
df_count_date_curator <- df_date %>%
  group_by(year_month, CuratorApproved) %>%
  summarise(number=n()) %>%
  ungroup() %>%
  group_by(CuratorApproved) %>%
  mutate(cur_csum = cumsum(number))
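The grouped cumulative sum can be tried out on a toy table. This is a sketch in base R (ave() with a grouping vector instead of group_by and mutate); the counts and the "N"/"Y" codes are made up for illustration.

```r
# Made-up counts per month and approval status, already in date order.
counts <- data.frame(
  year_month      = as.Date(c("1930-01-01", "1930-01-01",
                              "1930-02-01", "1930-02-01")),
  CuratorApproved = c("N", "Y", "N", "Y"),
  number          = c(1, 4, 2, 3)
)

# ave() applies cumsum separately within each CuratorApproved group
# and returns the results in the original row order.
counts$cur_csum <- ave(counts$number, counts$CuratorApproved, FUN = cumsum)
counts
```

Each status gets its own running total, so the two lines in the plot start from their own first acquisition rather than sharing one total.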
Joins the (cumulative) number of paintings acquired at MoMA per month
and year onto the full dataset. Both join keys are passed as one vector.
df_count_date_joined_curator <- left_join(df_date, df_count_date_curator,
                                          by=c("year_month", "CuratorApproved"))
Plots the data using ggplot
p = ggplot(df_count_date_curator, aes(x=year_month, y=cur_csum,
                                      color=CuratorApproved))
p = p + geom_line()
p = p + labs(title="Stock of Art Pieces in MoMA by date and curator approval",
             x="Date", y="Stock of Art Pieces")
p = p + theme_minimal()
p
Question 4
Create a new dataframe of the stock of paintings grouped by what
department the painting belongs to.
Counts the number of art pieces acquired at MoMA per month, year,
and department, and finds the cumulative number of art
pieces by department.
df_count_date_department <- df_date %>%
  group_by(year_month, Department) %>%
  summarise(number=n()) %>%
  ungroup() %>%
  group_by(Department) %>%
  mutate(dep_csum = cumsum(number))
Joins the (cumulative) number of paintings acquired at MoMA per
month and year onto the full dataset. Both join keys are passed as one vector.
df_count_date_joined_department <- left_join(df_date, df_count_date_department,
                                             by=c("year_month", "Department"))
Question 5
Plot this dataframe using ggplot2. Which department has had
the highest increase in their stock of paintings?
Plots the number of art pieces by date and department.
p = ggplot(df_count_date_department, aes(x=year_month, y=dep_csum,
                                         color=Department))
p = p + geom_line()
p = p + labs(title="Stock of Art Pieces in MoMA by date and department",
             x="Date", y="Stock of Art Pieces")
p = p + theme_minimal()
p
Prints & Illustrated Books has had the highest increase since 1940.
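A claim like this can also be checked numerically instead of being read off the plot. The sketch below uses made-up departments "A" and "B" and invented stock values. It assumes every department already has stock before the cutoff, and that dep_csum never decreases, so the maximum is the latest value.

```r
# Made-up cumulative stock per department (not the MoMA figures).
stock <- data.frame(
  year_month = as.Date(c("1935-01-01", "1950-01-01",
                         "1935-01-01", "1950-01-01")),
  Department = c("A", "A", "B", "B"),
  dep_csum   = c(10, 40, 5, 80)
)
cutoff <- as.Date("1940-01-01")

# Stock at the cutoff: the largest cumulative value on or before it.
before      <- stock[stock$year_month <= cutoff, ]
start_stock <- tapply(before$dep_csum, before$Department, max)

# Stock at the end of the period.
end_stock <- tapply(stock$dep_csum, stock$Department, max)

# Increase since the cutoff, largest first.
increase <- end_stock - start_stock[names(end_stock)]
sort(increase, decreasing = TRUE)
```

A department with no acquisitions before the cutoff would come out as NA here and would need its starting stock set to 0.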
Question 6
Write a piece of code that counts the number of paintings by each
artist in the dataset. List the 10 painters with the highest number
of paintings in MoMA's collection.
Counts the number of paintings by each artist and saves these in the
dataframe "artists" for the classification "Painting".
artists <- df_date %>%
  filter(Artist != "", Classification == "Painting") %>%
  group_by(Artist) %>%
  summarise(count=n())
Sorts the data descending by count and prints the first 10 observations.
artists <- arrange(artists, desc(count))
head(artists, 10)
The output shows that Picasso has the most paintings at MoMA.
Question 7
The variable ArtistBio lists the birth place of each painter. Use
this information to create a world map where each country is colored
according to the stock of paintings in MOMA's collection.
Case 1: the bio does not contain "born".
df$birth1 <- sub(",", "", sub(" .*", "", str_extract(df$ArtistBio, "[A-Z].*[a-z]")))
Case 2: "born" is followed by a year.
df$birth2 <- str_extract(str_extract(str_extract(df$ArtistBio, ".* born [1-9]"), "[A-Z].*,"), "[A-Z].*[a-z]")
Case 3: "born" is followed by a country.
df$birth3 <- str_trim(str_extract(str_extract(df$ArtistBio, "born .*"), " [A-Z].*[a-z] "))
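Uses ifelse to combine the three cases: birth3 where available,
otherwise birth2, otherwise birth1 (the is.na() tests are assumed here).
The result is then stripped of trailing years, brackets and qualifiers.
df$b2b3 <- ifelse(is.na(df$birth3), df$birth2, df$birth3)
df$birth <- ifelse(is.na(df$b2b3), df$birth1, df$b2b3) %>%
  sub(",.*", "", .) %>%
  sub("[)]", "", .) %>%
  sub("[.]", "", .) %>%
  sub(" [(].*", "", .) %>%
  sub(" [1-9].*", "", .) %>%
  sub(" and.*", "", .) %>%
  sub("â.*", "", .) %>%
  sub("established.*", "", .) %>%
  sub("/.*", "", .)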
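The three cases are easier to follow on a few sample strings. The bios below are invented, and their layout ("(Nationality, years)" or "(Nationality, born Country year)") is an assumption about how ArtistBio is formatted. This is a simplified sketch in base R, not the exact extraction used above, and it only handles one-word country names.

```r
# Made-up bios in the three assumed shapes.
bio <- c("(American, 1930-2010)",
         "(French, born 1945)",
         "(American, born Germany 1906)")

# Nationality: the text between "(" and the first comma.
nationality <- sub("^\\(([^,]+),.*$", "\\1", bio)

# Birth country: a capitalised word directly after "born", if present.
has_country <- grepl("born [A-Z]", bio)
country <- ifelse(has_country,
                  sub(".*born ([A-Za-z]+).*", "\\1", bio),
                  NA)

# Prefer the birth country, fall back to the nationality.
birth <- ifelse(is.na(country), nationality, country)
birth
```

The fallback order mirrors the ifelse combination: a country named after "born" wins, and the nationality is used only when no country is given.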
Scraping data to be able to map nationality with country
nat1 <- read_html("") %>% html_nodes("td:nth-child(2)") %>%
  html_text() %>% as.character()
nat2 <- read_html("") %>% html_nodes("td:nth-child(1)") %>%
  html_text() %>% as.character()
Creating a data frame for the scraped data
Renaming the variables
names(dfnat)=c("birth", "country")
Cleaning data
dfnat$country<-sub(" [(].*","",dfnat$country)
Adding rows to the data frame to get the right mapping
dfnat <- rbind(dfnat,data.frame(birth="American",country="USA"))
dfnat <- rbind(dfnat,data.frame(birth="Luxembourgish",country="Luxembourg"))
dfnat <- rbind(dfnat,data.frame(birth="Argentine",country="Argentina"))
dfnat <- rbind(dfnat,data.frame(birth="Croatian",country="Croatia"))
dfnat <- rbind(dfnat,data.frame(birth="Bohemia",country="Czech Republic"))
dfnat <- rbind(dfnat,data.frame(birth="New Zealander",country="New Zealand"))
dfnat <- rbind(dfnat,data.frame(birth="USSR",country="Russia"))
Deleting row 69, since Dutch is listed twice and we only want it to map to Netherlands and not Holland
dfnat<-dfnat[-c(69), ]
Creating a new data frame by merging the existing dataset with the newly scraped data
birth_mapping<-full_join(df,dfnat, by=c("birth"))
Defining the country variable as character
Using the ifelse function to choose the country instead of nationality
Removing missing values
birth_mapping = birth_mapping %>% filter(!
Creating dataframe of stock by artist birthplace
Stockbyartistbirth=group_by(birth_mapping,finalised) %>% tally(sort=TRUE)
Removing Active, various and Nationality unknown
Stockbyartistbirth<-Stockbyartistbirth[-c(17,72,111), ]
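Creating a dataframe with world latitude and longitude data
world_data = map_data("world")
head(world_data)
Joining the stock onto the map data; the country name column in world_data is called region.
Stockbyartistbirth.merge = left_join(world_data, Stockbyartistbirth,
                                     by = c("region" = "finalised"))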
Plotting the data
p = ggplot(data=Stockbyartistbirth.merge,
           aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = n)) +
  expand_limits() +
  theme_minimal()
p
Question 8
The Dimensions variable lists the dimensions of each painting.
Use your data manipulation skills to calculate the area of each
painting (in cm's). Create a data frame of the five largest and
five smallest paintings in MOMA's collection.
Only looks at the category "paintings".
df_date_paintings <- filter(df_date, Classification=="Painting")
Extracts the measurements in centimetres from Dimensions: the text
inside the parentheses, split at "x" into width (bredde) and length (laengde).
df_date_paintings$centimeter <- sub(".*[(]", "", df_date_paintings$Dimensions)
df_date_paintings$cm2 <- sub(" cm)", "", df_date_paintings$centimeter, fixed = TRUE)
df_date_paintings$bredde <- sub("[x].*", "", df_date_paintings$cm2)
df_date_paintings$laengde <- sub(".*[x]", "", df_date_paintings$cm2)
Changes the width and length to numeric class.
df_date_paintings$num_laengde <- as.numeric(df_date_paintings$laengde)
df_date_paintings$num_bredde <- as.numeric(df_date_paintings$bredde)
Finds the area of the (rectangular) paintings in square centimetres.
df_date_paintings$area <- df_date_paintings$num_laengde*df_date_paintings$num_bredde
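The parsing and the area calculation can be sketched together in one function. This assumes the centimetre measurements appear as "(W x H cm)" at the end of Dimensions; the helper name parse_area_cm2 and the sample strings are made up for illustration. Entries that do not match come out as NA instead of triggering a coercion warning.

```r
# Hypothetical helper: area in square centimetres from strings such as
# 4 x 8 1/16" (10 x 20.5 cm). Non-matching entries give NA.
parse_area_cm2 <- function(dimensions) {
  pattern <- ".*\\(([0-9.]+) x ([0-9.]+) cm\\).*"
  ok <- grepl(pattern, dimensions)

  first  <- rep(NA_real_, length(dimensions))
  second <- rep(NA_real_, length(dimensions))

  # Convert only the entries that match, so as.numeric() never sees text.
  first[ok]  <- as.numeric(sub(pattern, "\\1", dimensions[ok]))
  second[ok] <- as.numeric(sub(pattern, "\\2", dimensions[ok]))

  first * second
}

# Made-up examples (not taken from the MoMA data).
areas <- parse_area_cm2(c("4 x 8 1/16\" (10 x 20.5 cm)", "no size given"))
areas
```

Works with three dimensions (for example a depth) would not match this pattern and would need a separate rule.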
Only keeps the two relevant variables.
variables <- c("Title", "area")
df_date_paintings <- df_date_paintings[variables]
Sorts data by area of the painting to find the ten smallest paintings
df_painting_arrange <- arrange(df_date_paintings, area)
Prints the first ten observations - the ten smallest paintings
head(df_painting_arrange, 10)
Sorts data by area of the painting to find the ten largest paintings
df_painting_arrange_desc <-arrange(df_date_paintings, desc(area))
Prints the first ten observations - the ten largest paintings
head(df_painting_arrange_desc , 10)
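Since the question asks for a single data frame with the five smallest and five largest paintings, the two ends of the sorted table can also be combined. A minimal sketch in base R on made-up titles and areas; the column name rank_group is an assumption added for illustration.

```r
# Made-up paintings with areas in square centimetres.
paintings <- data.frame(
  Title = paste("Painting", 1:12),
  area  = c(50, 10, 120, 30, 70, 20, 110, 40, 90, 60, 100, 80)
)

# Drop missing areas first; order() would otherwise put them last and
# tail() would pick them up as the "largest" paintings.
ordered <- paintings[!is.na(paintings$area), ]
ordered <- ordered[order(ordered$area), ]

# Five smallest and five largest in one data frame.
extremes <- rbind(
  data.frame(rank_group = "smallest", head(ordered, 5)),
  data.frame(rank_group = "largest",  tail(ordered, 5))
)
extremes
```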