sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 22: Assignment 1 #10

Closed LouisePoulsen closed 9 years ago

LouisePoulsen commented 9 years ago

Assignment 1

This assignment has been completed by group 22:

Kaspar Pugesgaard, Kasper Wetterslev,

Line Rasmussen and Louise Poulsen.

The dataset contains over 120,000 records of the works that

have been accessioned into MoMA's collection.

Loading relevant libraries.

library("readr")
library("dplyr")
library("ggplot2")
library("zoo")
library("lubridate")
library("stringr")
library("tidyr")
library("rvest")
library("XML")
library("countrycode")
library("maptools")
library("ggmap")
library("mapproj")

Loads the data.

df = read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

We assume that "paintings" refers to any type of artwork, so we do not filter

the data on the classification variable.

Question 1

Create a new dataframe of the stock of paintings at MOMA for each

month in the year. We assume that the question means the stock

of paintings every month of every year.

Using the zoo package, creates a date consisting only of month and

year (the variable takes the value of the 1st of each month).

df$year_month <- as.Date(as.yearmon(df$DateAcquired))
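To illustrate what this conversion does, here is a minimal sketch on three made-up acquisition dates (the dates are assumptions for illustration, not values from the dataset). Each date is mapped to the first day of its month, so dates within the same month share one `year_month` value.

```r
library(zoo)

# Hypothetical acquisition dates, for illustration only
dates <- as.Date(c("1996-04-09", "1996-04-27", "2001-12-31"))

# as.yearmon() keeps only year and month; as.Date() turns that back
# into a Date that falls on the 1st of the month
year_month <- as.Date(as.yearmon(dates))
print(year_month)
```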

Removes  from column names.

names(df) <- gsub("", "", names(df))

Removes observations with no date.

df_date <- filter(df, !is.na(DateAcquired))
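Which rows get dropped depends on how missing values are tested: `na.omit()` on a data frame removes rows with a missing value in any column, whereas a test on `DateAcquired` alone removes only the rows without a date. A small sketch on a toy data frame (all values hypothetical) shows the difference:

```r
# Toy data frame (hypothetical values): row 2 lacks a date,
# row 3 has a date but lacks a medium
toy <- data.frame(
  DateAcquired = c("1996-04-09", NA, "2001-12-31"),
  Medium       = c("Oil on canvas", "Ink on paper", NA),
  stringsAsFactors = FALSE
)

# na.omit() drops every row with an NA in ANY column
dropped_any <- na.omit(toy)

# Testing the date column only drops just the row with no date
dropped_date <- toy[!is.na(toy$DateAcquired), ]

print(nrow(dropped_any))
print(nrow(dropped_date))
```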

Counts the number of art pieces acquired at MoMA per month and year.

df_count_date <- df_date %>% group_by(year_month) %>% summarise(number=n())

Finds the cumulative number of art pieces.

The data is already sorted by date, so we do not need to sort before accumulating.

df_count_date$csum <- cumsum(df_count_date$number)
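The count-then-accumulate step can be checked on a small example. The sketch below uses made-up month values and adds an explicit `arrange()` so the result does not rely on the data already being sorted; it is an illustration, not the exact code used above.

```r
library(dplyr)

# Hypothetical acquisitions: one row per art piece, already reduced
# to the first day of the acquisition month
toy <- data.frame(
  year_month = as.Date(c("1996-04-01", "1996-04-01", "1996-05-01",
                         "1996-07-01", "1996-07-01", "1996-07-01"))
)

stock <- toy %>%
  group_by(year_month) %>%
  summarise(number = n()) %>%      # pieces acquired in each month
  arrange(year_month) %>%          # make the ordering explicit
  mutate(csum = cumsum(number))    # running total = stock

print(stock)
```

Months without any acquisitions (June 1996 in this toy data) do not appear as rows; a line plot simply bridges them.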

Joins the (cumulative) number of art pieces acquired at MoMA per month and

year onto the full dataset.

df_count_date_joined <- left_join(df_date, df_count_date, by="year_month")
df_count_date_joined <- NA

Question 2

Use ggplot2 and your new data frame to plot the stock of paintings

on the y-axis and the date on the x-axis. What kind of geom do you

think is appropriate? Why? Color the geom you have chosen red. Add a

title and custom axis labels.

Plots the data using ggplot

p = ggplot(df_count_date, aes(x=year_month, y=csum))
p = p + geom_line(color="red")
p = p + labs(title="Stock of Art Pieces in MoMA by date", x="Date", y="Stock of Art Pieces")
p = p + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), color = "red"))
p

We have chosen geom_line because we are displaying the development

of an almost continuous variable over a long period of time.

Another possibility is a bar plot, which underlines that this

is a stock value for every month.

p = ggplot(df_count_date, aes(x=year_month, y=csum))
p = p + geom_bar(color="red", stat = "identity")
p = p + labs(title="Stock of Art Pieces in MoMA by date", x="Date", y="Stock of Art Pieces")
p = p + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), color = "red"))
p

Question 3

Create the same plot but this time the color should reflect the stock

of paintings for curator approved and non-curator approved paintings,

respectively.

Counts the number of art pieces acquired at MoMA per month, year,

and curator approval status, and finds the cumulative number of art pieces

by approval status.

df_count_date_curator <- df_date %>%
  group_by(year_month, CuratorApproved) %>%
  summarise(number=n()) %>%
  ungroup() %>%
  group_by(CuratorApproved) %>%
  mutate(cur_csum = cumsum(number))

Joins the (cumulative) number of paintings acquired at MoMA per month

and year onto the full dataset.

df_count_date_joined_curator <- left_join(df_date, df_count_date_curator, by=c("year_month", "CuratorApproved"))
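When a join uses two keys, dplyr expects them as a single character vector in `by`. A minimal sketch on toy data (titles and values are hypothetical) shows each art piece picking up the stock value for its own month and approval status:

```r
library(dplyr)

# Hypothetical art pieces
pieces <- data.frame(
  year_month      = as.Date(c("1996-04-01", "1996-04-01", "1996-05-01")),
  CuratorApproved = c("Y", "N", "Y"),
  Title           = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical cumulative stock per month and approval status
stock <- data.frame(
  year_month      = as.Date(c("1996-04-01", "1996-04-01", "1996-05-01")),
  CuratorApproved = c("Y", "N", "Y"),
  cur_csum        = c(1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Both keys go into one vector passed to `by`
joined <- left_join(pieces, stock, by = c("year_month", "CuratorApproved"))
print(joined)
```

Joining on `year_month` alone would instead pair each April piece with both April stock rows.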

Plots the data using ggplot

p = ggplot(df_count_date_curator, aes(x=year_month, y=cur_csum, color=CuratorApproved))
p = p + geom_line()
p = p + labs(title="Stock of Art Pieces in MoMA by date and curator approval", x="Date", y="Stock of Art Pieces")
p = p + theme_minimal()
p

Question 4

Create a new dataframe of the stock of paintings grouped by what

department the painting belongs to.

Counts the number of art pieces acquired at MoMA per month, year,

and department, and finds the cumulative number of art

pieces by department.

df_count_date_department <- df_date %>%
  group_by(year_month, Department) %>%
  summarise(number=n()) %>%
  ungroup() %>%
  group_by(Department) %>%
  mutate(dep_csum = cumsum(number))

Joins the (cumulative) number of paintings acquired at MoMA per

month and year onto the full dataset.

df_count_date_joined_department <- left_join(df_date, df_count_date_department, by=c("year_month", "Department"))

Question 5

Plot this dataframe using ggplot2. Which department has had

the highest increase in their stock of paintings?

Plots the number of art pieces by date and department.

p = ggplot(df_count_date_department, aes(x=year_month, y=dep_csum, color=Department))
p = p + geom_line()
p = p + labs(title="Stock of Art Pieces in MoMA by date and department", x="Date", y="Stock of Art Pieces")
p = p + theme_minimal()
p

Prints & Illustrated Books has had the highest increase since 1940.

Question 6

Write a piece of code that counts the number of paintings by each

artist in the dataset. List the 10 painters with the highest number

of paintings in MoMA's collection.

Counts the number of paintings by each artist and saves these in the

dataframe "artists" for the classification "Painting".

artists <- df_date %>%
  filter(Artist != "", Classification == "Painting") %>%
  group_by(Artist) %>%
  summarise(count=n())

Sorts the data descending by count and prints the first 10 observations.

artists <- arrange(artists, desc(count))
head(artists, 10)

We can see that Picasso has the most paintings at MoMA.
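The same count-and-rank step can also be written with dplyr's `count(sort = TRUE)`, which combines the grouping, counting, and descending sort. A minimal sketch on made-up artist names (not taken from the dataset):

```r
library(dplyr)

# Hypothetical artist column: one row per painting
toy <- data.frame(
  Artist = c("Artist A", "Artist B", "Artist A",
             "Artist C", "Artist A", "Artist B"),
  stringsAsFactors = FALSE
)

# count() groups by Artist, counts rows into column `n`,
# and sorts descending when sort = TRUE
top <- toy %>%
  count(Artist, sort = TRUE) %>%
  head(2)

print(top)
```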

Question 7

The variable ArtistBio lists the birth place of each painter. Use

this information to create a world map where each country is colored

according to the stock of paintings in MOMA's collection.

Case 1: the bio does not contain "born".

df$birth1 <- sub(",.*", "", sub(" .*", "", str_extract(df$ArtistBio, "[A-Z].*[a-z]")))

Case 2: "born" is followed by a year.

df$birth2 <- str_extract(str_extract(str_extract(df$ArtistBio, ".* born [1-9]"), "[A-Z].*,"), "[A-Z].*[a-z]")

Case 3: "born" is followed by a country.

df$birth3 <- str_trim(str_extract(str_extract(df$ArtistBio, "born .*"), " [A-Z].*[a-z] "))

Uses the ifelse function to combine the three cases, then strips trailing text.

df$b2b3 <- ifelse(is.na(df$birth3), df$birth2, df$birth3)
df$birth <- sub("/.*","",sub("established.*","",sub("â.*","",sub(" and.*","",sub(" [1-9].*","",sub(" [(].*","",sub("[.]","",sub("[)]","",sub(",.*","",ifelse(is.na(df$b2b3),df$birth1,df$b2b3))))))))))
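The idea behind combining the cases can be sketched on a few sample strings. The bios below are hypothetical examples written in the style of `ArtistBio`, and the two patterns are simplified assumptions for illustration, not the patterns used above.

```r
library(stringr)

# Hypothetical bios covering the three cases
bios <- c("(American, 1930-2010)",
          "(French, born Romania. 1876-1957)",
          "(American, born 1936)")

# Nationality: the first capitalised word in the bio
nationality <- str_extract(bios, "[A-Z][a-z]+")

# Birthplace: the text after "born " when it starts with a capital letter;
# NA when "born" is missing or is followed by a year
born_place <- str_match(bios, "born ([A-Z][A-Za-z ]+)")[, 2]

# Prefer the explicit birthplace, otherwise fall back to the nationality
birth <- ifelse(is.na(born_place), nationality, born_place)
print(birth)
```

The result mixes nationalities and country names, which is why a nationality-to-country mapping is still needed afterwards.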

Scraping data to be able to map nationality with country

nat1 <- html("https://www.englishclub.com/vocabulary/world-countries-nationality.htm") %>%
  html_nodes("td:nth-child(2)") %>%
  html_text() %>%
  as.character()
nat2 <- html("https://www.englishclub.com/vocabulary/world-countries-nationality.htm") %>%
  html_nodes("td:nth-child(1)") %>%
  html_text() %>%
  as.character()

Creating a data frame for the scraped data

dfnat<-data.frame(nat1,nat2)

Renaming the variables

names(dfnat)=c("birth", "country")

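Cleaning data: removes everything after a comma in the country name

dfnat$country<-sub(",.*","",dfnat$country)

Cleaning data: removes parenthesised text after the country name

dfnat$country<-sub(" [(].*","",dfnat$country)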

Adding rows to the data frame to get the right mapping

dfnat <- rbind(dfnat,data.frame(birth="American",country="USA"))
dfnat <- rbind(dfnat,data.frame(birth="Luxembourgish",country="Luxembourg"))
dfnat <- rbind(dfnat,data.frame(birth="Argentine",country="Argentina"))
dfnat <- rbind(dfnat,data.frame(birth="Croatian",country="Croatia"))
dfnat <- rbind(dfnat,data.frame(birth="Bohemia",country="Czech Republic"))
dfnat <- rbind(dfnat,data.frame(birth="New Zealander",country="New Zealand"))
dfnat <- rbind(dfnat,data.frame(birth="USSR",country="Russia"))

Deleting a row, since Dutch is mentioned twice and we only want it to map to Netherlands and not Holland

dfnat<-dfnat[-c(69), ]

Creating a new data frame by merging the existing dataset with the newly scraped data

birth_mapping<-full_join(df,dfnat, by=c("birth"))

Defining the country variable as character

birth_mapping$country<-as.character(birth_mapping$country)

Using the ifelse function to choose the country instead of nationality

birth_mapping$finalised<-ifelse(is.na(birth_mapping$country),birth_mapping$birth,birth_mapping$country)
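The lookup-with-fallback logic can be illustrated on a small example. This sketch uses base R `match()` rather than the join above (a swapped-in technique for brevity), and the lookup table and values are hypothetical.

```r
# Hypothetical lookup table from nationality to country
lookup <- data.frame(
  birth   = c("American", "French", "Danish"),
  country = c("USA", "France", "Denmark"),
  stringsAsFactors = FALSE
)

# Extracted values: two nationalities and one value that is already a country
birth <- c("French", "Romania", "American")

# match() finds each value's row in the lookup (NA when absent)
country <- lookup$country[match(birth, lookup$birth)]

# Keep the original value when the lookup has no entry for it
finalised <- ifelse(is.na(country), birth, country)
print(finalised)
```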

Removing missing values

birth_mapping = birth_mapping %>% filter(!is.na(finalised))

Creating dataframe of stock by artist birthplace

Stockbyartistbirth=group_by(birth_mapping,finalised) %>% tally(sort=TRUE)

Removing Active, various and Nationality unknown

Stockbyartistbirth<-Stockbyartistbirth[-c(17,72,111), ]

Creating a dataframe with world latitude and longitude data

world_data = map_data("world")
head(world_data)

names(world_data)=c("long","lat","group","order","finalised","subregion")

Stockbyartistbirth.merge = left_join(world_data, Stockbyartistbirth, "finalised")

Plotting the data

p = ggplot(data=Stockbyartistbirth.merge, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = n)) +
  expand_limits() +
  theme_minimal()
p

Question 8

The Dimensions variable lists the dimensions of each painting.

Use your data manipulation skills to calculate the area of each

painting (in cm's). Create a data frame of the five largest and

five smallest paintings in MOMA's collection.

Only looks at the category "paintings".

df_date_paintings <- filter(df_date, Classification=="Painting")

Finds the measurement of the painting in centimeter.

df_date_paintings$centimeter <- sub(".*[(]", "", df_date_paintings$Dimensions)
df_date_paintings$cm2 <- sub(" cm[)]", "", df_date_paintings$centimeter)
df_date_paintings$bredde <- sub("[x].*", "", df_date_paintings$cm2)
df_date_paintings$laengde <- sub(".*[x]", "", df_date_paintings$cm2)

Changes the width and length to numeric class.

df_date_paintings$num_laengde <- as.numeric(df_date_paintings$laengde) df_date_paintings$num_bredde <- as.numeric(df_date_paintings$bredde)

Finds the area of the (rectangular) paintings.

df_date_paintings$area <- df_date_paintings$num_laengde*df_date_paintings$num_bredde
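The parsing steps can be traced on a couple of sample strings. The sketch below assumes that `Dimensions` has the form `inches (width x height cm)`; the two strings are made-up examples in that style, and `trimws()` is added only to make the numeric conversion explicit.

```r
# Hypothetical dimension strings in the assumed format
dims <- c("9 x 12\" (22.9 x 30.5 cm)",
          "78 x 96\" (198.1 x 243.8 cm)")

# Drop everything up to and including the opening parenthesis
centimeter <- sub(".*[(]", "", dims)

# Drop the trailing " cm)"
cm2 <- sub(" cm[)]", "", centimeter)

# Text before the first "x" and text after the last "x"
bredde  <- as.numeric(trimws(sub("[x].*", "", cm2)))
laengde <- as.numeric(trimws(sub(".*[x]", "", cm2)))

# Area in square centimetres
area <- bredde * laengde
print(area)
```

Entries with three dimensions, or without a centimetre part, are not handled correctly by these patterns and would need separate treatment.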

Only keeps the two relevant variables.

variables <- c("Title", "area")
df_date_paintings <- df_date_paintings[variables]

Sorts data by area of the painting to find the ten smallest paintings

df_painting_arrange <- arrange(df_date_paintings, area)

Prints the first ten observations - the ten smallest paintings

head(df_painting_arrange, 10)

Sorts data by area of the painting to find the ten largest paintings

df_painting_arrange_desc <-arrange(df_date_paintings, desc(area))

Prints the first ten observations - the ten largest paintings

head(df_painting_arrange_desc , 10)

sebastianbarfort commented 9 years ago

Very very nice job.

You're making great use of dplyr verbs, great plots, and great use of scraping using rvest (although html is now read_html).

Keep up the good work!

APPROVED