sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 16: Assignment 1 #28

Closed bjarkedahl closed 9 years ago

bjarkedahl commented 9 years ago

Assignment 1 - Group 16

Read in the datafile

library("readr")
df = read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

---------------------------------------------------------_

Question 1

---------------------------------------------------------_

We check how the first observations for DateAcquired are listed

head(df$DateAcquired)
library("lubridate")
library("dplyr")

Then we select only the observations classed as Paintings and put them in the dataframe df_stock

df_stock = df %>% filter(Classification == "Painting")

Now we extract the month of DateAcquired into a separate column so we can group the stock by month of the year

df_stock = df_stock[!is.na(df_stock$DateAcquired),] # Keep only rows with a non-missing acquisition date
df_stock$ymd = ymd(df_stock$DateAcquired)
df_stock$month = month(df_stock$ymd, label = TRUE, abbr = TRUE) # Column of month abbreviations for the figure
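
As a minimal sketch of what the two lubridate calls above return, here is the same parsing applied to a toy vector. The dates are made up for illustration, and the "YYYY-MM-DD" text form is an assumption about how DateAcquired is stored. The month labels depend on the session locale, so only the underlying month numbers are inspected.

```r
library("lubridate")

# Hypothetical acquisition dates, assumed to be stored as "YYYY-MM-DD" text
toy = data.frame(DateAcquired = c("1996-04-09", "1995-11-14", NA),
                 stringsAsFactors = FALSE)

# Drop rows with a missing date before parsing
toy = toy[!is.na(toy$DateAcquired), , drop = FALSE]

# ymd() parses the text into Date objects
toy$ymd = ymd(toy$DateAcquired)

# month(label = TRUE) returns an ordered factor with 12 levels
toy$month = month(toy$ymd, label = TRUE, abbr = TRUE)

# The integer codes of the factor are the month numbers
month_numbers = as.integer(toy$month)
print(month_numbers)
```

Because month is an ordered factor, plots and tables keep the calendar order instead of sorting the labels alphabetically.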

---------------------------------------------------------_

Question 2

---------------------------------------------------------_

We now create a histogram (a bar chart of counts, since month is discrete) for the stock of paintings, which counts how many paintings have been acquired

in a given month of the year. We choose this chart since it nicely shows that most paintings are

acquired in particular months. The bars are colored red, and the plot is given a title and axis labels

library("ggplot2")
p = ggplot(data = df_stock, aes(x = month))
# month is a discrete factor, so the counts are drawn with geom_bar(); current ggplot2 rejects geom_histogram() on a discrete x
p = p + geom_bar(fill = "red") + labs(title = "Stock of paintings acquired in each month", x = "Month", y = "Number of paintings acquired")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)
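
The counting behind this plot can be sketched on toy data. The months below are made up; the sketch only illustrates that geom_bar() counts the rows in each category by itself, so no y aesthetic is needed, and that table() gives the same counts as a quick check.

```r
library("ggplot2")

# Hypothetical months for six acquisitions, with the calendar order as factor levels
toy = data.frame(month = factor(c("Jan", "Jan", "Feb", "Apr", "Apr", "Apr"),
                                levels = month.abb))

# table() gives the counts that the bars will show
counts = table(toy$month)

# geom_bar() uses stat = "count" by default
p = ggplot(data = toy, aes(x = month)) +
  geom_bar(fill = "red") +
  labs(x = "Month", y = "Number of paintings acquired")

print(counts[c("Jan", "Feb", "Apr")])
```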

---------------------------------------------------------_

Question 3

---------------------------------------------------------_

The same plot is now colored to show whether each painting is curator approved or not.

This is done by adding fill = CuratorApproved to the aesthetics

p = ggplot(data = df_stock, aes(x = month, fill = CuratorApproved))
# geom_bar() because month is discrete; the bars are stacked by CuratorApproved
p = p + geom_bar() + labs(title = "Stock of paintings acquired in each month", x = "Month", y = "Number of paintings acquired")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)

---------------------------------------------------------_

Question 4

---------------------------------------------------------_

A new dataframe containing the stock of paintings grouped by what department they belong to is created and named df_dep

df_dep = df_stock %>%
  filter(!is.na(Department)) %>% # Removing paintings with missing department
  group_by(Department)

---------------------------------------------------------_

Question 5

---------------------------------------------------------_

A histogram (bar chart of counts) showing the stock of paintings in each department is created.

To make the department names easier to read, the axes of the plot are flipped.

The department "Painting and sculpture" has acquired by far the largest number of paintings.

This makes perfect sense, since we are only considering paintings

p = ggplot(data = df_dep, aes(x = Department))
# geom_bar() because Department is discrete
p = p + geom_bar() + coord_flip() + labs(title = "Stock of paintings in each department", x = "Department", y = "Number of paintings")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)

---------------------------------------------------------_

Question 6

---------------------------------------------------------_

A table containing the top 10 artists is displayed

sort(table(df_stock$Artist),decreasing=TRUE)[1:10]
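
The same ranking can also be written with dplyr, which returns a data frame instead of a named table. This is only a sketch on made-up artist names, not the output for the MoMA data.

```r
library("dplyr")

# Hypothetical artist column with repeated names
toy = data.frame(Artist = c("A", "B", "A", "C", "A", "B"),
                 stringsAsFactors = FALSE)

# count() gives one row per artist with the number of rows in column n;
# sort = TRUE orders the rows by n, largest first
top = toy %>%
  count(Artist, sort = TRUE) %>%
  head(2)

print(top)
```

With the real data, head(10) would keep the ten most frequent artists.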

---------------------------------------------------------_

Question 7

---------------------------------------------------------_

We start by extracting the nationalities from the ArtistBio column

library("stringr")
df_stock$Nationality = str_extract(df_stock$ArtistBio, "[A-Z][a-z]+")
df_stock = df_stock[!is.na(df_stock$Nationality),] # Removing observations with missing nationality

Making sure that we actually get the artist's country of birth by creating a born variable

df_stock$natio = gsub(",", "", df_stock$ArtistBio)
df_stock$natio = gsub("(", "", df_stock$natio, fixed = TRUE) # fixed = TRUE: a bare "(" is not a valid regular expression
df_stock$natio = gsub(")", "", df_stock$natio, fixed = TRUE)
df_stock$born = str_extract(df_stock$natio, "born .* ")
df_stock$born2 = gsub("born", "", df_stock$born)
df_stock$born = gsub("[0-9]", "", df_stock$born2)
df_stock$born = gsub(" ", "", df_stock$born)
df_stock$born3 = str_extract(df_stock$born, "[A-Z][a-z]*")

Using ifelse to isolate the correct birth nationality

df_stock$nationality = ifelse(is.na(df_stock$born3), df_stock$Nationality, df_stock$born3)
sort(unique(df_stock$nationality)) # We check whether some nationalities are near-duplicates that should have been the same
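
The extraction steps above can be sketched more compactly with a capture group. The two ArtistBio strings below are hypothetical and only meant to resemble the column; the real data may use other layouts, so this is an illustration of the idea rather than a drop-in replacement.

```r
library("stringr")

# Hypothetical ArtistBio strings; the real column may look different
bio = c("(American, 1912-1956)", "(French, born Spain 1881-1973)")

# First capitalised word, as in the Nationality column above
first_word = str_extract(bio, "[A-Z][a-z]+")

# str_match() returns a matrix; column 2 holds the first capture group,
# here the capitalised word that follows "born " (NA when there is none)
born = str_match(bio, "born ([A-Z][a-z]+)")[, 2]

# Prefer the birth country when present, otherwise fall back to the first word
nationality = ifelse(is.na(born), first_word, born)
print(nationality)
```

Note that the result mixes nationalities ("American") and country names ("Spain"), which is why a lookup from nationality to country is still needed afterwards.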

We had problems converting the nationalities to country names so that they could be merged with the map data from the maps package.

We got help from a fellow student (Adam), who allowed us to use the external files in his GitHub repository.

Data on countries and nationalities, tab-separated, from https://www.englishclub.com/vocabulary/world-countries-nationality.htm and more.

Nat1 <- read.csv("https://raw.githubusercontent.com/adamingwersen/Data_for_assignment1_SDS/master/Nationalities2.txt", sep="\t", header = FALSE)

Next, we clean up the Nat1 Dataframe by renaming variables, removing the empty column and removing duplicates in Nationality

names(Nat1) = c("Country", "nationality", "") # Naming the first column Country, the second nationality and the third blank
Nat1 = Nat1[, c("Country", "nationality")] # Removing the empty column
Nat1[duplicated(Nat1$nationality),] # Listing the rows with a duplicated nationality
Nat1 = Nat1[!duplicated(Nat1$nationality),] # Removing the duplicates in nationality (the first country listed is kept)

Creating a dataframe - df_stock_nat - containing nationality and country of birth of the Artist. This is done by "merging" on nationality

library("dplyr")
df_stock_nat = left_join(df_stock, Nat1, by = "nationality") # We do not have country names for all nationalities, so Country is missing (NA) for some rows of df_stock_nat

Since some values in the nationality column have no match in Nat1 (for instance values that are already country names), we use this code to keep them in the new dataframe

df_stock_nat$Country2 = ifelse(is.na(df_stock_nat$Country), df_stock_nat$nationality, as.character(df_stock_nat$Country))
df_stock_nat$Country2 = as.character(df_stock_nat$Country2) # Making sure Country2 is a character vector
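
The join and the fallback can be sketched on a toy lookup table. All values below are made up, and coalesce() from dplyr is used here as a shorter alternative to the ifelse() call; it returns the first non-missing value of its arguments.

```r
library("dplyr")

# Hypothetical artists: two nationalities and one value that is already a country
artists = data.frame(nationality = c("American", "Spain", "Danish"),
                     stringsAsFactors = FALSE)

# Hypothetical lookup from nationality to country
lookup = data.frame(Country = c("United States", "Denmark"),
                    nationality = c("American", "Danish"),
                    stringsAsFactors = FALSE)

# left_join() keeps every row of artists; unmatched rows get Country = NA
joined = left_join(artists, lookup, by = "nationality")

# Use Country when it exists, otherwise keep the original value
joined$Country2 = coalesce(joined$Country, joined$nationality)
print(joined)
```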

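Creating a new dataframe - map.df - from the world map returned by map_data()

library("ggmap") # Attaches ggplot2, which provides map_data(); the maps package must be installed
map.df = map_data("world")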

Adding country codes (iso2c) to the dataframes map.df and df_stock_nat

library("countrycode")
map.df$iso2c = countrycode(map.df$region, origin = "country.name", destination = "iso2c")
df_stock_nat$iso2c = countrycode(df_stock_nat$Country2, origin = "country.name", destination = "iso2c")

Finding the total number of paintings from each country

library("dplyr")
df_stock_map = df_stock_nat %>%
  select(iso2c) %>%
  group_by(iso2c) %>%
  summarise(number = n())

Joining the dataframes df_stock_map and map.df so we can draw the map

library("dplyr")
df_paintings = inner_join(map.df, df_stock_map, by = "iso2c")

Because American artists account for the majority of the paintings held by MoMA, a log scale could be used for the fill. Without it, the plot below is dominated by the United States and conveys little.

library("ggplot2")
p = ggplot(df_paintings, aes(x = long, y = lat, group = group, fill = number))
p = p + geom_polygon()
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_blank(),
              axis.text.x = element_blank(), axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              axis.title.x = element_blank(), axis.title.y = element_blank())
p = p + labs(title = "Number of paintings by artist's country")
p = p + scale_fill_gradient(low = "#00b3b3", high = "#cccc00", guide = "colorbar")
plot(p)
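
One way to apply the log idea mentioned above is to map the fill to the base-10 logarithm of the count. The sketch below uses two made-up square "countries" with hypothetical counts instead of the world map, so it only illustrates the transformation, not the real result.

```r
library("ggplot2")

# Two hypothetical square regions with very different counts
toy = data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  number = rep(c(1000, 10), each = 4)
)

# On the log10 scale the counts 1000 and 10 become 3 and 1,
# so the smaller region no longer disappears at the bottom of the color scale
toy$log_number = log10(toy$number)

p = ggplot(toy, aes(x = long, y = lat, group = group, fill = log_number)) +
  geom_polygon() +
  labs(fill = "log10(number)")

print(unique(toy$log_number))
```

The legend then reads in powers of ten, which should be stated in the legend title as done here.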

---------------------------------------------------------_

Question 8

---------------------------------------------------------_

library("stringr")
library("dplyr")

We start by parsing the Dimensions variable so we get one variable containing the height and one containing the length

df_stock$dim1 = str_extract(df_stock$Dimensions, "\\([^()]+\\)") # First parenthesised part, assumed to hold the measures in cm
df_stock$dimension_cm = gsub("(", "", df_stock$dim1, fixed = TRUE)
df_stock$dimension_cm = gsub(")", "", df_stock$dimension_cm, fixed = TRUE)
df_stock$height = word(df_stock$dimension_cm, 1) # First word
df_stock$length = word(df_stock$dimension_cm, -2) # Second-to-last word, just before the unit
df_stock$height = as.numeric(df_stock$height)
df_stock$length = as.numeric(df_stock$length)
df_stock$area = df_stock$length * df_stock$height # Calculating the area

We do not get the correct measures but here's how we would display the 5 largest and smallest paintings:

Creating a dataframe including the 5 biggest paintings

df.area1 = df_stock %>% arrange(desc(area)) %>% slice(1:5)

Creating a dataframe including the 5 smallest paintings

df.area2 = df_stock %>% arrange(area) %>% slice(1:5)

Connecting the 2 dataframes into one dataframe including the 5 biggest and 5 smallest paintings

df.area = rbind(df.area1, df.area2)
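
A minimal sketch of the whole step on toy data, assuming the Dimensions strings end in a parenthesised "(height x length cm)" part. The three strings and titles are hypothetical, and the real column contains more variants than this pattern handles, so the sketch shows the approach rather than a complete parser.

```r
library("stringr")
library("dplyr")

# Hypothetical Dimensions strings in an assumed "inches (cm)" layout
toy = data.frame(
  Title = c("small", "large", "medium"),
  Dimensions = c('10 x 8" (25.4 x 20.3 cm)',
                 '48 1/2 x 66 3/8" (123.2 x 168.6 cm)',
                 '20 x 30" (50.8 x 76.2 cm)'),
  stringsAsFactors = FALSE
)

# Columns 2 and 3 of the match matrix hold the two numbers inside "(... x ... cm)"
m = str_match(toy$Dimensions, "\\(([0-9.]+) x ([0-9.]+) cm\\)")
toy$height = as.numeric(m[, 2])
toy$length = as.numeric(m[, 3])
toy$area = toy$height * toy$length

# Largest and smallest painting by area, combined into one dataframe
largest = toy %>% arrange(desc(area)) %>% slice(1)
smallest = toy %>% arrange(area) %>% slice(1)
df.area = rbind(largest, smallest)
print(df.area$Title)
```

Rows that do not match the pattern get NA for height, length and area, and arrange() places them last, so they do not end up among the largest or smallest paintings.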

sebastianbarfort commented 9 years ago

Very well done!

I really like your solution to question 7. You're going about it slightly inefficiently, but the idea is very nice.

You've done good work using the functions from the stringr (I didn't know about the word function) and dplyr libraries.

PASS