sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 19: Assignment 1 #19

Closed ChrisNygaard closed 9 years ago

ChrisNygaard commented 9 years ago

ASSIGNMENT 1

Packages & data ----------------------------------------------------------

First data is loaded and packages are opened

library("readr") library("lubridate") library("ggplot2") library("stringr") library("maps") library("countrycode")

df <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

After loading the data we make a quick summary statistic to investigate whether any

variables have missing values. This is done using the code below

sapply(df, function(x) sum(is.na(x)))

It shows that 4434 works on MOMA have no date of acquisition

We remove these observations and continue with the remaining 119.485 obs.

df <- df[complete.cases(df),] df <- na.omit(df)

Question 1&2 -------------------------------------------------------------

We find paintings in the data

then we create a variable with the value of the month (january is month 1, february month 2 etc.)

after this we create a variable with the name of the months and factor it with the values,

so it isn't sorted alphabetically df.paint <- subset(df, Classification == "Painting") df.paint$monthval <- format(df.paint$DateAcquired,format = "%m") df.paint$month <- format(df.paint$DateAcquired,format = "%B") df.paint$month <- factor(df.paint$month, levels = df.paint$month[order(df.paint$monthval)])

and then the plot, where we use the amount of paintings sold in each month

p = ggplot(data=df.paint, aes(x = month)) p + geom_bar(fill = "black") + labs(title = "Paintings per month", x = "Month", y = "Number of paintings")

Question 3 -------------------------------------------------------------

We again use ggplot, but the fill is weighted with the variable "CuratorApproved", to show how many paintings were approved

p = ggplot(data=df.paint, aes(x = month)) p + geom_bar(aes(fill = CuratorApproved)) + labs(title = "Paintings per month", x = "Month", y = "Number of paintings")

Question 4 -------------------------------------------------------------

We assume you don't mean "paintings" since all paintings are in the same department

hence, in this dataset, we use all works of art.

In this question the stock of art is grouped by department and month of aqusition

df.paint.dept = df df.paint.dept$monthval <- format(df.paint.dept$DateAcquired,format = "%m") df.paint.dept$month <- format(df.paint.dept$DateAcquired,format = "%B") df.paint.dept$month <- factor(df.paint.dept$month, levels = df.paint$month[order(df.paint$monthval)])

df.paint.dept <- df.paint.dept %>% group_by(Department, month)

Question 5 -------------------------------------------------------------

This time we make a plot of accumulated number of paintings by department grouped

by each month. As it is clear from the plot the biggist increase in paintings has

been in the 'paintings and sculpture' department

p=ggplot(data=df.paint.dept, aes(x=month)) p+geom_bar(aes(fill=Department), position=position_dodge()) + scale_fill_brewer(palette = "Blues")

Question 6 -------------------------------------------------------------

We create a dataframe where we count the number of paintings by each artist at MOMA

and then sort this list in order to show the 10 painters with the most paintings at MOMA

df.paint.artist = df.paint %>% group_by(Artist) %>% summarise( Paintings.by.artist = n()) %>% arrange(-Paintings.by.artist)

We then print the first 10 obs. hence the 10 painters with the most paintings

head(df.paint.artist, n=10)

Question 7 -------------------------------------------------------------

df.paint$nationality= str_extract(df.paint$ArtistBio, "[A-z]+")

Next we use a file created which match country and nationality

df.nation=read.table("C:/Users/Christoffer/Desktop/Uni/Social Data Science/R-filer/Country_Nationality.csv",sep=";", header=TRUE)

We change the name in the dataset from 'adjective' to 'nationality

names(df.nation)[names(df.nation)=="adjective"] <-"nationality" df.paint.country=left_join(df.paint, df.nation)

it is nice to investigate how many NA there are e.g. how many 'nationality'

variables has no corresponding country

df.NA=subset(df.paint.country, is.na(df.paint.country$country))

we then group by country

df.NA=df.NA %>% group_by(nationality) %>% summarise( countrycode=n())

We correct the mising obs. in the following way:

1) American are matched with USA

2) Argentine are matched with Argentina

3) Croatian are matched with Croatia

4) Israel is added

5) Korean are matched with South Korea

6) South are matched with South Africa

we accept no match for a nationality called 'born' and for two NA

We then use the countrycode package to add an iso3c variable to match the mapping data

df.paint.country$countrycode <- countrycode(df.paint.country$country, "country.name", "iso3c") world=map_data("world")

we add the iso3c again to the world data

world$countrycode <- countrycode(world$region, "country.name", "iso3c")

we are now ready to match the two datasets

df.paint.by.country <- df.paint.country %>% group_by(nationality) %>% summarise( Paintings.by.nationality = n()) df.question7=left_join(df.paint.country, df.paint.by.country) df.question7 <- left_join(world, df.question7)

We now remake a variable, to put the number of paintings in different categories, namely

1-9,10-19,20-29,30-39,40-49,50-99,100-199,200-499 & >500

df.question7$paintbynatinterval <- cut(df.question7$Paintings.by.nationality, breaks = c(-Inf,10,20,30,40,50,100,200,500,Inf), labels=c("1-9", "10-19", "20-29", "30-39", "40-49", "50-99", "100-199","200-499",">500"), right=FALSE)

p=ggplot(df.question7, aes(x=long, y=lat, group=group)) p + geom_polygon(aes(fill=paintbynatinterval), colour = "black") + scale_fill_brewer(palette = "Blues")

Question 8 -------------------------------------------------------------

df.dimensions <- df

First we use a grouping, where we extract the dimensions in cm

df.dimensions$DimensionsExtract <- sub(".?(([\d]+.?[\d])\s[xX\xD7]\s([\d]+.?[\d]).", "\1 \2", df.dimensions$Dimensions, perl = TRUE)

We remove problematic observations and then split observations in height and with.

df.dimensions <- filter(df.dimensions, grepl("[\d]+.?[\d]\s[\d]+.?[\d]$", df.dimensions$DimensionsExtract, perl = TRUE)) df.dimensions <- filter(df.dimensions, grepl("^[\d.]*\s[\d.]+$", df.dimensions$DimensionsExtract, perl = TRUE)) df.dimensions$Height <- as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,1]) df.dimensions$Width <- as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,2])

Then we create the area variable

df.dimensions$area <- df.dimensions$Height * df.dimensions$Width

We do as previously, only grouping by area instead.

Then we arrange descending then showing head (top) and tail (bottom)

df.dimensions.size <- df.dimensions %>% group_by(area) %>% arrange(-area) head(df.dimensions.size, n=5) tail(df.dimensions.size, n=5)

sebastianbarfort commented 9 years ago

Good job.

I think you might consider using dplyr more systematically when doing data manipulation. It makes using sapply and related functions less necessary.

For plotting, I think you should consider using lines instead of bars when representing time series data.

APPROVED