Closed ChrisNygaard closed 9 years ago
Good job.
I think you might consider using dplyr
more systematically when doing data manipulation. It makes using sapply
and related functions less necessary.
For plotting, I think you should consider using lines instead of bars when representing time series data.
APPROVED
ASSIGNMENT 1
Packages & data ----------------------------------------------------------
First data is loaded and packages are opened
library("readr") library("lubridate") library("ggplot2") library("stringr") library("maps") library("countrycode")
df <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")
After loading the data we make a quick summary statistic to investigate whether any
variables have missing values. This is done using the code below
sapply(df, function(x) sum(is.na(x)))
It shows that 4434 works on MOMA have no date of acquisition
We remove these observations and continue with the remaining 119.485 obs.
df <- df[complete.cases(df),] df <- na.omit(df)
Question 1&2 -------------------------------------------------------------
We find paintings in the data
then we create a variable with the value of the month (january is month 1, february month 2 etc.)
after this we create a variable with the name of the months and factor it with the values,
so it isn't sorted alphabetically df.paint <- subset(df, Classification == "Painting") df.paint$monthval <- format(df.paint$DateAcquired,format = "%m") df.paint$month <- format(df.paint$DateAcquired,format = "%B") df.paint$month <- factor(df.paint$month, levels = df.paint$month[order(df.paint$monthval)])
and then the plot, where we use the amount of paintings sold in each month
p = ggplot(data=df.paint, aes(x = month)) p + geom_bar(fill = "black") + labs(title = "Paintings per month", x = "Month", y = "Number of paintings")
Question 3 -------------------------------------------------------------
We again use ggplot, but the fill is weighted with the variable "CuratorApproved", to show how many paintings were approved
p = ggplot(data=df.paint, aes(x = month)) p + geom_bar(aes(fill = CuratorApproved)) + labs(title = "Paintings per month", x = "Month", y = "Number of paintings")
Question 4 -------------------------------------------------------------
We assume you don't mean "paintings" since all paintings are in the same department
hence, in this dataset, we use all works of art.
In this question the stock of art is grouped by department and month of aqusition
df.paint.dept = df df.paint.dept$monthval <- format(df.paint.dept$DateAcquired,format = "%m") df.paint.dept$month <- format(df.paint.dept$DateAcquired,format = "%B") df.paint.dept$month <- factor(df.paint.dept$month, levels = df.paint$month[order(df.paint$monthval)])
df.paint.dept <- df.paint.dept %>% group_by(Department, month)
Question 5 -------------------------------------------------------------
This time we make a plot of accumulated number of paintings by department grouped
by each month. As it is clear from the plot the biggist increase in paintings has
been in the 'paintings and sculpture' department
p=ggplot(data=df.paint.dept, aes(x=month)) p+geom_bar(aes(fill=Department), position=position_dodge()) + scale_fill_brewer(palette = "Blues")
Question 6 -------------------------------------------------------------
We create a dataframe where we count the number of paintings by each artist at MOMA
and then sort this list in order to show the 10 painters with the most paintings at MOMA
df.paint.artist = df.paint %>% group_by(Artist) %>% summarise( Paintings.by.artist = n()) %>% arrange(-Paintings.by.artist)
We then print the first 10 obs. hence the 10 painters with the most paintings
head(df.paint.artist, n=10)
Question 7 -------------------------------------------------------------
df.paint$nationality= str_extract(df.paint$ArtistBio, "[A-z]+")
Next we use a file created which match country and nationality
df.nation=read.table("C:/Users/Christoffer/Desktop/Uni/Social Data Science/R-filer/Country_Nationality.csv",sep=";", header=TRUE)
We change the name in the dataset from 'adjective' to 'nationality
names(df.nation)[names(df.nation)=="adjective"] <-"nationality" df.paint.country=left_join(df.paint, df.nation)
it is nice to investigate how many NA there are e.g. how many 'nationality'
variables has no corresponding country
df.NA=subset(df.paint.country, is.na(df.paint.country$country))
we then group by country
df.NA=df.NA %>% group_by(nationality) %>% summarise( countrycode=n())
We correct the mising obs. in the following way:
1) American are matched with USA
2) Argentine are matched with Argentina
3) Croatian are matched with Croatia
4) Israel is added
5) Korean are matched with South Korea
6) South are matched with South Africa
we accept no match for a nationality called 'born' and for two NA
We then use the countrycode package to add an iso3c variable to match the mapping data
df.paint.country$countrycode <- countrycode(df.paint.country$country, "country.name", "iso3c") world=map_data("world")
we add the iso3c again to the world data
world$countrycode <- countrycode(world$region, "country.name", "iso3c")
we are now ready to match the two datasets
df.paint.by.country <- df.paint.country %>% group_by(nationality) %>% summarise( Paintings.by.nationality = n()) df.question7=left_join(df.paint.country, df.paint.by.country) df.question7 <- left_join(world, df.question7)
We now remake a variable, to put the number of paintings in different categories, namely
1-9,10-19,20-29,30-39,40-49,50-99,100-199,200-499 & >500
df.question7$paintbynatinterval <- cut(df.question7$Paintings.by.nationality, breaks = c(-Inf,10,20,30,40,50,100,200,500,Inf), labels=c("1-9", "10-19", "20-29", "30-39", "40-49", "50-99", "100-199","200-499",">500"), right=FALSE)
p=ggplot(df.question7, aes(x=long, y=lat, group=group)) p + geom_polygon(aes(fill=paintbynatinterval), colour = "black") + scale_fill_brewer(palette = "Blues")
Question 8 -------------------------------------------------------------
df.dimensions <- df
First we use a grouping, where we extract the dimensions in cm
df.dimensions$DimensionsExtract <- sub(".?(([\d]+.?[\d])\s[xX\xD7]\s([\d]+.?[\d]).", "\1 \2", df.dimensions$Dimensions, perl = TRUE)
We remove problematic observations and then split observations in height and with.
df.dimensions <- filter(df.dimensions, grepl("[\d]+.?[\d]\s[\d]+.?[\d]$", df.dimensions$DimensionsExtract, perl = TRUE)) df.dimensions <- filter(df.dimensions, grepl("^[\d.]*\s[\d.]+$", df.dimensions$DimensionsExtract, perl = TRUE)) df.dimensions$Height <- as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,1]) df.dimensions$Width <- as.numeric(str_split_fixed(df.dimensions$DimensionsExtract, " ", 2)[,2])
Then we create the area variable
df.dimensions$area <- df.dimensions$Height * df.dimensions$Width
We do as previously, only grouping by area instead.
Then we arrange descending then showing head (top) and tail (bottom)
df.dimensions.size <- df.dimensions %>% group_by(area) %>% arrange(-area) head(df.dimensions.size, n=5) tail(df.dimensions.size, n=5)