Closed bjarkedahl closed 9 years ago
Very well done!
I really like your solution to question 7. You're going about it slightly inefficiently, but the idea is very nice.
You've done good work using the functions from the stringr (I didn't know about the word function) and dplyr libraries.
PASS
Assignment 1 - Group 16
Read in the datafile
library("readr")
df = read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")
---------------------------------------------------------_
Question 1
---------------------------------------------------------_
We check how the first observations for DateAcquired are listed
head(df$DateAcquired)
library("lubridate")
library("dplyr")
Then we select only the observations classed as Paintings and put them in the dataframe df_stock
df_stock = df %>% filter(Classification == "Painting")
Now we extract the month of DateAcquired into a separate column so we can group the stock by month of the year
df_stock = df_stock[!is.na(df_stock$DateAcquired),] # keep only rows with a non-missing acquisition date
df_stock$ymd = ymd(df_stock$DateAcquired)
df_stock$month = month(df_stock$ymd, label = TRUE, abbr = TRUE) # month abbreviations, used as labels in the figure
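The month extraction above can be tried on a few values first. This is a minimal sketch on made-up dates (the vector `dates` is a hypothetical stand-in for df_stock$DateAcquired), assuming the dates are ISO-formatted strings:

```r
library("lubridate")

# Toy stand-in for df_stock$DateAcquired (values are made up for illustration)
dates <- c("1996-04-09", NA, "2001-11-27")

dates <- dates[!is.na(dates)]   # drop missing dates first
parsed <- ymd(dates)            # character -> Date
month_num <- month(parsed)      # month as a number
month_lab <- month(parsed, label = TRUE, abbr = TRUE)  # ordered factor; label text depends on the locale
```

Because the labelled version is an ordered factor, the months keep calendar order on a plot axis instead of being sorted alphabetically.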
---------------------------------------------------------_
Question 2
---------------------------------------------------------_
We now create a histogram for the stock of paintings which counts how many paintings have been acquired
in a given month during the year. We choose a histogram since it nicely depicts that paintings are mostly
acquired in given months. The histogram is colored red and is given a title and labels
library("ggplot2")
p = ggplot(data = df_stock, aes(x = month))
# month is a discrete factor, so geom_bar() does the counting; geom_histogram() requires a continuous x in current ggplot2
p = p + geom_bar(fill = "red") + labs(title = "Stock of paintings acquired in each month", x = "Month", y = "Number of paintings acquired")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)
---------------------------------------------------------_
Question 3
---------------------------------------------------------_
The same plot is now colored to show whether each painting is curator approved or not.
This is done by adding fill = CuratorApproved to the aesthetics
p = ggplot(data = df_stock, aes(x = month, fill = CuratorApproved))
# geom_bar() counts rows per discrete month; geom_histogram() requires a continuous x in current ggplot2
p = p + geom_bar() + labs(title = "Stock of paintings acquired in each month", x = "Month", y = "Number of paintings acquired")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)
---------------------------------------------------------_
Question 4
---------------------------------------------------------_
A new dataframe containing the stock of paintings grouped by what department they belong to is created and named df_dep
df_dep = df_stock %>%
  filter(!is.na(Department)) %>% # remove paintings with a missing department
  group_by(Department)
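Note that group_by() only marks the groups; the rows themselves are unchanged until a summary is applied. A minimal sketch of how the per-department counts could be made explicit, using a made-up data frame `toy` (hypothetical, with illustrative department names) in place of df_stock:

```r
library("dplyr")

# Toy stand-in for df_stock (department names are illustrative only)
toy <- data.frame(
  Department = c("Painting & Sculpture", "Drawings", NA, "Painting & Sculpture"),
  stringsAsFactors = FALSE
)

dep_counts <- toy %>%
  filter(!is.na(Department)) %>%  # drop rows with a missing department
  group_by(Department) %>%
  summarise(number = n()) %>%     # one row per department
  arrange(desc(number))           # largest department first
```

A table like dep_counts can be printed directly, which is a quick way to check the bar heights in the plot.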
---------------------------------------------------------_
Question 5
---------------------------------------------------------_
A histogram showing the stock of paintings in each department is created.
To make it easier to read the names of the departments, the histogram's axes are switched.
The department "Painting and sculpture" holds by far the largest stock of paintings.
This makes perfect sense, since we are only considering paintings
p = ggplot(data = df_dep, aes(x = Department))
# Department is discrete, so geom_bar() does the counting; geom_histogram() requires a continuous x in current ggplot2
p = p + geom_bar() + coord_flip() + labs(title = "Stock of paintings in each department", x = "Department", y = "Number of paintings")
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))
plot(p)
---------------------------------------------------------_
Question 6
---------------------------------------------------------_
A table containing the top 10 artists is displayed
sort(table(df_stock$Artist),decreasing=TRUE)[1:10]
---------------------------------------------------------_
Question 7
---------------------------------------------------------_
We start by extracting the nationalities from the ArtistBio column
library("stringr")
df_stock$Nationality = str_extract(df_stock$ArtistBio, "[A-Z][a-z]+")
df_stock = df_stock[!is.na(df_stock$Nationality),] # remove observations with a missing nationality
To make sure that we actually get the artist's country of birth, we create a born variable
df_stock$natio = gsub(",", "", df_stock$ArtistBio)
df_stock$natio = gsub("(", "", df_stock$natio, fixed = TRUE) # fixed = TRUE: a bare "(" is not a valid regular expression
df_stock$natio = gsub(")", "", df_stock$natio, fixed = TRUE)
df_stock$born = str_extract(df_stock$natio, "born .* ")
df_stock$born2 = gsub("born", "", df_stock$born)
df_stock$born = gsub("[0-9]", "", df_stock$born2)
df_stock$born = gsub(" ", "", df_stock$born)
df_stock$born3 = str_extract(df_stock$born, "[A-Z][a-z]*")
We use ifelse to pick the birth country where one was found and the first nationality otherwise
df_stock$nationality = ifelse(is.na(df_stock$born3), df_stock$Nationality, df_stock$born3)
sort(unique(df_stock$nationality)) # check whether some nationalities are near-duplicates that should be treated as the same
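The same idea can be written more compactly with a capture group. The following is only a sketch on two made-up ArtistBio strings; the vector `bio` and the assumed layouts ("born <Country>" inside the parentheses, or just a nationality) are illustrative assumptions, not taken from the data:

```r
library("stringr")

# Made-up ArtistBio strings in the two layouts assumed here
bio <- c("(American, born Germany. 1886-1969)", "(French, 1869-1954)")

first_word <- str_extract(bio, "[A-Z][a-z]+")      # first capitalised word
born <- str_match(bio, "born ([A-Z][a-z]+)")[, 2]  # capture group 1; NA if there is no "born ..."

# Prefer the birth country when present, otherwise the first capitalised word
nationality <- ifelse(is.na(born), first_word, born)
```

Like the original pattern, this only captures the first word of the country, so multi-word names would be cut short.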
We had problems converting the nationalities to country names so that they could be merged with the map data
We asked a fellow student (Adam) for help, and he allowed us to use the external files in his GitHub repository
The data on countries and nationalities is tab-separated and comes from https://www.englishclub.com/vocabulary/world-countries-nationality.htm and more.
Nat1 <- read.csv("https://raw.githubusercontent.com/adamingwersen/Data_for_assignment1_SDS/master/Nationalities2.txt", sep="\t", header = FALSE)
Next, we clean up the Nat1 Dataframe by renaming variables, removing the empty column and removing duplicates in Nationality
names(Nat1) = c("Country", "nationality", "") # name the first column Country, the second nationality and leave the third blank
Nat1 = Nat1[, c("Country", "nationality")] # drop the empty column
Nat1[duplicated(Nat1$nationality),] # show the duplicated nationalities
Nat1 = Nat1[!duplicated(Nat1$nationality),] # keep only the first country listed for each duplicated nationality
Creating a dataframe - df_stock_nat - containing nationality and country of birth of the Artist. This is done by "merging" on nationality
library("dplyr")
df_stock_nat = left_join(df_stock, Nat1, by = "nationality") # we do not have country names for all nationalities, so Country is NA for some rows of df_stock_nat
Some values of nationality have no match in Nat1 (for example the birth countries extracted above, which are already country names), so we keep the original value for those rows in the new dataframe
df_stock_nat$Country2 = ifelse(is.na(df_stock_nat$Country), df_stock_nat$nationality, as.character(df_stock_nat$Country))
as.character(df_stock_nat$Country2)
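The join-then-fallback pattern is easy to check on a small example. This sketch uses two made-up data frames, `lookup` and `toy` (hypothetical stand-ins for Nat1 and df_stock):

```r
library("dplyr")

# Toy stand-ins; the real lookup is Nat1 and the real data is df_stock
lookup <- data.frame(Country = c("France", "United States"),
                     nationality = c("French", "American"),
                     stringsAsFactors = FALSE)
toy <- data.frame(nationality = c("French", "Germany", "American"),
                  stringsAsFactors = FALSE)

# left_join keeps all three rows; Country is NA where the lookup has no match
joined <- left_join(toy, lookup, by = "nationality")

# Fall back to the original value when no country was found
joined$Country2 <- ifelse(is.na(joined$Country), joined$nationality, joined$Country)
```

Here "Germany" has no entry in the lookup, so it passes through unchanged, which is the behaviour wanted for values that are already country names.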
Creating a new dataframe - map.df - using the world map from map_data(), a ggplot2 function that is available once ggmap is loaded
library("ggmap")
map.df = map_data("world")
Adding countrycode (iso2c) to the dataframes: map.df and df_stock_nat
library("countrycode")
map.df$iso2c = countrycode(map.df$region, origin = "country.name", destination = "iso2c")
df_stock_nat$iso2c = countrycode(df_stock_nat$Country2, origin = "country.name", destination = "iso2c")
Finding the total number of paintings from each country
library("dplyr")
df_stock_map = df_stock_nat %>%
  select(iso2c) %>%
  group_by(iso2c) %>%
  summarise(number = n())
Joining dataframes: df_stock_map and map.df so we can draw the map
library("dplyr")
df_paintings = inner_join(map.df, df_stock_map, by = "iso2c")
Since American artists account for the majority of the paintings held by MoMA, a log scale could be used for the fill colour; however, such a plot would be hard to interpret.
library("ggplot2")
p = ggplot(df_paintings, aes(x = long, y = lat, group = group, fill = number))
p = p + geom_polygon()
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank())
p = p + labs(title = "Number of painters from each country")
p = p + scale_fill_gradient(low = "#00b3b3", high = "#cccc00", guide = "colorbar")
plot(p)
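For reference, the log version mentioned above could look roughly like this. It is a sketch on made-up polygons (the data frame `toy` and its counts are hypothetical stand-ins for df_paintings), and it takes the logarithm inside aes() rather than changing the scale:

```r
library("ggplot2")

# Toy polygons standing in for df_paintings (two unit squares, made-up counts)
toy <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  number = rep(c(1500, 12), each = 4)
)

# Fill by log10(number) so one very large country does not flatten all the others
p_log <- ggplot(toy, aes(x = long, y = lat, group = group, fill = log10(number))) +
  geom_polygon() +
  scale_fill_gradient(low = "#00b3b3", high = "#cccc00", name = "log10(paintings)")
```

The colour bar then shows powers of ten instead of raw counts, which is part of why the result is harder to read.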
---------------------------------------------------------_
Question 8
---------------------------------------------------------_
library("stringr")
library("dplyr")
We start by parsing the Dimensions variable so we get one variable containing the height and one containing the length
df_stock$dim1 = str_extract(df_stock$Dimensions, "\\([^()]+\\)") # first parenthesised part, assumed to look like "(121.9 x 91.4 cm)"
df_stock$dimension_cm = gsub("(", "", df_stock$dim1, fixed = TRUE)
df_stock$dimension_cm = gsub(")", "", df_stock$dimension_cm, fixed = TRUE)
df_stock$height = word(df_stock$dimension_cm, 1) # first number
df_stock$length = word(df_stock$dimension_cm, -2) # last number before the unit
df_stock$height = as.numeric(df_stock$height) # entries in another layout become NA
df_stock$length = as.numeric(df_stock$length)
df_stock$area = df_stock$length * df_stock$height # calculating the area
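One way to see what the extraction should produce is to run it on a few hand-written strings. This is a sketch only: the vector `dims` is made up, and the assumed layout (inches first, then "(height x width cm)") may not hold for every row of the real Dimensions column:

```r
library("stringr")

# Made-up Dimensions strings; the assumed layout is: inches, then "(H x W cm)"
dims <- c('48 x 36" (121.9 x 91.4 cm)', '10 x 8" (25.4 x 20.3 cm)', "unknown")

# Columns of m: full match, height, width
m <- str_match(dims, "\\(([0-9.]+) x ([0-9.]+) cm\\)")
height <- as.numeric(m[, 2])
width  <- as.numeric(m[, 3])
area   <- height * width  # in cm^2; NA where the pattern did not match
```

Rows that do not follow the assumed layout end up as NA, so they can be counted and inspected before any ranking is done.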
We do not get the correct measures but here's how we would display the 5 largest and smallest paintings:
Creating a dataframe including the 5 biggest paintings
df.area1 = df_stock %>%
  arrange(desc(area)) %>%
  slice(1:5)
Creating a dataframe including the 5 smallest paintings
df.area2 = df_stock %>%
  arrange(area) %>%
  slice(1:5)
Combining the 2 dataframes into one dataframe including the 5 biggest and 5 smallest paintings
df.area = rbind(df.area1, df.area2)
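The ranking step can be checked on a toy table. In this sketch the data frame `toy`, its titles and its areas are all made up; it assumes an area column like the one computed above, with one missing value to show how NA is handled:

```r
library("dplyr")

# Toy stand-in for df_stock with a made-up area column (one value missing)
toy <- data.frame(Title = paste("Painting", 1:12),
                  area = c(50, 900, 20, NA, 400, 10, 700, 30, 1200, 60, 5, 300))

biggest  <- toy %>% arrange(desc(area)) %>% slice(1:5)  # arrange() puts NA last
smallest <- toy %>% arrange(area) %>% slice(1:5)        # NA is last here as well
both <- rbind(biggest, smallest)
```

Because arrange() sorts missing values to the end in both directions, rows without a usable area do not show up among either the biggest or the smallest five.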