sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 7 assignment #29

Closed by neiljg 8 years ago

neiljg commented 8 years ago

library("readr")
library("knitr")
library("devtools")
library("plyr")
library("dplyr")
library("ggplot2")
library("lubridate")
library("countrycode")
library("mapdata")
library("ggmap")
library("maps")
library("stringr")

The user should change the path to match the location of the file nationality.csv

mypath="C:/Users/Neil/Documents/Polit studiet/Kandidat/3. semester/Social Data Science/Assignment 1/nationality.csv"

df.all <- read_csv("https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv")

Question 1

Cleaning data: removing observations without DateAcquired and restricting the data to paintings

df = df.all %>% filter(!is.na(DateAcquired))
df = df %>% filter(Classification=="Painting")

Change DateAcquired to date format that includes only month and year

df$shortdate <- strftime(df$DateAcquired,"%Y-%m")

Sorting data by date to get the cumulative stock

df <- df[order(as.Date(paste(df$shortdate,"-01",sep=""), format="%Y-%m-%d")),]

Creating a new column with the cumulative stock of works

df <- data.frame(df[1:15],1)
colnames(df)[16] <- "ones"
df <- data.frame(df[1:16],cumsum(df$ones))
colnames(df)[17] <- "Stock"
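The same running total can be computed without relying on column positions. The following is a minimal sketch on an invented data frame (the `toy` object, its titles, and its dates are assumptions for illustration, not MoMA records):

```r
# Toy acquisitions; titles and dates are invented for illustration.
toy <- data.frame(
  Title = c("A", "B", "C", "D"),
  DateAcquired = as.Date(c("1931-05-10", "1929-11-19", "1931-05-02", "1930-01-15")),
  stringsAsFactors = FALSE
)

# Truncate to year-month, sort chronologically, then count cumulatively.
toy$shortdate <- format(toy$DateAcquired, "%Y-%m")
toy <- toy[order(toy$shortdate), ]
toy$Stock <- seq_len(nrow(toy))  # same values as cumsum() over a column of ones

toy[, c("Title", "shortdate", "Stock")]
```

Because `order()` leaves ties in their original order, works acquired in the same month keep their input order.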

Question 2

Creating figure

Re-configure date variable to include a nominal day value

df$newdate <- as.Date(paste(df$shortdate,"-01",sep=""))

p = ggplot(df, aes(x = as.Date(newdate), y = cumsum(ones)/1000)) + labs(x = "Time", y = "Cumulative stock, 1,000", title = "Stock of Paintings in MoMA")
p + geom_line(color="red")

Question 3

curator = df %>% group_by(newdate, CuratorApproved) %>% summarise(Stock =n())

curator2= curator %>% group_by(CuratorApproved) %>% mutate(Stock1 = cumsum(Stock))

p = ggplot(curator2, aes(x = as.Date(newdate), y = Stock1)) + labs(x = "Time", y = "Cumulative stock", title = "Stock of Paintings in MoMA")
p + geom_line(aes(group=CuratorApproved, colour = CuratorApproved))
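The per-group cumulative count used for this figure can be checked on a small example. This is a sketch under stated assumptions: the `toy` rows are invented, and the explicit `arrange()` is added here only to make the ordering within each group visible.

```r
library(dplyr)

# Invented acquisitions: month of acquisition and curator flag.
toy <- data.frame(
  newdate = as.Date(c("1930-01-01", "1930-01-01", "1930-02-01",
                      "1930-02-01", "1930-03-01")),
  CuratorApproved = c("Y", "N", "Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# Count works per month and flag, then accumulate within each flag.
stock <- toy %>%
  count(newdate, CuratorApproved, name = "Stock") %>%
  group_by(CuratorApproved) %>%
  arrange(newdate, .by_group = TRUE) %>%
  mutate(Stock1 = cumsum(Stock)) %>%
  ungroup()

stock
```

Each flag gets its own running total, so the two lines in the plot do not add up to each other.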

Question 4

Having conditioned the data to include only paintings, only four departments remain

table(df$Department)

Question 5

p = ggplot(df, aes(x = Department)) + geom_bar()
plot(p)

We can see that one department clearly dominates

Question 6

artists <- as.data.frame(table(df$Artist))
artists10 <- head(artists[rev(order(artists$Freq)),],10)

artists10

Here, we find the 10 artists who have contributed the most paintings, and the number of paintings by each.
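An equivalent ranking can be sketched with dplyr, which the script already loads. The `toy` data and the cut-off of 2 are assumptions chosen to keep the example small:

```r
library(dplyr)

# Invented artist column; each row stands for one painting.
toy <- data.frame(Artist = c("A", "B", "A", "C", "A", "B"),
                  stringsAsFactors = FALSE)

# Count paintings per artist, most frequent first, and keep the top 2.
top2 <- toy %>%
  count(Artist, sort = TRUE) %>%
  head(2)

top2
```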

Question 7

The first piece of code pulls out the first character string in the ArtistBio column, unless the column contains the word "born", in which case it pulls out the character string following "born".

df$birthplace <- apply(df, 1, function(x) ifelse(length(grep("born",x[3])), gsub(pattern = "(.*, born)(.*)(. .*)", replacement = "\\2",x[3]), substring(gsub(",.*$", "", x[3]),2)))

However, in some cases "born" is not followed by a character string, just numbers, i.e. a year of birth, in which case we still want the first character string in the ArtistBio column. Other exceptions are not captured, and will thus not be matched to countries.

df$birthplace <- apply(df, 1, function(x) ifelse(length(grep("born",x[19])), substring(gsub(",.*$", "", x[19]),2), x[19]))
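The two-step rule described above can be written as a single vectorised helper. This is a sketch, not the script's own method: `extract_origin` is a hypothetical name, the sample strings are invented, and the "(Nationality, born Place. years)" layout is an assumption about the ArtistBio column.

```r
# Invented ArtistBio strings in the assumed "(Nationality, born Place. years)" layout.
bio <- c("(French, 1869-1954)",
         "(American, born Germany. 1880-1966)",
         "(American, born 1936)")

# Hypothetical helper: the place after "born" if it is a word,
# otherwise the leading nationality.
extract_origin <- function(x) {
  # Text between "born " and the next period; digits are not allowed,
  # so "born 1936" does not match.
  place <- sub("^.*, born ([^0-9.]+)\\..*$", "\\1", x)
  has_place <- place != x  # sub() returns x unchanged when nothing matched
  # Text between the opening parenthesis and the first comma.
  first <- sub("^\\(([^,]+),.*$", "\\1", x)
  ifelse(has_place, place, first)
}

extract_origin(bio)
# → "French" "Germany" "American"
```

Working on the whole vector avoids `apply()` over rows and the dependence on column positions such as `x[3]` and `x[19]`.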

Many of the birthplace variables are nationalities, e.g. "French", instead of "France". We import a data set to translate nationalities to country names, collected manually from the web. The import uses the mypath variable from the start of this script.

nat <- read_csv(mypath)
nat$birthplace <- nat$V2
nat <- nat[,c("V1","birthplace")]
nat$birthplace <- substr(nat$birthplace,1,nchar(nat$birthplace)-1)

Here, we merge country names by matching nationalities.

df2 <- left_join(df,nat)
df2$country <- df2$V1

If no match has been made, we simply assign the birthplace column again, as in some cases this column already contains actual country names rather than nationalities.

df2$country[is.na(df2$V1)] <- df2$birthplace[is.na(df2$V1)]
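The lookup with fallback can be illustrated on a few rows. This is a sketch under stated assumptions: the `nat` and `works` tables below are invented stand-ins, and `coalesce()` is used here in place of the indexed assignment above.

```r
library(dplyr)

# Invented lookup table (country name, nationality adjective) and extracted origins.
nat <- data.frame(V1 = c("France", "Denmark"),
                  birthplace = c("French", "Danish"),
                  stringsAsFactors = FALSE)
works <- data.frame(Artist = c("A", "B", "C"),
                    birthplace = c("French", "Germany", "Danish"),
                    stringsAsFactors = FALSE)

# Join on the nationality adjective; where no row matches (V1 is NA),
# fall back to the raw birthplace string.
works2 <- works %>%
  left_join(nat, by = "birthplace") %>%
  mutate(country = coalesce(V1, birthplace))

works2$country
# → "France" "Germany" "Denmark"
```

Naming the join key with `by` makes the match explicit instead of relying on the columns the two tables happen to share.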

We now create a new variable with UN country codes, based on the country names.

df2$code <- countrycode(df2$country,"country.name","un")

sum(is.na(df2$code))

115 of the paintings have not been assigned a country code, either because the string analysis was unsuccessful, or because the country does not match the countrycode database, e.g. "Russia (now Latvia)".

Now, we import a world map.

world <- map_data("world")

And create a UN country code variable.

world$code <- countrycode(world$region,"country.name","un")

We count the number of paintings by country

df3 <- count(df2,code)

And merge this data to the map data set.

world_data <- right_join(df3, world)

If no data has been merged, then the country has contributed 0 paintings.

world_data$n[is.na(world_data$n)] <- 0
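The count, join, and zero-fill steps can be checked on a small example. The tables below are invented stand-ins for `df2` and the map data. One design point worth noting: dplyr joins treat `NA` keys as equal by default, so this sketch drops paintings without a code before joining; otherwise they would be attached to any map region whose code is also `NA`.

```r
library(dplyr)

# Invented country codes per painting, and a stand-in for the map rows.
paintings <- data.frame(code = c(840, 840, 250, NA))
regions   <- data.frame(region = c("USA", "France", "Chile"),
                        code   = c(840, 250, 152),
                        stringsAsFactors = FALSE)

# Count paintings per code, ignoring paintings without a code.
counts <- paintings %>%
  filter(!is.na(code)) %>%
  count(code)

# Keep every map row; regions with no paintings get n = 0.
map_counts <- regions %>%
  left_join(counts, by = "code") %>%
  mutate(n = coalesce(n, 0L))

map_counts
```

`left_join(regions, counts)` keeps the same rows as `right_join(counts, regions)`; only the column order differs.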

p = ggplot(world_data, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = n)) + expand_limits() + theme_minimal()
p

We can see that artists born in the USA contribute by far the most paintings.

Question 8

We find the metric dimensions by extracting the string from within the parentheses of the Dimensions column.

df$size=str_extract(df$Dimensions, "([0.0-9.9]+ x [0.0-9.9]+ cm)")

The length is then the first part of this string, the width is the second.

df$sizeL=gsub("x [0.0-9.9]+ cm", "", df$size)
df$sizeB=gsub("[0.0-9.9]+ x", "", df$size)

df$sizeB=gsub("cm", "", df$sizeB)

The area is the product of these two numbers in cm squared.

df$areal = as.numeric(df$sizeB)*as.numeric(df$sizeL)
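The extraction and the area calculation can also be done in one pass with capture groups. This is a sketch, assuming the Dimensions column follows an 'inches (cm)' layout; the sample strings and the names `length_cm`, `width_cm`, and `area_cm2` are invented for illustration.

```r
library(stringr)

# Invented Dimensions strings in the assumed 'inches (cm)' layout.
dims <- c('25 1/2 x 31 3/4" (64.8 x 80.6 cm)',
          '8 x 10" (20.3 x 25.4 cm)',
          "Dimensions variable")

# Capture the two numbers inside "(... x ... cm)".
# str_match() returns NA for rows that do not match.
m <- str_match(dims, "\\(([0-9.]+) x ([0-9.]+) cm\\)")
length_cm <- as.numeric(m[, 2])  # first number
width_cm  <- as.numeric(m[, 3])  # second number
area_cm2  <- length_cm * width_cm

area_cm2
```

Rows without a metric size, such as the third string, end up as `NA` and can be dropped before ranking.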

Ranking

df4 <- df[!is.na(df$areal),]
df4 <- df4[order(df4$areal, decreasing = FALSE),]

The 5 smallest paintings and the artist.

head(df4[,c(2,23)],5)

The 5 largest paintings and the artist.

tail(df4[,c(2,23)],5)

neiljg commented 8 years ago

"","V1","V2","V3"
"1","Afghanistan ","Afghan ","an Afghan"
"2","Algeria ","Algerian ","an Algerian"
"3","Angola ","Angolan ","an Angolan"
"4","Argentina ","Argentine ","an Argentine"
"5","Austria ","Austrian ","an Austrian"
"6","Australia ","Australian ","an Australian"
"7","Bangladesh ","Bangladeshi ","a Bangladeshi"
"8","Belarus ","Belarusian ","a Belarusian"
"9","Belgium ","Belgian ","a Belgian"
"10","Bolivia ","Bolivian ","a Bolivian"
"11","Bosnia and Herzegovina ","Bosnian/Herzegovinian ","a Bosnian/a Herzegovinian"
"12","Brazil ","Brazilian ","a Brazilian"
"13","Britain ","British ","a Briton (informally: a Brit)"
"14","Bulgaria ","Bulgarian ","a Bulgarian"
"15","Cambodia ","Cambodian ","a Cambodian"
"16","Cameroon ","Cameroonian ","a Cameroonian"
"17","Canada ","Canadian ","a Canadian"
"18","Central African Republic ","Central African ","a Central African"
"19","Chad ","Chadian ","a Chadian"
"20","China ","Chinese ","a Chinese person"
"21","Colombia ","Colombian ","a Colombian"
"22","Costa Rica ","Costa Rican ","a Costa Rican"
"23","Croatia ","Croatian ","a Croat"
"24","the Czech Republic ","Czech ","a Czech person"
"25","Democratic Republic of the Congo ","Congolese ","a Congolese person (note: this refers to people from the Republic of the Congo as well)"
"26","Denmark ","Danish ","a Dane"
"27","Ecuador ","Ecuadorian ","an Ecuadorian"
"28","Egypt ","Egyptian ","an Egyptian"
"29","El Salvador ","Salvadoran ","a Salvadoran (also accepted are Salvadorian & Salvadorean)"
"30","England ","English ","an Englishman/Englishwoman"
"31","Estonia ","Estonian ","an Estonian"
"32","Ethiopia ","Ethiopian ","an Ethiopian"
"33","Finland ","Finnish ","a Finn"
"34","France ","French ","a Frenchman/Frenchwoman"
"35","Germany ","German ","a German"
"36","Ghana ","Ghanaian ","a Ghanaian"
"37","Greece ","Greek ","a Greek"
"38","Guatemala ","Guatemalan ","a Guatemalan"
"39","Holland ","Dutch ","a Dutchman/Dutchwoman"
"40","Honduras ","Honduran ","a Honduran"
"41","Hungary ","Hungarian ","a Hungarian"
"42","Iceland ","Icelandic ","an Icelander"
"43","India ","Indian ","an Indian"
"44","Indonesia ","Indonesian ","an Indonesian"
"45","Iran ","Iranian ","an Iranian"
"46","Iraq ","Iraqi ","an Iraqi"
"47","Ireland ","Irish ","an Irishman/Irishwoman"
"48","Israel ","Israeli ","an Israeli"
"49","Italy ","Italian ","an Italian"
"50","Ivory Coast ","Ivorian ","an Ivorian"
"51","Jamaica ","Jamaican ","a Jamaican"
"52","Japan ","Japanese ","a Japanese person"
"53","Jordan ","Jordanian ","a Jordanian"
"54","Kazakhstan ","Kazakh ","a Kazakhstani (used as a noun, a Kazakh refers to an ethnic group, not a nationality)"
"55","Kenya ","Kenyan ","a Kenyan"
"56","Laos ","Lao ","a Laotian (used as a noun, a Lao refers to an ethnic group, not a nationality)"
"57","Latvia ","Latvian ","a Latvian"
"58","Libya ","Libyan ","a Libyan"
"59","Lithuania ","Lithuanian ","a Lithuanian"
"60","Madagascar ","Malagasy ","a Malagasy"
"61","Malaysia ","Malaysian ","a Malaysian"
"62","Mali ","Malian ","a Malian"
"63","Mauritania ","Mauritanian ","a Mauritanian"
"64","Mexico ","Mexican ","a Mexican* (may be offensive in the USA. Use someone from Mexico instead.)"
"65","Morocco ","Moroccan ","a Moroccan"
"66","Namibia ","Namibian ","a Namibian"
"67","New Zealand ","New Zealand ","a New Zealander"
"68","Nicaragua ","Nicaraguan ","a Nicaraguan"
"69","Niger ","Nigerien ","a Nigerien"
"70","Nigeria ","Nigerian ","a Nigerian"
"71","Norway ","Norwegian ","a Norwegian"
"72","Oman ","Omani ","an Omani"
"73","Pakistan ","Pakistani ","a Pakistani* (may be offensive in the UK. Use someone from Pakistan instead.)"
"74","Panama ","Panamanian ","a Panamanian"
"75","Paraguay ","Paraguayan ","a Paraguayan"
"76","Peru ","Peruvian ","a Peruvian"
"77","The Philippines ","Philippine ","a Filipino* (someone from the Philippines)"
"78","Poland ","Polish ","a Pole* (someone from Poland, a Polish person)"
"79","Portugal ","Portuguese ","a Portuguese person"
"80","Republic of the Congo ","Congolese ","a Congolese person (note: this refers to people from the Democratic Republic of the Congo as well)"
"81","Romania ","Romanian ","a Romanian"
"82","Russia ","Russian ","a Russian"
"83","Saudi Arabia ","Saudi, Saudi Arabian ","a Saudi, a Saudi Arabian"
"84","Scotland ","Scottish ","a Scot"
"85","Senegal ","Senegalese ","a Senegalese person"
"86","Serbia ","Serbian ","a Serbian (used as a noun, a Serb refers to an ethnic group, not a nationality"
"87","Singapore ","Singaporean ","a Singaporean"
"88","Slovakia ","Slovak ","a Slovak"
"89","Somalia ","Somalian ","a Somalian"
"90","South Africa ","South African ","a South African"
"91","Spain ","Spanish ","a Spaniard* (a Spanish person, someone from Spain)"
"92","Sudan ","Sudanese ","a Sudanese person"
"93","Sweden ","Swedish ","a Swede"
"94","Switzerland ","Swiss ","a Swiss person"
"95","Syria ","Syrian ","a Syrian"
"96","Thailand ","Thai ","a Thai person"
"97","Tunisia ","Tunisian ","a Tunisian"
"98","Turkey ","Turkish ","a Turk"
"99","Turkmenistan ","Turkmen ","a Turkmen / the Turkmens"
"100","Ukraine ","Ukranian ","a Ukranian"
"101","The United Arab Emirates ","Emirati ","an Emirati"
"102","The United States ","American ","an American"
"103","Uruguay ","Uruguayan ","a Uruguayan"
"104","Vietnam ","Vietnamese ","a Vietnamese person"
"105","Wales ","Welsh ","a Welshman/Welshwoman"
"106","Zambia ","Zambian ","a Zambian"
"107","Zimbabwe ","Zimbabwean ","a Zimbabwean"

sebastianbarfort commented 8 years ago

Very good assignment.

You're using apply functions, which is very nice.

Nice use of the piping operator.

Keep up the good work!

APPROVED