jeff1evesque / ist-687

Syracuse IST687 final project with Jesse Warren (team member)

Create logic to combine month columns #23

Closed jeff1evesque closed 6 years ago

jeff1evesque commented 6 years ago

After the first column has been exploded into four columns (i.e. Access, Agent, Article, Language) in basic.R, we need to aggregate the successive date columns based on the XYYYY.mm pattern. This means the daily columns that share the same month prefix need to be summed with one another.

Note: we may later choose to move this logic to a dedicated custom R package.
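
The aggregation described above could be sketched roughly as follows. This is only an illustration, not the code committed for this issue: the sample data frame is hypothetical (it mirrors the XYYYY.mm.dd column naming, with one made-up August column added so that two months exist), and the names date_cols, month_keys, monthly and df_monthly are placeholders.

```r
# Hypothetical sample: four label columns followed by daily XYYYY.mm.dd columns.
df <- data.frame(
  Access = 'all-access',
  Agent = 'spider',
  Article = '2NE1',
  Language = 'zh',
  X2015.07.01 = 18,
  X2015.07.02 = 11,
  X2015.07.03 = 5,
  X2015.08.01 = 7,   # made-up value, only here to provide a second month
  stringsAsFactors = FALSE
)

# Daily columns are the ones named XYYYY.mm.dd.
date_cols <- grep('^X[0-9]{4}\\.[0-9]{2}\\.[0-9]{2}$', names(df), value = TRUE)

# The month key is the first 8 characters of the name, e.g. 'X2015.07'.
month_keys <- substr(date_cols, 1, 8)

# For each month, sum its daily columns row by row.
monthly <- lapply(
  unique(month_keys),
  function(key) rowSums(df[, date_cols[month_keys == key], drop = FALSE])
)
names(monthly) <- unique(month_keys)

# Re-attach the label columns to the monthly totals.
df_monthly <- cbind(
  df[, setdiff(names(df), date_cols), drop = FALSE],
  as.data.frame(monthly)
)
```

Grouping by the name prefix avoids iterating over dates altogether, so there is no loop condition to get wrong.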

jeff1evesque commented 6 years ago

109119b: like our dataset/ directory, we do not need to version control anything within the visualization/ directory, since its contents are generated at runtime and could vary from one execution to the next.

jeff1evesque commented 6 years ago

We've manually tested the following:

df1temp <- df1[1,1:7]
df2temp <- df2[1,1:7]

## year range
df1temp_start_date <- as.Date(colnames(df1temp)[5], format='X%Y.%m.%d')
df1temp_end_date <- as.Date(colnames(df1temp)[length(colnames(df1temp))], format='X%Y.%m.%d')
df2temp_start_date <- as.Date(colnames(df2temp)[5], format='X%Y.%m.%d')
df2temp_end_date <- as.Date(colnames(df2temp)[length(colnames(df2temp))], format='X%Y.%m.%d')

## combine columns
while (df1temp_start_date <= df1temp_end_date) {
  Reduce(
    '+',
    df1temp[,grep(paste0('X',format(df1temp_start_date,"%Y.%m")),names(df1temp))]
  )
}

while (df2temp_start_date <= df2temp_end_date) {
  Reduce(
    '+',
    df2temp[,grep(paste0('X',format(df2temp_start_date,"%Y.%m")),names(df2temp))]
  )
}

However, the R console has seemed stuck for the last 5 minutes. The df1temp and df2temp values are as follows:

> df1temp
      Access  Agent Article Language X2015.07.01 X2015.07.02 X2015.07.03
1 all-access spider    2NE1       zh          18          11           5
> df2temp
      Access  Agent Article Language X2015.07.01 X2015.07.02 X2015.07.03
1 all-access spider    2NE1       zh          18          11           5

jeff1evesque commented 6 years ago

We tried to simplify our while loop to the following:

df1temp <- df1[1,1:7]
df2temp <- df2[1,1:7]

## year range
df1temp_start_date <- as.Date(colnames(df1temp)[5], format='X%Y.%m.%d')
df1temp_end_date <- as.Date(colnames(df1temp)[length(colnames(df1temp))], format='X%Y.%m.%d')
df2temp_start_date <- as.Date(colnames(df2temp)[5], format='X%Y.%m.%d')
df2temp_end_date <- as.Date(colnames(df2temp)[length(colnames(df2temp))], format='X%Y.%m.%d')

## local variables
start_date1 <- df1temp_start_date
start_date2 <- df2temp_start_date

## combine columns
while (start_date1 <= df1temp_end_date) {
  paste('yes')
}

while (start_date2 <= df2temp_end_date) {
  paste('yes')
}

However, after 10+ minutes, our logic still appears to be running. This likely means we need to adjust our loop structure, or find a different implementation.
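
One thing worth noting about both versions is that nothing in the loop body changes the start date, so the while condition stays TRUE and the loop cannot exit. The following is a minimal sketch of a loop that does terminate, assuming a hypothetical stand-in for df1temp built from the values printed earlier; current and totals are illustrative names, not part of the committed code.

```r
# Hypothetical stand-in for df1temp, using the values printed in this thread.
df1temp <- data.frame(
  Access = 'all-access', Agent = 'spider', Article = '2NE1', Language = 'zh',
  X2015.07.01 = 18, X2015.07.02 = 11, X2015.07.03 = 5,
  stringsAsFactors = FALSE
)

start_date <- as.Date(colnames(df1temp)[5], format = 'X%Y.%m.%d')
end_date <- as.Date(colnames(df1temp)[ncol(df1temp)], format = 'X%Y.%m.%d')

# Snap to the first of the month so that stepping by one month is well defined.
current <- as.Date(format(start_date, '%Y-%m-01'))
totals <- list()

while (current <= end_date) {
  key <- paste0('X', format(current, '%Y.%m'))
  cols <- grep(key, names(df1temp), fixed = TRUE)

  # Store the monthly sum; the loops above computed a value but discarded it.
  totals[[key]] <- rowSums(df1temp[, cols, drop = FALSE])

  # Advance the loop variable; without this step the condition never changes.
  current <- seq(current, by = 'month', length.out = 2)[2]
}
```

With the sample row, the loop runs once for July 2015 and then stops, because current has moved past end_date.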

jeff1evesque commented 6 years ago

Our committed changes produce a df_aggregate2 dataframe:

[image: dataframe]

Note: we manually verified that df1_aggregate is very similar in structure.