Enhance column dataset splitting

jeff1evesque / ist-687

Syracuse IST687 final project with Jesse Warren (team member)


Enhance column dataset splitting #14

Closed jeff1evesque closed 6 years ago

jeff1evesque commented 6 years ago

It seems many article names contain underscores. For example, are_belong_to_us is part of the article name in row 13 (row 12 is a similar case):

dataframe

This is likely a byproduct of the space character being converted to an underscore in article names before the dataset was created. So we'll need to be slightly more clever when exploding the original first column into multiple columns. Additionally, we should attempt to create a dedicated column for the language type.

jeff1evesque commented 6 years ago

We can split on the following:

- first period instance
  - split on last instance of underscore, to obtain a language column
- any successive underscore instances

The following will better visualize the process:

excel snippet
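
The three splits listed above can be sketched in base R roughly as follows. This is only an illustration under assumptions: the sample strings, the `split_first()` helper, and the `part1`, `part2`, ... column names are hypothetical and not taken from the project's code, and every row is assumed to have the same number of underscore-separated pieces after the first period.

```r
# Hypothetical sample values; the real column contents are not shown in this thread.
first <- c(
  "All_your_base_are_belong_to_us_en.wikipedia.org_desktop_all-agents",
  "2NE1_zh.wikipedia.org_all-access_spider"
)

# Hypothetical helper: explode one character vector into several columns.
split_first <- function(x) {
  # 1. Split on the first period only.
  head <- sub("\\..*$", "", x)      # everything before the first period
  tail <- sub("^[^.]*\\.", "", x)   # everything after the first period

  # 2. Split the head on its last underscore to obtain the language.
  article  <- sub("_[^_]*$", "", head)
  language <- sub("^.*_", "", head)

  # 3. Split the tail on every remaining underscore.
  #    Assumes each row yields the same number of pieces.
  parts <- do.call(rbind, strsplit(tail, "_", fixed = TRUE))
  colnames(parts) <- paste0("part", seq_len(ncol(parts)))

  data.frame(Article = article, Language = language, parts,
             stringsAsFactors = FALSE)
}

out <- split_first(first)
print(out)
```

Because step 1 keys on the first period, an article name that itself contains a period would be cut too early in this sketch.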

jeff1evesque commented 6 years ago

51022de: this doesn't seem like the most efficient approach, since the runtime seems costly. However, given that we need to match several different patterns across the dataframe column, some of which can occur multiple times, it will at least serve as a temporary solution.

jeff1evesque commented 6 years ago

Unfortunately, it seems that our first column is not being split:

dataframe

This is most likely because the following logic from basic.R is not executing:

## explode column: second column into Article, and Language columns
df1 <- cbind(
  colsplit(df1$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
  df1[,-which(names(df1) == 'first')]
)
df2 <- cbind(
  colsplit(df2$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
  df2[,-which(names(df2) == 'first')]
)

jeff1evesque commented 6 years ago

The corresponding traceback confirms our guess:

> ## explode column: second column into Article, and Language columns
> df1 <- cbind(
+   colsplit(df1$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
+   df1[,-which(names(df1) == 'first')]
+ )
Error in perl("_(?=[^_]+$)") : could not find function "perl"
> df2 <- cbind(
+   colsplit(df2$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
+   df2[,-which(names(df2) == 'first')]
+ )
Error in perl("_(?=[^_]+$)") : could not find function "perl"
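
One way to sidestep the missing `perl()` helper, whichever package was expected to provide it, is to apply the same lookahead pattern with base R's `strsplit(..., perl = TRUE)`. The sketch below is not the fix committed to basic.R; the sample values and the `pieces` and `split_cols` names are hypothetical.

```r
# Hypothetical sample values standing in for df1$first.
first <- c("All_your_base_are_belong_to_us_en", "2NE1_zh")

# Same pattern as in basic.R: an underscore that is followed only by
# non-underscore characters up to the end of the string (the last underscore).
pieces <- strsplit(first, "_(?=[^_]+$)", perl = TRUE)

# Take the first and second piece of each split; a value with no
# underscore would yield NA in the Language column.
split_cols <- data.frame(
  Article  = vapply(pieces, `[`, character(1), 1),
  Language = vapply(pieces, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

print(split_cols)
```

The resulting `split_cols` could then be passed to `cbind()` with the remaining columns, in the same way the `colsplit()` result is used above.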

jeff1evesque commented 6 years ago

4ae1f6a: from a quick pass, it seems our implementation with regex is sufficient:

dataframe

However, a few cases don't seem to work out:

dataframe