Closed jeff1evesque closed 6 years ago
We can split on the following:
The following will better visualize the process:
51022de: this doesn't seem like the most efficient approach, since the runtime looks costly. However, given that we need to match several different patterns across the dataframe column, some of which can occur multiple times, it will at least serve as a temporary solution.
Unfortunately, it seems that our `first` column is not being split.
This is most likely because the following logic from `basic.R` is not executing:
## explode column: second column into Article, and Language columns
df1 <- cbind(
  colsplit(df1$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
  df1[,-which(names(df1) == 'first')]
)
df2 <- cbind(
  colsplit(df2$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
  df2[,-which(names(df2) == 'first')]
)
The corresponding traceback confirms our guess:
> ## explode column: second column into Article, and Language columns
> df1 <- cbind(
+ colsplit(df1$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
+ df1[,-which(names(df1) == 'first')]
+ )
Error in perl("_(?=[^_]+$)") : could not find function "perl"
> df2 <- cbind(
+ colsplit(df2$first, pattern=perl('_(?=[^_]+$)'), c('Article', 'Language')),
+ df2[,-which(names(df2) == 'first')]
+ )
Error in perl("_(?=[^_]+$)") : could not find function "perl"
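The error says no `perl` function is in scope when the script runs. As far as I can tell, `perl()` was a pattern helper in older stringr releases, so either stringr is not attached or the installed version no longer provides it. One way around this, sketched below, is to keep the same pattern but pass it to base R's `strsplit()` with `perl = TRUE`. The sample `df1` and its `count` column are hypothetical stand-ins, since the real data is not shown here.

```r
# Hypothetical stand-in for df1; the real column values are not shown in this thread.
df1 <- data.frame(
  first = c('Main_Page_en', 'Berlin_de'),
  count = c(10, 20),
  stringsAsFactors = FALSE
)

# Same pattern as in basic.R (split on the last underscore only), but handed
# to base strsplit() with perl = TRUE, so no perl() helper is required.
parts <- strsplit(df1$first, '_(?=[^_]+$)', perl = TRUE)

# Each element of `parts` has two pieces here, so rbind() yields a 2-column matrix.
split_cols <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(split_cols) <- c('Article', 'Language')

# Replace the original `first` column with the two new columns.
df1 <- cbind(split_cols, df1[, names(df1) != 'first', drop = FALSE])
```

This assumes every value in `first` contains at least one underscore; rows without one would produce uneven pieces and need separate handling.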
4ae1f6a: from a quick pass, it seems our implementation with regex is sufficient:
However, a few cases don't seem to work out:
It seems many articles have underscores in their names. For example, `are_belong_to_us` is part of the article name for row 13 (a similar case is present on row 12):

This is likely a byproduct of space characters being converted to underscores when article names are normalized, prior to the creation of the dataset. So we'll need to be slightly more clever when exploding the original `first` column into multiple columns. Additionally, we should attempt to create a dedicated column for the language type.
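A minimal sketch of one way to do this, assuming the language code is always the final underscore-separated token: a greedy capture keeps every earlier underscore inside the article name, and the last token goes into its own column. The sample values and language codes below are illustrative, not taken from the dataset.

```r
# Illustrative values only; the real dataset rows are not reproduced here.
first <- c('all_your_base_are_belong_to_us_en', 'Main_Page_de', 'Berlin_fr')

# Group 1 is greedy, so it absorbs every underscore except the last one;
# group 2 is the trailing token, assumed here to be the language code.
m <- regmatches(first, regexec('^(.*)_([^_]+)$', first))

# Element 1 of each match is the full string, 2 and 3 are the capture groups.
# Values that do not match the pattern yield NA in both columns.
exploded <- data.frame(
  Article  = vapply(m, function(x) x[2], character(1)),
  Language = vapply(m, function(x) x[3], character(1)),
  stringsAsFactors = FALSE
)
```

If some rows carry no language suffix at all, the last token of the article name would be mistaken for one, so checking `Language` against a known list of codes may be worthwhile.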