Closed trinker closed 6 years ago
For now this could be stored with termco as a metatags attribute: essentially a hash table of sub tags and parent tags (2 columns, the meta and the sub).
This can be added in one of two ways: via the __ separator (or whatever separator is specified), or the user can pass it in.
There would be an add_metatags(termco_obj, tags) function [returns termco] and an add_metatags<- replacement function [acts directly on the termco obj in place] for manually adding the tags afterward.
The separator approach would look for the separator on every term in term_count; if found, it is extracted and the metatags hash is created.
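A minimal sketch of the separator approach in base R. The helper name make_metatags and the column names meta and tag are hypothetical, not termco's API; it assumes a single separator and treats names without the separator as having no parent.

```r
## Hypothetical helper (not part of termco): build a two-column metatags
## table from tag names that use a separator such as '__'.
make_metatags <- function(tags, sep = '__') {
    has_sep <- grepl(sep, tags, fixed = TRUE)
    parts <- strsplit(tags, sep, fixed = TRUE)
    ## the piece before the first separator is the parent (meta) tag
    meta <- vapply(parts, `[`, character(1), 1)
    ## everything after the first separator is the sub tag
    sub <- vapply(parts, function(p) paste(p[-1], collapse = sep), character(1))
    data.frame(
        meta = ifelse(has_sep, meta, NA_character_),
        tag = ifelse(has_sep, sub, tags),
        stringsAsFactors = FALSE
    )
}

make_metatags(c('pos__verbs', 'pos__articles', 'summons'))
```

Tags without a separator keep their full name and get NA as the parent, which makes unmatched terms easy to spot.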
It also makes sense to have a tidy_term_count function for turning the object into long format. The tidy method would look for the metatags attribute and, if found, put that on the output. Then plot methods like discrimination, distribution, and co_occurence would be available by default.
Note: this part didn't make sense after actually hooking up the plumbing, since the original rows of those without a tag are dropped.
It'd be nice if the ggplot code that was used to make these sorts of plots was returned as well for easy modification.
metatags could take a multi-column frame, giving multiple parents, including nested ones, just so long as every term in term_count is matched (otherwise a warning is thrown, and thrown quickly, before term_count runs). This could be accomplished by using 2+ separators (I could detect this but that's a lot of extra work and problem prone) or by passing the multiple columns to add_metatags directly.
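One way the 2+ separator route and the early match check could look, sketched in base R. The function split_metatags is hypothetical; its sep and meta_names arguments only mirror the meta.sep and meta.names arguments used later in this thread, and the warning text is made up.

```r
## Hypothetical sketch (not termco's implementation): split tag names on an
## ordered set of separators, one per level, and warn early if a tag is
## missing any level.
split_metatags <- function(tags, sep = c('__', '.'),
    meta_names = paste0('meta_', seq_along(sep))) {

    rest <- tags
    out <- list()
    for (i in seq_along(sep)) {
        ## as.integer() strips the match attributes that regexpr() attaches
        pos <- as.integer(regexpr(sep[i], rest, fixed = TRUE))
        found <- pos > 0
        out[[meta_names[i]]] <- ifelse(found, substr(rest, 1, pos - 1), NA_character_)
        rest <- ifelse(found, substr(rest, pos + nchar(sep[i]), nchar(rest)), rest)
    }
    out[['tag']] <- rest
    out <- as.data.frame(out, stringsAsFactors = FALSE)

    ## every term must be matched at every level, otherwise warn right away
    unmatched <- tags[!stats::complete.cases(out)]
    if (length(unmatched) > 0) {
        warning('No metatag found for: ', paste(unmatched, collapse = ', '), call. = FALSE)
    }
    out
}

split_metatags(c('pos__t1.adverbs', 'pos__t2.conjunctions', 'people__t1.title'))
```

Because the check runs on the names alone, it can fire before any counting is done.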
collapse_tags & rename_tags (originally update_names) would need to remake the metatags, as would drop_tags if this ever becomes a function. [Went with select_counts rather than select_tags, as grouping.vars could be selected as well; used counts b/c this is a counts table.]
This needs to happen in both term_count and token_count.
For the most part it is dangerous to alter the metatags after altering column names, so they are dropped instead with a warning.
The metatags attribute MUST have a column named tag.
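The drop-with-a-warning behaviour noted above could be sketched roughly as follows. The function drop_stale_metatags is hypothetical and only illustrates the idea on a plain data frame; it assumes the metatags attribute is a data frame with a tag column whose values should match column names.

```r
## Hypothetical sketch (not part of termco): once tag columns have been
## renamed, drop a stale 'metatags' attribute with a warning instead of
## trying to repair it.
drop_stale_metatags <- function(x) {
    mt <- attr(x, 'metatags')
    if (is.null(mt)) return(x)
    if (!all(mt[['tag']] %in% colnames(x))) {
        warning('Column names no longer match `metatags`; dropping the attribute.', call. = FALSE)
        attr(x, 'metatags') <- NULL
    }
    x
}

counts <- data.frame(verbs = 1:2, articles = 3:4)
attr(counts, 'metatags') <- data.frame(meta = 'pos', tag = c('verbs', 'articles'))
colnames(counts)[1] <- 'gerunds'
counts <- suppressWarnings(drop_stale_metatags(counts))
is.null(attr(counts, 'metatags'))
```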
validate_term_count <- termco:::validate_term_count
## test for term and token counts
tidy_counts <- function(x, n = Inf, ...){
validate_term_count(x)
if (!isTRUE(attributes(x)[['amodel']])) {
warning(
paste0(
'\n`x` is not an expert rules model (i.e., it wasn\'t made by setting `grouping.var = TRUE`)\n',
'\nResults are likely wrong or will fail!'
), call. = FALSE
)
}
x_grp <- dplyr::bind_cols(group_cols(x), x[,'n.words', drop = FALSE])
if (!isTRUE(attributes(x)[['amodel']])) {
if ('id' %in% colnames(x_grp)) colnames(x_grp)[colnames(x_grp) %in% 'id'] <- 'id_temp_termco'
x_grp[['id']] <- seq_len(nrow(x_grp))
}
x_grp[['id']] <- as.character(x_grp[['id']])
out <- dplyr::left_join(
textshape::tidy_list(classify(x, n = Inf), 'id', 'tag'),
x_grp,
by = 'id'
)
if ('id_temp_termco' %in% colnames(out)) {
out[['id']] <- NULL
colnames(out)[colnames(out) %in% 'id_temp_termco'] <- 'id'
}
out <- dplyr::as_tibble(out)
## Add metatags data
if (isTRUE(check_meta_tags(x))) {
## merge meta tags onto tidy tags
out <- dplyr::left_join(out, attributes(x)[['metatags']], by = 'tag')
## reorder to put meta tags before tags
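cols <- colnames(out)
meta_cols <- setdiff(colnames(attributes(x)[['metatags']]), 'tag')
tag_pos <- match('tag', setdiff(cols, meta_cols))
out <- out[, append(setdiff(cols, meta_cols), meta_cols, after = tag_pos - 1), drop = FALSE]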
}
## add class
class(out) <- c('tidy_counts', class(out))
out
}
check_meta_tags <- function(x, ...){
if (is.null(attributes(x)[['metatags']]) || !is.data.frame(attributes(x)[['metatags']])) return(FALSE)
if (!'tag' %in% colnames(attributes(x)[['metatags']]) ) {
type <- ifelse(is.null(attributes(x)[['tokens']]), 'term', 'token')
warning(paste0(
sprintf('`%s_count` object has a `metatags` attribute with no `tag` column. ', type),
'The `metatags` attribute will not be used.'
), call. = FALSE)
return(FALSE)
}
return(TRUE)
}
trpl_list2 <- list(
list(
discourse_markers__t1.response_cries = c("\\boh", "\\bah", "\\baha", "\\bouch", "yuk"),
discourse_markers__t1.back_channels = c("uh[- ]huh", "uhuh", "yeah"),
discourse_markers__t1.summons = "hey",
discourse_markers__t2.justification = "because",
pos__t1.adverbs = '\\b\\w*[a-z]ly',
pos__t1.verbs = '\\b\\w+[a-z]ing',
pos__t1.articles = '\\b(the|an?)\\b',
pos__t2.conjunctions = '\\b(and|but|or)\\b',
pos__t3.pronouns = '\\b(hi[sm]|hers?|your?s?|th(em|eir))\\b',
people__t1.title = c('mr.', 'mister', 'president', 'governor')
),
list(discourse_markers__t1.summons ='the'),
list(discourse_markers__t4.summons = 'it', other__t1.justification = 'ed\\s')
)
x <- with(presidential_debates_2012, term_count(dialogue, TRUE, trpl_list2, meta.sep = c('__', '.'), meta.names = c('meta_1', 'meta_2')))
attributes(x)[['metatags']]
token_list <- list(
list(
noun__w1.person = c('sam', 'i')
),
list(
noun__w2.place = c('here', 'house'),
noun__w3.thing = c('boat', 'fox', 'rain', 'mouse', 'box', 'eggs', 'ham')
),
list(
negative__w1.no_like = c('not like'),
noun__w3.thing = c('train', 'goat')
),
list(
other__w1.other = '^.*$'
)
)
(x <- token_count(sam_i_am, grouping.var = TRUE, token.list = token_list, meta.sep = c('__', '.'), meta.names = c('meta_1', 'meta_2')))
attributes(x)[['metatags']]
for testing
Finished up and closed by: https://github.com/trinker/termco/commit/47e9a824b7fdbfdc4abf1b64535efbe7b4f406c1
metatags is an official attribute that can be used to group common tags
together. This is common in qualitative coding, where one tags text and then
groups these subtags together into coherent metatags. It is used by
tidy_counts and can be used by other future features.
In the categories list one could use __ to indicate a level. For example: programs__r, programs__python, programs__visual_fortran.
These levels could then be processed differently by various termco functions.
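As one illustration of processing levels differently, counts could be rolled up to the parent level. This is a base R sketch under assumptions: rollup_counts is a hypothetical helper, not a termco function, and the counts are invented for the example.

```r
## Hypothetical sketch: sum tag counts up to their parent level, taking the
## parent to be whatever precedes the first '__' in each name.
rollup_counts <- function(counts, sep = '__') {
    parent <- vapply(strsplit(names(counts), sep, fixed = TRUE), `[`, character(1), 1)
    vapply(split(counts, parent), sum, numeric(1))
}

## made-up counts, for illustration only
counts <- c(programs__r = 5, programs__python = 3,
    programs__visual_fortran = 1, editors__emacs = 2)
rollup_counts(counts)
```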