trinker / termco

Regular Expression Counts of Terms and Substrings

incorporate multi level tags #29

Closed trinker closed 6 years ago

trinker commented 7 years ago

In the categories list, one could use __ to indicate a level. For example:

programs__r, programs__python, programs__visual_fortran

These levels could then be processed differently by various termco functions.
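As a rough illustration of the naming scheme, the level can be recovered by splitting each category name on the separator. This is a base R sketch only; the variable names are made up and nothing here is termco API.

```r
## Hypothetical category names using `__` to separate the parent level from the tag
categories <- c('programs__r', 'programs__python', 'programs__visual_fortran')

## Split on the first `__` only (treated as a fixed string, not a regex)
pieces <- regmatches(
    categories,
    regexpr('__', categories, fixed = TRUE),
    invert = TRUE
)

## First piece is the parent level, second piece is the tag itself
parent  <- vapply(pieces, `[`, character(1), 1)
sub_tag <- vapply(pieces, `[`, character(1), 2)

levels_df <- data.frame(parent = parent, tag = sub_tag, stringsAsFactors = FALSE)
levels_df
```

Splitting on the first occurrence only keeps single underscores inside a tag (as in visual_fortran) intact.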

trinker commented 6 years ago

For now this could be stored with termco as a metatags attribute: essentially a hash table of sub tags and parent tags (2 columns, the meta and the sub). This can be added in one of two ways: via the __ (or whatever separator is specified), OR the user can pass it in. There would be an add_metatags(termco_obj, tags) function [returns termco] and add_metatags<- [acts directly on the termco obj in place] for manually adding the tags afterward. The separator approach would look for the separator on every term in term_count; if found, it is extracted and the metatags hash is created.
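A minimal sketch of that idea, assuming the counts object is a plain data frame and that the lookup keeps the full column name in a tag column. The function names sketch_add_metatags and sketch_add_metatags<- are hypothetical stand-ins, not the signatures termco ended up with.

```r
## Hypothetical helper: build a (tag, meta) lookup from the column names of a
## counts table, or accept one from the user, and attach it as `metatags`.
sketch_add_metatags <- function(x, tags = NULL, sep = '__') {
    if (is.null(tags)) {
        ## keep only the columns whose name contains the separator
        nms <- grep(sep, colnames(x), fixed = TRUE, value = TRUE)
        pos <- as.integer(regexpr(sep, nms, fixed = TRUE))
        tags <- data.frame(
            tag = nms,
            meta = substr(nms, 1, pos - 1),  # text before the separator
            stringsAsFactors = FALSE
        )
    }
    stopifnot(is.data.frame(tags), 'tag' %in% colnames(tags))
    attr(x, 'metatags') <- tags
    x
}

## Replacement form, so the lookup can be assigned manually afterward
`sketch_add_metatags<-` <- function(x, value) sketch_add_metatags(x, tags = value)

## Toy counts table (made-up data)
counts <- data.frame(
    n.words = c(10, 12),
    programs__r = c(1, 0),
    programs__python = c(0, 2),
    other = c(3, 1)
)

counts <- sketch_add_metatags(counts)
attr(counts, 'metatags')
```

The replacement form would be used as `sketch_add_metatags(counts) <- my_lookup`, where my_lookup is any data frame with a tag column.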

It also makes sense to have a tidy_term_count object for turning the object into long format. The tidy method would look for the metatags attribute and, if found, put it on the output. Plot methods like discrimination, distribution, and co_occurrence would then be available by default. It'd be nice if the ggplot code used to make these sorts of plots was returned as well for easy modification. Note: this part didn't make sense after actually hooking up the plumbing, since the original rows of those without a tag are dropped.

trinker commented 6 years ago

metatags could take a multi-column frame, i.e. multiple parents, including nested ones, just so long as every term in term_count is matched (otherwise a warning is thrown, and thrown quickly, before term_count runs). This could be accomplished by using 2+ separators (I could detect this, but that's a lot of extra work and problem prone) or by passing the multiple columns to add_metatags directly.
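The multi-separator variant might look something like the sketch below: each separator is consumed in order and yields one parent column, and any tag left without a parent triggers a warning. The helper split_meta and its argument names are assumptions made for illustration.

```r
## Hypothetical sketch: split tag names on several separators, in order,
## producing one parent column per separator.
split_meta <- function(tags, seps = c('__', '.'),
    meta_names = paste0('meta_', seq_along(seps))) {

    stopifnot(length(seps) == length(meta_names))

    out <- data.frame(tag = tags, stringsAsFactors = FALSE)
    rest <- tags

    for (i in seq_along(seps)) {
        ## position of the first occurrence of this separator (-1 if absent)
        pos <- as.integer(regexpr(seps[i], rest, fixed = TRUE))
        found <- pos > 0
        out[[meta_names[i]]] <- ifelse(found, substr(rest, 1, pos - 1), NA_character_)
        ## drop the consumed parent and separator before the next pass
        rest <- ifelse(found, substr(rest, pos + nchar(seps[i]), nchar(rest)), rest)
    }

    ## warn early about tags that did not match every separator
    unmatched <- tags[!stats::complete.cases(out)]
    if (length(unmatched) > 0) {
        warning('No parent found for: ', paste(unmatched, collapse = ', '), call. = FALSE)
    }

    out
}

split_meta(c('pos__t1.adverbs', 'pos__t2.conjunctions', 'people__t1.title'))
```

Passing the parent columns in directly avoids the parsing entirely, which is the less problem-prone of the two routes mentioned above.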

trinker commented 6 years ago

collapse_tags and update_names/rename_tags would need to remake the metatags, as would drop_tags/select_tags if this ever becomes a function. [Went with select_counts, since grouping.vars could be selected as well; used "counts" because this is a counts table.]

This needs to happen in both term_count and token_count.

Note

For the most part it is dangerous to alter the metatags after altering column names, so they are dropped instead, with a warning.
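The drop-with-a-warning behavior could be sketched as follows, assuming a plain data frame carrying a metatags attribute. The function rename_and_drop_meta is hypothetical and only illustrates the policy, not termco's actual renaming functions.

```r
## Hypothetical sketch: rename columns, then drop a now-stale `metatags` attribute
rename_and_drop_meta <- function(x, old, new) {
    colnames(x)[match(old, colnames(x))] <- new
    if (!is.null(attr(x, 'metatags'))) {
        warning('Column names changed; dropping the `metatags` attribute.', call. = FALSE)
        attr(x, 'metatags') <- NULL
    }
    x
}

## Toy counts table with a metatags lookup attached (made-up data)
counts <- data.frame(programs__r = 1:2, programs__python = 3:4)
attr(counts, 'metatags') <- data.frame(
    tag = colnames(counts),
    meta = 'programs',
    stringsAsFactors = FALSE
)

## Warning suppressed here only to keep the example output quiet
renamed <- suppressWarnings(rename_and_drop_meta(counts, 'programs__r', 'r_lang'))
is.null(attr(renamed, 'metatags'))
```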

trinker commented 6 years ago

The metatags attribute MUST have a column named tag.

trinker commented 6 years ago
validate_term_count <- termco:::validate_term_count

## test for term and token counts
tidy_counts <- function(x, n = Inf, ...){

    validate_term_count(x) 
    if (!isTRUE(attributes(x)[['amodel']])) {
        warning(
            paste0(
                '\n`x` is not an expert rules model (i.e., it wasn\'t made by setting `grouping.var = TRUE`)\n',
                '\nResults are likely wrong or will fail!'
            ), call. = FALSE
        )
    }

    x_grp <- dplyr::bind_cols(group_cols(x), x[,'n.words', drop = FALSE])

    if (!isTRUE(attributes(x)[['amodel']])) {
        if ('id' %in% colnames(x_grp)) colnames(x_grp)[colnames(x_grp) %in% 'id'] <- 'id_temp_termco'
        x_grp[['id']] <- seq_len(nrow(x_grp))
    }

    x_grp[['id']] <- as.character(x_grp[['id']])

    out <- dplyr::left_join(
        textshape::tidy_list(classify(x, n = Inf), 'id', 'tag'), 
        x_grp, 
        by = 'id'
    )

    if ('id_temp_termco' %in% colnames(out)) {
        out[['id']] <- NULL
        colnames(out)[colnames(out) %in% 'id_temp_termco'] <- 'id'
    }

    ## `tbl_df()` is deprecated in current dplyr; `as_tibble()` is the replacement
    out <- dplyr::as_tibble(out)

    ## Add metatags data 
    if (isTRUE(check_meta_tags(x))) {
        ## merge meta tags onto tidy tags
        out <- dplyr::left_join(out, attributes(x)[['metatags']], by = 'tag')

        ## reorder to put meta tags (and all other columns) before tags
        out <- dplyr::bind_cols(out[, !colnames(out) %in% 'tag', drop = FALSE], 
            out[, 'tag', drop = FALSE])
    }

    ## add class
    class(out) <- c('tidy_counts', class(out))

    out

}

check_meta_tags <- function(x, ...){
    meta <- attributes(x)[['metatags']]
    ## `metatags` is a data frame, so test its type rather than using `isTRUE()`
    if (is.null(meta) || !is.data.frame(meta)) return(FALSE)
    if (!'tag' %in% colnames(meta)) {
        type <- if (is.null(attributes(x)[['tokens']])) 'term' else 'token'
        warning(paste0(
            sprintf('`%s_count` object has a `metatags` attribute with no `tag` column. ', type),
            'The `metatags` attribute will not be used'
        ), call. = FALSE)
        return(FALSE)
    }
    return(TRUE)
}
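To make the merge-and-reorder step in tidy_counts concrete, here is a toy version using base R's merge in place of dplyr::left_join. The data frames are invented, and the snippet only assumes that the long-format output and the metatags lookup share a tag column.

```r
## Toy long-format output: one row per (id, tag) pair (made-up data)
tidy <- data.frame(
    id = c('1', '1', '2'),
    tag = c('adverbs', 'verbs', 'title'),
    stringsAsFactors = FALSE
)

## Toy metatags lookup with the required `tag` column
meta <- data.frame(
    tag = c('adverbs', 'verbs', 'title'),
    meta_1 = c('pos', 'pos', 'people'),
    stringsAsFactors = FALSE
)

## Left join on `tag` (base R stand-in for dplyr::left_join)
merged <- merge(tidy, meta, by = 'tag', all.x = TRUE, sort = FALSE)

## Move `tag` to the end so the meta columns come before it
merged <- merged[, c(setdiff(colnames(merged), 'tag'), 'tag'), drop = FALSE]
colnames(merged)
```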
trinker commented 6 years ago
trpl_list2 <- list(
    list(
        discourse_markers__t1.response_cries = c("\\boh", "\\bah", "\\baha", "\\bouch", "yuk"),
        discourse_markers__t1.back_channels = c("uh[- ]huh", "uhuh", "yeah"),
        discourse_markers__t1.summons = "hey",
        discourse_markers__t2.justification = "because",
        pos__t1.adverbs = '\\b\\w*[a-z]ly',
        pos__t1.verbs = '\\b\\w+[a-z]ing',
        pos__t1.articles = '\\b(the|an?)\\b',
        pos__t2.conjunctions = '\\b(and|but|or)\\b',
        pos__t3.pronouns = '\\b(hi[sm]|hers?|your?s?|th(em|eir))\\b',
        people__t1.title = c('mr.', 'mister', 'president', 'governor')
    ),
    list(discourse_markers__t1.summons ='the'),
    list(discourse_markers__t4.summons = 'it', other__t1.justification = 'ed\\s')
)

x <- with(presidential_debates_2012, term_count(dialogue, TRUE, trpl_list2, meta.sep = c('__', '.'), meta.names = c('meta_1', 'meta_2')))
attributes(x)[['metatags']]

token_list <- list(
    list(
        noun__w1.person = c('sam', 'i')
    ),
    list(
        noun__w2.place = c('here', 'house'),
        noun__w3.thing = c('boat', 'fox', 'rain', 'mouse', 'box', 'eggs', 'ham')
    ),
    list(
        negative__w1.no_like = c('not like'),
        noun__w3.thing = c('train', 'goat')
    ),
    list(
        other__w1.other = '^.*$'
    )
)

(x <- token_count(sam_i_am, grouping.var = TRUE, token.list = token_list, meta.sep = c('__', '.'), meta.names = c('meta_1', 'meta_2')))
attributes(x)[['metatags']]

The examples above are for testing.

trinker commented 6 years ago

Finished up and closed by: https://github.com/trinker/termco/commit/47e9a824b7fdbfdc4abf1b64535efbe7b4f406c1