Feature to automatically tag useful sequences

usnistgov / nestor-tmp2

Quantifying tacit knowledge for investigatory analysis


It would be useful to have a feature, similar to replace special words, that takes a regex string and automatically tags every item matching it. This would help when a data set contains Sales Order numbers that occur infrequently but still carry meaning for an issue. For example, if the number 123456-04 appeared in a comment and that was the only mention of this particular number, I would still like it to be tagged as an item. The format of the number is consistent across all of the issues, but each individual number occurs with low frequency. An example input to this feature would be

Regex String - r"\d{6}[-]\d{2}" Tag - [Item]

This would then denote that I would like all number sequences with that pattern to be labeled as an Item. Regex can be confusing to write if the user is not familiar with it, but websites like this one exist to help explain regex syntax: https://regex101.com/
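To make the request concrete, here is a minimal sketch of the behavior using only Python's standard `re` module. The names `PATTERN_TAGS` and `tag_matches` are hypothetical and are not part of nestor; the point is only that a single occurrence is enough to produce a tag.

```python
import re

# Hypothetical user input: regex string -> tag. Not nestor's actual API.
PATTERN_TAGS = {r"\d{6}-\d{2}": "Item"}


def tag_matches(text, pattern_tags=PATTERN_TAGS):
    """Return (matched_string, tag) pairs for every regex match in text."""
    found = []
    for pattern, tag in pattern_tags.items():
        for match in re.finditer(pattern, text):
            found.append((match.group(0), tag))
    return found


# A number mentioned only once is still tagged.
print(tag_matches("Customer reported drift on order 123456-04 after install."))
```

Because the match depends on the pattern rather than on how often a token occurs, low-frequency order numbers would not be lost the way they are in frequency-ranked tagging.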

As an added feature, it would be useful to exclude from the tagging process all numbers that do not match these regex strings. When I was tagging a data set from Aerotech's internal issue tracker, I was overwhelmed with unnecessary numbers from the comments on the issues, such as drive current, performance specifications, etc.
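One way the exclusion could work, sketched with the standard `re` module: find every standalone numeric token and keep it only if it fully matches the user's pattern. The `NUMBER` regex below is my own assumption about what counts as "a number", and `drop_unmatched_numbers` is a hypothetical name, not an existing nestor function.

```python
import re

KEEP = re.compile(r"\d{6}-\d{2}")  # the Sales Order pattern from above
# Assumed definition of a standalone number: digits, optionally joined
# by '.', ',' or '-'. Word boundaries leave codes like 'a1401' alone.
NUMBER = re.compile(r"\b\d+(?:[.,-]\d+)*\b")


def drop_unmatched_numbers(text):
    """Remove numeric tokens unless the whole token matches KEEP."""

    def keep_or_drop(match):
        token = match.group(0)
        return token if KEEP.fullmatch(token) else ""

    cleaned = NUMBER.sub(keep_or_drop, text)
    return " ".join(cleaned.split())  # collapse the leftover whitespace


print(drop_unmatched_numbers("Order 123456-04 drive current 4.5 A at 320 V"))
```

A number written directly against its unit (for example 320V) is not a standalone token under this definition and would pass through untouched, so a real implementation would need to decide how to treat those.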

After a little discussion, I think this would be best implemented as an extension to the special_replace sub-function. The underlying code is going through a couple of changes before the full v0.3 release, but it shouldn't be too hard to implement something like this.

I'm thinking match groups could be mapped to (a list of) tags/aliases, using the named-group functionality in pandas:

Say we have the first letter for machine type, the next two specify which machine, and the last two which part/component of that machine...

import pandas as pd

s = pd.Series(['a1401', 'a1402', 'a1201', 'b0322'])
s.str.extract(r'(?P<part>(?P<asset>(?P<machine>[ab])[0-9]{2})[0-9]{2})')

>>> 
    part asset machine
0  a1401   a14       a
1  a1402   a14       a
2  a1201   a12       a
3  b0322   b03       b

Then the special replace would substitute each occurrence of the number with all of the matched groups and their values, so here 'a1401' would be "special replaced" into part_a1401 asset_a14 machine_a

This should be pretty flexible: if, say, you don't care to have a tag for the specific part, you could just leave that match group unnamed, and only asset/machine get "special replaced".
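A rough sketch of that substitution with the standard `re` module, to show the idea end to end. The function name `special_replace_groups` is hypothetical and this is not the existing special_replace code; it only illustrates how named groups could turn into name_value tags and how leaving a group unnamed drops its tag.

```python
import re

# Same pattern as the pandas example: machine letter, asset, full part code.
PART_CODE = re.compile(
    r"(?P<part>(?P<asset>(?P<machine>[ab])[0-9]{2})[0-9]{2})"
)


def special_replace_groups(text, pattern=PART_CODE):
    """Replace each match with 'name_value' tags, one per named group."""
    # Order the group names by their position in the pattern.
    names = sorted(pattern.groupindex, key=pattern.groupindex.get)

    def to_tags(match):
        return " ".join(
            f"{name}_{match.group(name)}"
            for name in names
            if match.group(name) is not None
        )

    return pattern.sub(to_tags, text)


print(special_replace_groups("replaced a1401 bearing"))

# Leaving the outer group unnamed drops the part tag entirely.
NO_PART = re.compile(r"((?P<asset>(?P<machine>[ab])[0-9]{2})[0-9]{2})")
print(special_replace_groups("replaced a1401 bearing", NO_PART))
```

Only named groups appear in `pattern.groupindex`, which is what makes the "just don't name it" option fall out for free.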

@MichaelPBrundage , care to chime in?

Feature to automatically tag useful sequences #59