pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.78k stars 17.97k forks

get start and end of regex match in dataframe #8747

Open teese opened 10 years ago

teese commented 10 years ago

What about including a method to get the start and end of each match after a regex search of items in a DataFrame? Perhaps using .str.extract?

Returning the start as a new column would perhaps be as follows:

df['start'] = df['string'].str.extract(pattern, output = 'start')

An alternative suggestion from jkitchen on StackOverflow was to use start_index = True or end_index = True:

df['start'] = df['string'].str.extract(pattern, start_index = True)

For multiple parameters (e.g. start and end) as outputs, there needs to be a way to avoid running the search twice. One solution would be to give the output as a tuple:

df['regex_output_tuple'] = df['string'].str.extract(pattern, output = ('start','end'))

I don't use regex very often, so I don't know if there are other parameters that people want after a regex search. If there really is just the text in the groups, the start and the end, perhaps there's a way to put the output directly into new columns?

df['groups'], df['start'], df['end']  = df['string'].str.extract(pattern, output = ('groups','start','end'))

I think it makes sense that non-matches return a NaN, just as in the regular extract function. This would mix integer and float datatypes in the df['start'] column, but I guess we all know about that situation :)
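On the dtype point, here is a minimal sketch of what the NaN mixing looks like and one way around it. It assumes a pandas version that ships the nullable "Int64" extension dtype, which did not exist when this issue was opened. The helper name match_start and the shortened pattern are made up for illustration.

```python
import re

import pandas as pd

pattern = re.compile(r"P-*I-*M")  # illustrative pattern, not from the issue


def match_start(text):
    """Return the start offset of the first match, or NaN if there is none."""
    m = pattern.search(text)
    return m.start() if m else float("nan")


s = pd.Series(["MPIMG", "MP-IMG", "AAAAA"])

# Integers mixed with NaN are stored as float64.
start_float = s.apply(match_start)

# Casting to the nullable integer dtype keeps the offsets as integers
# and represents the non-match as <NA>.
start_int = start_float.astype("Int64")
```

With the nullable dtype the offsets stay integers, although they still need a check for missing values before being used as slice bounds.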

I'm not an experienced programmer, so sorry if I misunderstood some basic concepts.

Please see the question in StackOverflow for example code and comments: http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

A block of example data and code is below, as requested by jreback.

import pandas as pd
import re
#some example query sequences, markup strings, hit sequences.
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})

#create a regex search string to find the appropriate subset in the query sequence, 
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

#Pandas has a nice extract function to slice out the matched sequence from the query:
df['extracted'] = df['query'].str.extract(regex_desired_region_from_query)

#However I need the start and end of the match in order to extract the equivalent regions 
#from the markup and hit columns. For a single string, this is done as follows:
match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
print('sliced_hit, non-vectorized example: ', sliced_hit)

#HERE the new syntax is necessary
#e.g. df['start'], df['end']  = df['string'].str.extract(pattern, output = ('start','end'))

#My current workaround in pandas is as follows.
#define function to obtain regex output (start, stop, etc) as a tuple
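def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    #return NaN for both positions if there is no match, so that apply does not fail
    if m is None:
        return (float('nan'), float('nan'))
    return (m.start(), m.end())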
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)

jreback commented 10 years ago

can you provide a short but specific example of what exactly is needed/wanted here (make runnable as much as possible and indicate where syntax is needed)

teese commented 10 years ago

I've added the code to the question as requested. Sorry it's not short, but it contains some real data to help explain why it is necessary to obtain the regex start and end. The code should work in both Python 2.7 and 3.4, and with the latest pandas release (0.15.0). In my case, I will apply the above workaround to ~5000 dataframes, each containing ~5000 rows, with significantly longer sequences (~500 characters in each string).
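For that volume of data, a somewhat leaner variant of the workaround might look like the sketch below. It compiles the pattern once, does one search per row, and builds both columns in a single step. The helper name span_or_nan and the column name hit_region are illustrative, and the final slicing step assumes every row matched (rows holding NaN would have to be skipped first).

```python
import re

import pandas as pd

# Two of the aligned sequences from the example above.
df = pd.DataFrame({
    "query": ["MPIMGSSVYITVELAIAVLAILG", "MPI-MGSSVYITVELAIAVLAIL"],
    "hit": ["MPTMGFWVYITVELAIAVLAILG", "MPLETQDALYVALELAIAALSVA"],
})

# Compile once; the same pattern object is reused for every row.
pattern = re.compile(r"P-*I-*M-*G-*S-*S-*")


def span_or_nan(text):
    """Return (start, end) of the first match, or (NaN, NaN) if none."""
    m = pattern.search(text)
    return m.span() if m else (float("nan"), float("nan"))


# One regex search per row; the tuples become two columns directly.
spans = pd.DataFrame(
    [span_or_nan(q) for q in df["query"]],
    columns=["start", "end"],
    index=df.index,
)
df = df.join(spans)

# Slice the equivalent region out of the hit column
# (assumes no NaN in start/end, i.e. every query matched).
df["hit_region"] = [
    hit[start:end] for hit, start, end in zip(df["hit"], df["start"], df["end"])
]
```

This is still a Python-level loop over the rows, so it is not vectorized in the sense the feature request asks for. It only avoids the intermediate tuple column and the repeated apply calls.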

jreback commented 10 years ago

so it actually sounds like you want a function like extract, but one that returns the matched indices.

e.g. df['query'].indices(regex_desired_region_from_query, outtype='list|frame')

subtle issue is whether the match can return just (start,end) or a list of matches (not sure what that would look like)
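One possible shape for the "list of matches" case is a long-format frame with one row per match, indexed by (row, match) in the same spirit as str.extractall. This is only a sketch built on re.finditer, not an existing pandas method; the data and column names are invented for illustration.

```python
import re

import pandas as pd

s = pd.Series(["abcab", "xyz", "ab"])
pattern = re.compile(r"ab")

# One record per match; rows without a match contribute nothing,
# which mirrors how str.extractall drops non-matching rows.
records = [
    (row, i, m.start(), m.end())
    for row, text in s.items()
    for i, m in enumerate(pattern.finditer(text))
]

spans = pd.DataFrame(
    records, columns=["row", "match", "start", "end"]
).set_index(["row", "match"])
```

Here row 0 yields two matches, row 1 yields none, and row 2 yields one, so the result has three rows in total.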

teese commented 10 years ago

Create .indices as another function? It's an interesting idea, but I'd have to admit I'm already confused with the .match, .extract, .contains functions that already exist.

Beginners learn to apply regex to single strings using the following syntax (from https://docs.python.org/3.4/library/re.html):

text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

As a beginner, I am happiest when the syntax in pandas matches the original syntax as closely as possible. The .extract function works great, but after looking at the discussion in #5075, I would probably have voted to keep the name .match, replace the legacy code with the new extract function, and change the output (group, bool, index, or a combination) based on various arguments.

Currently, when someone wants three things (the groups, the start index, and the end index), the only way to get them without repeating the regex search is to obtain the indices first and then apply some lambda functions to slice out the group. This is a very different process from what people are accustomed to when using the original re module.
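An alternative that stays closer to the re workflow is to keep the match objects themselves and derive everything from them afterwards. The following is a rough sketch with invented data and a shortened pattern, not a proposal for the pandas API.

```python
import re

import pandas as pd

s = pd.Series(["MPIMGSSV", "MPI-MGSS", "AAAA"])
pattern = re.compile(r"(P-*I-*M)")

# A single search per row; each entry is a match object or None.
matches = [pattern.search(text) for text in s]

# The group text and both offsets all come from the same match object,
# so the regex is never run a second time.
result = pd.DataFrame(
    {
        "group": [m.group(1) if m else None for m in matches],
        "start": [m.start() if m else None for m in matches],
        "end": [m.end() if m else None for m in matches],
    },
    index=s.index,
)
```

The third row has no match, so all three of its columns are missing.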

So in summary, in order of my preferences:
1) incorporate extract and proposed get-indices into str.match (to me the simplest for new users, but involves reopening an old discussion and worrying about backwards compatibility)
2) incorporate get-indices function into str.match, but leave the current default output as 'bool' (as planned)
3) create a new str.indices function

What're your thoughts concerning the first two options?

Regarding your second comment as to whether the match can return just (start,end) or a list of matches, I still have to sit down and think about that one :)

edumotya commented 3 years ago

This is my workaround for named groups:

import re
import pandas as pd

class SpanExtractor:
    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._groups = list(self._pattern.groupindex.keys())

    def __call__(self, x):
        """
        Utility function to extract the start and end indices.
        """
        m = self._pattern.search(x)
        if m:
            span_groups = {g: m.span(g) for g in self._groups}
        else:
            span_groups = {g: (float("nan"), float("nan")) for g in self._groups}
        return pd.Series(span_groups)

def _extract_spans(ds: pd.Series, pattern: str) -> pd.DataFrame:
    span_extractor = SpanExtractor(pattern)
    spans = ds.apply(span_extractor)
    spans = pd.concat(
        [
            pd.DataFrame(
                spans[col].to_list(),
                columns=["start_index_" + col, "end_index_" + col],
                index=spans.index,
            )
            for col in spans
        ],
        axis="columns",
    )
    return spans

spans = _extract_spans(df["text"], pattern)
spans["start_index_{your_named_group_1}"] 
spans["end_index_{your_named_group_1}"] 
spans["start_index_{your_named_group_2}"] 

vsocrates commented 1 year ago

Hi, I'd love to see a solution to this and it seems like fairly expected functionality, given the way that the re module works.

It looks like an edit to the _str_extract function here may be the start to a fix, but it seems like there would be issues with backwards compatibility or impacts on other functions.

if not expand:

    def g(x):
        m = regex.search(x)
        return m.groups()[0] if m else na_value

    return self._str_map(g, convert=False)

I'd be willing to take a stab at it if someone can provide me with some more direction (unless there are plans to implement this in a future release that I missed)?
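To make the idea concrete, here is a standalone sketch of what a span-returning counterpart to the g function above might look like. It is written against the public API only; extract_span is a hypothetical helper, not pandas internals and not an agreed design.

```python
import re

import pandas as pd


def extract_span(series, pat, flags=0):
    """Hypothetical helper: offsets of the first match per element."""
    regex = re.compile(pat, flags=flags)

    def g(x):
        # Same guard as in the quoted snippet, but return the offsets
        # instead of the captured text. Non-strings count as non-matches.
        m = regex.search(x) if isinstance(x, str) else None
        return m.span() if m else (None, None)

    return pd.DataFrame(
        [g(x) for x in series], columns=["start", "end"], index=series.index
    )


out = extract_span(pd.Series(["xxabc", None, "zzz"]), "abc")
```

Whether something like this belongs in str.extract, in str.match, or in a new accessor method is exactly the open question in this thread.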

vsocrates commented 1 year ago

Hi, following up on this! @mroeschke, wondering why the "Contributions Welcome" milestone was taken off and if this is still up for contributions, thanks!

GolAGitHub commented 1 year ago

I frequently wish I had access to regex match object methods when using str.extract/str.extractall. Is this still under consideration for a new release?

delucca commented 1 year ago

+1 for this