Open teese opened 10 years ago
can you provide a short but specific example of what exactly is needed/wanted here (make runnable as much as possible and indicate where syntax is needed)
I've added the code to the question as requested. Sorry it's not short, but contains some real data to help explain why it is necessary to obtain the regex start and end. The code should work in both python 2.7 and 3.4, and the latest pandas release (0.15.0). In my case, I will apply the above workaround to ~5000 dataframes, each containing ~5000 rows, with significantly longer sequences (~500 characters in each string).
so it actually sounds like you want a function like extract but returns the matched indices.
eg. df['query'].indices(regex_desired_region_from_query, outtype='list|frame')
subtle issue is whether the match can return just (start,end)
or a list of matches (not sure what that would look like)
Create .indices as another function?
It's an interesting idea, but I'd have to admit I'm already confused by the .match, .extract, and .contains functions that already exist.
Beginners learn to apply regex to single strings using the following syntax (from https://docs.python.org/3.4/library/re.html):
import re
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
As a beginner, I am happiest when the syntax in pandas matches the original syntax as closely as possible. The .extract function works great, but after looking at the discussion in #5075, I would probably have voted to keep the name .match, replace the legacy code with the new extract function, and change the output (group, bool, index, or a combination) based on various arguments.
Currently, someone may want three things: the groups, the start index, and the end index. The only way to get all three without repeating the regex search is to get the indices first and then apply some lambda functions to slice out the group. This is a very different process from what people are accustomed to with the original re module.
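For readers following along, the two-step workaround described above might look roughly like this. It is only a sketch: the example strings and the helper names first_span and slice_match are made up for illustration, and it uses nothing beyond re and the ordinary Series.apply / DataFrame.apply calls.

```python
import re

import pandas as pd

# Hypothetical example data; only the column name "query" comes from the thread.
df = pd.DataFrame(
    {"query": ["He was carefully disguised", "no adverbs here", "captured quickly"]}
)
pattern = re.compile(r"\w+ly")


def first_span(text):
    """Return (start, end) of the first match, or (NaN, NaN) when nothing matches."""
    m = pattern.search(text)
    return m.span() if m else (float("nan"), float("nan"))


# Step 1: a single regex search per row yields the indices.
spans = pd.DataFrame(
    df["query"].apply(first_span).tolist(),
    columns=["start", "end"],
    index=df.index,
)
df = df.join(spans)


# Step 2: slice the matched text out of the original string, with no second search.
def slice_match(row):
    if pd.isna(row["start"]):
        return float("nan")
    return row["query"][int(row["start"]):int(row["end"])]


df["match"] = df.apply(slice_match, axis=1)
print(df)
```

The regex runs once per row, but the second step is a row-wise Python loop, which is part of why a built-in method would be more convenient.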
So in summary, in order of my preferences:
1) incorporate extract and proposed get-indices into str.match (to me the simplest for new users, but involves reopening an old discussion and worrying about backwards compatibility)
2) incorporate get-indices function into str.match, but leave the current default output as 'bool' (as planned)
3) create a new str.indices function
What're your thoughts concerning the first two options?
Regarding your second comment as to whether the match can return just (start,end) or a list of matches, I still have to sit down and think about that one :)
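On the "list of matches" case, one possible shape is a long table with one row per match, similar in spirit to how extractall lays out its results. The sketch below is not an existing pandas method: all_spans and its column names are hypothetical, and it assumes every value in the Series is a string.

```python
import re

import pandas as pd


def all_spans(series, pattern):
    """Return one row per regex match, indexed by (original row, match number)."""
    regex = re.compile(pattern)
    records = [
        (idx, n, m.start(), m.end(), m.group(0))
        for idx, text in series.items()
        for n, m in enumerate(regex.finditer(text))
    ]
    out = pd.DataFrame(records, columns=["row", "match", "start", "end", "text"])
    return out.set_index(["row", "match"])


s = pd.Series(
    ["He was carefully disguised but captured quickly by police.", "nothing"]
)
result = all_spans(s, r"\w+ly")
print(result)
```

Rows without any match simply do not appear in the result, so the (start, end)-only case would be the first match of each row.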
This is my workaround for named groups:
import re
import pandas as pd
class SpanExtractor:
    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._groups = list(self._pattern.groupindex.keys())

    def __call__(self, x):
        """
        Utility function to extract the start and end indices.
        """
        m = self._pattern.search(x)
        if m:
            span_groups = {g: m.span(g) for g in self._groups}
        else:
            span_groups = {g: (float("nan"), float("nan")) for g in self._groups}
        return pd.Series(span_groups)


def _extract_spans(ds: pd.Series, pattern: str) -> pd.DataFrame:
    span_extractor = SpanExtractor(pattern)
    spans = ds.apply(span_extractor)
    spans = pd.concat(
        [
            pd.DataFrame(
                spans[col].to_list(),
                columns=["start_index_" + col, "end_index_" + col],
                index=spans.index,
            )
            for col in spans
        ],
        axis="columns",
    )
    return spans
spans = _extract_spans(df["text"], pattern)
spans["start_index_{your_named_group_1}"]
spans["end_index_{your_named_group_1}"]
spans["start_index_{your_named_group_2}"]
Hi, I'd love to see a solution to this and it seems like fairly expected functionality, given the way that the re module works.
It looks like an edit to the _str_extract function here may be the start to a fix, but it seems like there would be issues with backwards compatibility or impacts on other functions.
if not expand:

    def g(x):
        m = regex.search(x)
        return m.groups()[0] if m else na_value

    return self._str_map(g, convert=False)
I'd be willing to take a stab at it if someone can provide me with some more direction (unless there are plans to implement this in a future release that I missed).
Hi, following up on this! @mroeschke, wondering why the "Contributions Welcome" milestone was taken off and if this is still up for contributions, thanks!
I frequently wish I had access to regex match object methods when using str.extract/str.extractall. Is this still under consideration for a new release?
+1 for this
What about including a method to get the start and stop after a regex search of items in a DataFrame? Perhaps using .str.extract?
Returning the start as a new column would perhaps be as follows:
An alternative suggestion from jkitchen on StackOverflow was to use start_index = True, or end_index = True.
For multiple parameters (e.g. start and end) as outputs, there needs to be a way to avoid running the search twice. One solution would be to give the output as a tuple:
I don't use regex very often, so I don't know if there are other parameters that people want after a regex search. If there really is just the text in the groups, the start and the end, perhaps there's a way to put the output directly into new columns?
I think it makes sense that non-matches return a NaN, just as in the regular extract function. This would mix integer and float datatypes in the df['start'] column, but I guess we all know about that situation :)
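To make the dtype point concrete, here is a small sketch with made-up start values. The second half relies on the nullable "Int64" extension dtype, which, as far as I know, only exists in pandas releases much newer than the 0.15.0 mentioned in this thread, so treat it as an assumption about your installed version.

```python
import pandas as pd

# Hypothetical start indices; None stands for a row where the regex did not match.
starts = pd.Series([7, None, 9])
print(starts.dtype)  # the missing value forces the whole column to float

# With the nullable integer dtype, matched rows stay integers
# and the non-match is stored as a missing value.
nullable = pd.Series([7, None, 9], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```

Either representation keeps non-matches distinguishable from a real index of 0, which is the important part for slicing later.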
I'm not an experienced programmer, so sorry if I misunderstood some basic concepts.
Please see the question in StackOverflow for example code and comments: http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram
A block of example data and code is below, as requested by jreback.