pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add decimal parameter to to_numeric #4674

Open cancan101 opened 11 years ago

cancan101 commented 11 years ago

Create a class / method for number parsing. This would be a generalization of the code that remove the thousands separators from numbers (see: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1502).

Other items include:

  1. Alternative decimal point
  2. Using parenthesis for negative numbers
  3. Stripping currency character

I can submit PR for this. Do people have other suggestions for this?

jreback commented 11 years ago

there is a decimal parameter already, but I'm not sure it's documented in doc/source/io.rst (though it is in the docstring)... maybe add that doc?

cancan101 commented 11 years ago

The idea of factoring out this functionality is so that it can be used for parsing numbers in other locations without having to go through the TextParser.

jreback commented 11 years ago

why would you not simply use replace for that, which can do all of what you are talking about? (if it's not parser related)

jreback commented 11 years ago

http://pandas.pydata.org/pandas-docs/dev/missing_data.html#string-regular-expression-replacement
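The linked page describes regex-based replacement; as a quick illustrative sketch (with made-up data), currency symbols and thousands separators can be stripped via `Series.replace`:

```python
import pandas as pd

# Made-up example values; strip a dollar sign and thousands separators
# with regex-based Series.replace (the approach the linked docs describe).
s = pd.Series(["$1,234", "(5,678)", "910"])
cleaned = s.replace({r"\$": "", ",": ""}, regex=True)
print(cleaned.tolist())  # ['1234', '(5678)', '910']
```

Note the parentheses survive untouched, which is exactly the case raised in the next comment.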

cancan101 commented 11 years ago

How would you use replace for the parenthesis? Bear in mind that the value then needs to be negated.

cancan101 commented 11 years ago

I suppose I could replace the parenthesis with a leading negative sign and then convert that string to an int/ float.

cancan101 commented 11 years ago

That being said I was thinking of places where I wanted to be able to take an arbitrary string which might not be in a DataFrame / Series and parse a number.
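For that standalone use case, a minimal stdlib-only sketch of such a helper (`parse_number` is a hypothetical name, not a pandas API; it assumes a single leading currency symbol and accounting-style parentheses):

```python
import re

def parse_number(text, thousands=",", decimal="."):
    """Parse a formatted number string: strips a leading currency symbol
    and thousands separators, treats parentheses as negation, and honours
    an alternative decimal point. (Illustrative helper, not pandas API.)"""
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    s = re.sub(r"^[\$€£]", "", s)   # strip a leading currency symbol
    s = s.replace(thousands, "")    # drop thousands separators
    s = s.replace(decimal, ".")     # normalise the decimal point
    value = float(s)
    return -value if negative else value

print(parse_number("($1,234.56)"))                          # -1234.56
print(parse_number("1.234,5", thousands=".", decimal=","))  # 1234.5
```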

jreback commented 11 years ago

that would work; replace should infer the return types correctly

cancan101 commented 11 years ago

@jreback That addresses using replace for a DataFrame or Series but not in the general sense where we have a string to parse.

jreback commented 11 years ago

out of scope for pandas

cancan101 commented 11 years ago

Fair enough. I was just trying to factor out code that could be used both for pandas and for what is out of scope for pandas.

jreback commented 11 years ago

@cancan101 close this then? or do you want to try to do some of this in the actual parser itself? (which, depending on the API/scope, may or may not be useful). It's a small edge case, and if you did have parens/dollar signs in your numbers you can just replace after you read it in, because there is a perf impact in even checking for it in more general parsing.

cancan101 commented 11 years ago

I was going to / could write this such that there is no performance impact. I would instantiate a parser, based on the parsing options, at the following location: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1505 Theoretically there are a few more branches per FILE (not per cell or line).

jreback commented 11 years ago

Well, so you want to parse (value) -> -value and $numeric -> numeric?

this is JUST for PythonParser, right? (if so I am -1 on this); the c-parser is most often the default/used

cancan101 commented 11 years ago

What do you mean -1 on this? Like you don't like the change? Or you care less about the performance hit because this code path isn't the default?

cancan101 commented 11 years ago

And yes, what I was thinking was just the PythonParser

jreback commented 11 years ago

no, doing a change just for the PythonParser; the c-parser is the primary parser and getting them out of sync is IMHO a bad idea (as you can see by the recent fixes to put them BACK in sync). In reality the PythonParser is not even necessary (except for some back compat, I think), as the c-parser can parse virtually everything.

cancan101 commented 11 years ago

This does not actually have to change the API for the PythonParser. All I am suggesting is that the underlying python code for the PythonParser is factored out and usable elsewhere.

jreback commented 11 years ago

for what purpose?

cancan101 commented 11 years ago

There are cases in the HTML parser that I want to be able to use this functionality.

jtratner commented 11 years ago

Can you show an example of input where this is necessary?

jreback commented 11 years ago

@cancan101 I would not be averse to seeing a suite of converters, à la `io/date_converters`, e.g. functions to do some sort of conversions like that

cancan101 commented 11 years ago

Having a set of converters seems reasonable. Do you think it makes sense, then, for the PythonParser to use one of these converters (i.e. the one that handles the thousands separator)?

jreback commented 11 years ago

When I say converters, I really mean essentially a set of regular expressions/functions that you just pass to replace. You can certainly pass these via the converters argument if you REALLY want to. The thousands/decimal stuff seems more general purpose to me.

I mean you could argue that the parens/dollar signs are general purpose too, but I have never seen people actually write those to csv's (they are only used as formatting).
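That converter-suite idea could look something like the sketch below: named regex-to-replacement mappings fed to `replace` (the names `THOUSANDS`/`CURRENCY`/`PARENS` are hypothetical, not an actual pandas module):

```python
import pandas as pd

# A minimal sketch of the "suite of converters" idea: named
# regex -> replacement pairs that can be passed to DataFrame.replace.
# (Hypothetical helpers, not an actual pandas module.)
THOUSANDS = {r",(?=\d{3})": ""}   # "1,234" -> "1234"
CURRENCY = {r"[\$€£]": ""}        # "$12"   -> "12"
PARENS = {r"\((.+)\)": r"-\1"}    # "(9)"   -> "-9"

df = pd.DataFrame({"amount": ["$1,234", "(567)", "89"]})
for conv in (CURRENCY, THOUSANDS, PARENS):
    df["amount"] = df["amount"].replace(conv, regex=True)
df["amount"] = df["amount"].astype(float)
print(df["amount"].tolist())  # [1234.0, -567.0, 89.0]
```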

cancan101 commented 11 years ago

Far more often I see these in HTML tables, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI I was just hoping to centralize the number parsing logic for CSV and for HTML in one place. It looks like logic for parsing HTML will need to be richer.

jreback commented 11 years ago

HTML is a bit of a different beast. @cpcloud can comment on this when he is back, but the API is meant to be somewhat similar to read_csv, but the backend is completely different

read_csv is essentially line-oriented, while HTML parsing is not

cancan101 commented 11 years ago

@jreback We still have not really answered the question as to whether we should copy and paste the following code from parsers.py into a converter, or whether they should share the code:

    nonnum = re.compile('[^-^0-9^%s^.]+' % self.thousands)
    ...
    if (not isinstance(x, compat.string_types) or
            self.thousands not in x or
            nonnum.search(x.strip())):
        rl.append(x)
    else:
        rl.append(x.replace(self.thousands, ''))

jtratner commented 11 years ago

@cancan101 also, keep in mind that it's often much simpler to change things around after converting into a DataFrame as opposed to trying to do it at the same time you parse the data. So if you have something like '(9)' as a string, it could be faster to allow it to come out as a string dtype and then manipulate it into integers.
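A short sketch of that read-then-convert pattern (inline CSV with assumed data):

```python
import pandas as pd
from io import StringIO

# Read ambiguous values as plain strings first, then convert afterwards,
# rather than teaching the parser about every formatting convention.
csv = StringIO("amount\n(9)\n12\n(3)\n")
df = pd.read_csv(csv, dtype=str)
df["amount"] = (df["amount"]
                .replace({r"\((\d+)\)": r"-\1"}, regex=True)
                .astype(int))
print(df["amount"].tolist())  # [-9, 12, -3]
```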

cancan101 commented 11 years ago

@jtratner That makes sense. With that line of thinking, why even have this thousands parsing in any of the parsers? Why not deprecate that functionality / emulate it by loading strings into a DataFrame and then performing a replace, etc.? That would clean up (and speed up) the parser code, avoid issues like this, and make the various parsers more consistent.

jreback commented 11 years ago

@cancan101 it's all a trade-off: more complicated parsing code, but MUCH faster in-line conversions. Having thousands separators, e.g., is not so uncommon, and once you take them out you are left with a float.

Doing the conversions afterwards is much more memory intensive (and time intensive); you have to scan twice (or more).

Getting good/great performance is hard. Always solve the problem first, then optimize. That said, keep in mind that certain bottlenecks can be dealt with by more complicated code.

cancan101 commented 11 years ago

@jreback Fair enough. I was going to say there is something to be said for keeping the parsers about extracting structure from the input (columns, indexes, etc.) and leaving extracting meaning (i.e. number parsing) to post-processing. This would make writing new parsers simpler.

jreback commented 11 years ago

@cancan101 you have a fair point; in this case it was all about performance when @wesm wrote the new parsers late last year. I am not sure of any gains in a 'new' parser though. What 'new' parsers are needed?

cancan101 commented 11 years ago

At this point the only one on my horizon would be the XBRL parser (#4407), although I don't think that one will have issues with string data.

cancan101 commented 11 years ago

It also is not all about "new" parsers but about maintaining and improving old parsers. This does contradict @jtratner's previous point about keeping the APIs simple. Admittedly there might be a valid performance reason here to do so.

jreback commented 11 years ago

@cancan101

a lot of times people want different API's for things but the implementation can be a single one (with say options) to support these API's

the parsers are the reverse

there is basically a single API, with different implementations (e.g. most use the read_csv machinery, but excel puts a layer on top, and HTML is completely different, as are the Stata and JSON implementations for that matter)

cancan101 commented 11 years ago

Here the number of rows is small (<100), so for me, I can definitely go the "read as string and then apply more complicated parsing logic" without worrying about performance.

cancan101 commented 11 years ago

@jreback Another new parser would be the PDF parser #4556

jbrockmendel commented 6 years ago

Did a decision get reached on this?

jorisvandenbossche commented 6 years ago

I think it would be nice to add a decimal option to pandas.to_numeric

liverpool1026 commented 2 years ago

Following on to this:

should to_numeric support cases like a thousands separator? Or is that better handled outside to_numeric?

The issue I am having is that sometimes the data comes in as "1,200" instead of "1200" or 1200, so when I call to_numeric I would want "1,200" to be converted to 1200.
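Until to_numeric grows such an option, the usual workaround is to strip the separator first (sketch with assumed data):

```python
import pandas as pd

# "1,200" fails under plain to_numeric, so drop the separator first.
s = pd.Series(["1,200", "1200", 1200])
numeric = pd.to_numeric(s.astype(str).str.replace(",", "", regex=False))
print(numeric.tolist())  # [1200, 1200, 1200]
```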

AlexHodgson commented 1 year ago

take

AlexHodgson commented 1 year ago

I see that to_numeric() uses precise_xstrtod() from tokenizer.c to do the string-to-numeric conversion. This function already has functionality to handle different characters for decimal and thousands, but the normal separators are hardcoded in pd_parser.c where the call to precise_xstrtod() occurs. I could add parameters to to_numeric() for decimal and thousands separators, like read_csv() has, and then pass these down to the underlying converter?