Open cancan101 opened 11 years ago
there is a decimal
parameter already, but I not sure in doc/source/io.rst (though in doc string)....maybe add that doc?
The idea of factoring out this functionality is so that it can be used for parsing numbers in other locations without having to go through the TextParser
.
why would you not simply use replace
for that, which can do all of what you are talking about? (if its not parser related)
How would you use replace for the parenthesis? Bear in mind that the value then needs to be negated.
I suppose I could replace the parenthesis with a leading negative sign and then convert that string to an int/ float.
That being said I was thinking of places where I wanted to be able to take an arbitrary string which might not be in a DataFrame
/ Series
and parse a number.
that would work; replace should infer the return types correctly
@jreback That addresses using replace for a DataFrame or Series but not in the general sense where we have a string to parse.
out of scope for pandas
Fair enough. I was just trying to factor out code that could be used both for pandas and for what is out of scope for pandas.
@cancan101 close this then? or do you want to try to do some of this in the actual parse itself? (which depending on the API/scope may of may not be useful). As its a small edge case and if you did have parens/dollar signs in your numbers you can just replace after you read it in, because there is a perf impact in even checking for it on more general parsing
I was going to / could write this such that there is no performance impact. I would instantiate a parser at the following location: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1505 based on parsing options. Theoretically are a few more branches per FILE (not per cell or line).
Well, so you want to parse (value)
-> -value
and $numeric
-> numeric
?
this is JUST for PythonParser, right? (if so I am -1 on this); the c-parser is most often the default/used
What do you mean -1 on this? Like you don't like the change? Or you care less about the performance hit because this code path isn't the default?
And yes, what I was thinking was just the PythonParser
no, doing a change just for the PythonParser
; the c-parser is the primary parser and getting them out of sync is IMHO a bad idea (as you can see by the recent fixes to put them BACK in sync). In reality the PythonParser
is not even necessary (except for some back compat I think) (as the c-parser can parse virtually eveything)
This does not actually have to change the API for the PythonParser
. All I am suggesting is that the underlying python code for the PythonParser
is factored out and usable elsewhere.
for what purpose?
There are cases in the HTML parser that I want to be able to use this functionality.
Can you show an example of input where this is necessary?
@cancan101 I would not be averse to seeing a suite of converters
, ala`io/date_converters``, e.g. functions to do some sort of conversions like that
Having a set of converters seems reasonable. Do you think it makes sense then for the PythonParser
to use one of these converters (ie the one that handles the thousands separator)?
when I mean converters, I really mean essentially a set of regular expressions/functions that you just pass to replace
. You can certainly pass these via the converters
argument if you REALLY want to. The thousands/decimal
stuff seems more general purpose to me.
I mean you could argue that the parents/ignore dollar signs are general purpose too, but I have never seen people actually write those to csv's (and only use them as formatting).
Far more often I see these in HTML tables, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI I was just hoping to centralize the number parsing logic for CSV and for HTML in one place. It looks like logic for parsing HTML will need to be richer.
HTML
is a bit of a different beast. @cpcloud can comment on this when he is back, but the API is meant to be somewhat similar to read_csv
, but the backend is completely different
read_csv
is eseentially line-oriented, why HTML parsing is not
@jreback We still have not really answered the question as to whether we should copy and paste the following code from parsers.py
into a converter, or whether they should share the code:
nonnum = re.compile('[^-^0-9^%s^.]+' % self.thousands)
...
if (not isinstance(x, compat.string_types) or
self.thousands not in x or
nonnum.search(x.strip())):
rl.append(x)
else:
rl.append(x.replace(self.thousands, ''))
@cancan101 also, keep in mind that it's often much simpler to change things around after converting into a DataFrame
as opposed to trying to do it at the same time you parse the data. So if you have something like '(9)'
as a string, it could be faster to allow it to come out as a string dtype and then manipulate it into integers.
@jtratner That makes sense. With that line of thinking, why even have this thousands
parsing in any of the parsers? Why not deprecate that functionality / emulate it by loading string into a DataFrame
and then performing a replace, etc? That would clean up (and speed up) the parser code, avoid issue like this, and make the various parsers more consistent?
@cancan101 its all a trade-off. More complicated parsing code by MUCH faster for in-line conversions, (e.g. having thousands separators) is not so uncommon, and once you take them out you are left with a float.
Doing after conversions is much more memory intensive (and time intensive), you have to scan twice (or more).
Getting good/great performance is hard. Always solve the problem first, then optimize. That said it is keeping in mind that certain bottlenecks can be dealt with by more complicated code.
@jreback Fair enough. I was going to say there is there is something to be said about keeping the parsers about extracting structure from the info (column, indexes, etc) and leaving extracting meaning (i.e. number parsing) to the post processing. This would make writing new parsers simpler.
@cancan101 you have a fair point; in this case it was all about performance when @wesm wrote the new parsers late last year. I am not sure of any gains in anew 'new' parsers thought. What 'new' parsers are needed?
At this point the only one on my horizon would be the XBRL parser (#4407), although I don't think that one will have issues with string data.
It also is not all about "new" parsers but maintaining and improving old parsers. This does contract jtratner's previous points about keeping the APIs simple. Admittedly there might be a valid performance reason here to do so.
@cancan101
a lot of times people want different API's for things but the implementation can be a single one (with say options) to support these API's
the parsers are the reverse
there is bascially a single API, with different implementations (e.g. most use read_csv machinery, but excel puts a layer on top, and HTML is completely different, as is the stata and JSON implementations for that matter)
Here the number of rows is small (<100), so for me, I can definitely go the "read as string and then apply more complicated parsing logic" without worrying about performance.
@jreback Another new parser would be the PDF parser #4556
Did a decision get reached on this?
I think it would be nice to add a decimal
option to pandas.to_numeric
following onto this.
should to_numeric support cases like thousand seperator? Or that is better handled outside to_numeric.
The issue I am having is sometimes the data comes in as "1,200" instead of "1200" or 1200.
So when I call to_numeric
When I would want "1,200" to be coverted to 1200 when I call to_numeric.
take
I see that to_numeric() uses precise_xstrtod() from tokenizer.c to do the string to numeric conversion, this function already has functionality to handle different characters for decimal and thousands, but the normal separators are hardcoded in pd_parser.c where the call to precise_xstrtod() occurs. I could add parameters to to_numeric() for decimal and thousand separators like read_csv() has, and then pass these down to the underlying converter?
Create a class / method for number parsing. This would be a generalization of the code that remove the thousands separators from numbers (see: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1502).
Other items include:
I can submit PR for this. Do people have other suggestions for this?