Open adamist521 opened 9 years ago
Thanks. Nice to add `dtypes` like `read_csv`. I just saw a little, but it looks to be achieved by passing `dtype` to `TextParser` -> `TextFileReader`.
PR is welcome:)
Just stumbled across this page with the same issue. @gte620v can you explain how to accomplish the raw html parsing given your PR? Thanks!
Should be something like this: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715
Just use a converter to convert to `str`.
@stevenmanton ^
@gte620v thanks for the info. It sounds like you can easily convert back to string, but can't prevent the automatic parsing in the first place. For example, keeping the leading zeros in an integer. Thanks again!
@stevenmanton No, it does not convert back to string; it prevents the value from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use `converters={'col': str}`. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit):
Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})
If you try that example, you will see that the leading zeros are preserved.
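For anyone who wants to check this without fetching the Wikipedia page, here is a minimal sketch using a small inline table. The HTML snippet and its values are made up for illustration, and it assumes an HTML parser backend such as `lxml` is installed for `read_html`.

```python
from io import StringIO

import pandas as pd

# Made-up table for illustration; only the MNC column has leading zeros.
html = """
<table>
  <tr><th>MCC</th><th>MNC</th></tr>
  <tr><td>276</td><td>01</td></tr>
  <tr><td>276</td><td>02</td></tr>
</table>
"""

# Default behaviour: numeric-looking columns are cast to numbers.
default = pd.read_html(StringIO(html), header=0)[0]

# With a converter, the raw cell text is kept as a string.
as_text = pd.read_html(StringIO(html), header=0, converters={"MNC": str})[0]

print(default["MNC"].tolist())
print(as_text["MNC"].tolist())
```

The converter is keyed by the column name, so `header=0` (or a header row the parser can detect) is needed for the `'MNC'` key to match.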
As @jorisvandenbossche said, the converter does what you want. I made the PR to solve this exact problem.
Thanks for the clarification, guys. When I saw "converter", I assumed it was parsing back to string from the inferred type. I'll use this fix :-)
Should we have "dtypes" be an alias for "converters", to match the pd.read_csv argument?
Yes, I think we should add a `dtype` argument (not sure it should be an alias; it might be possible to just pass `dtype` through to the underlying parser, now that the python parser supports it: https://github.com/pandas-dev/pandas/pull/14295).
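For reference, this is the `pd.read_csv` behaviour that a `dtype` argument on `read_html` would mirror. It is only a sketch of the existing `read_csv` keyword with made-up data, not of any `read_html` implementation.

```python
from io import StringIO

import pandas as pd

# Made-up CSV with a column that has leading zeros.
csv_text = "MCC,MNC\n276,01\n276,02\n"

# dtype tells the parser to keep MNC as text instead of inferring integers.
df = pd.read_csv(StringIO(csv_text), dtype={"MNC": str})

print(df["MNC"].tolist())
```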
@adrivsh Want to do a PR for this?
I tried using your solution:
import pandas as pd
pd.read_html('https://www.gpw.pl/wskazniki',converters={'C/WK': str},header=0)[1]
But it removes the "," from the column values.
@tuhinsharma121 That seems like a bug (the returned values are strings, but indeed should not remove the ","). Could you open a new issue for that?
Same problem here. Looks like it tries to parse the numbers before converting them to strings.
A workaround is to pass `thousands="ª", decimal="ª"` (or any other character not in the text).
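A minimal sketch of that workaround on an inline table, assuming `lxml` (or another parser backend) is installed. The table contents are made up. This sketch sets only `thousands` to an unused character and leaves `decimal` at its default; whether passing the same character for both is accepted may depend on the pandas version.

```python
from io import StringIO

import pandas as pd

# Made-up table; C/WK uses a comma as the decimal mark.
html = """
<table>
  <tr><th>Name</th><th>C/WK</th></tr>
  <tr><td>AAA</td><td>1,23</td></tr>
  <tr><td>BBB</td><td>0,95</td></tr>
</table>
"""

# "ª" never appears in the data, so no separator stripping can touch the commas.
df = pd.read_html(
    StringIO(html),
    header=0,
    thousands="ª",
    converters={"C/WK": str},
)[0]

print(df["C/WK"].tolist())
```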
Any solution to the "," problem?
Use converters
converters = {
'col1': str,
'col2': str,
}
df = pd.read_html(str(table), converters=converters)
Can I work on a PR for this?
When an HTML table shows `01` in a cell, `read_html` reads it, interprets it as a float, and removes the `0` of `01`. Are there options to read such values as strings?