pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

No way to force read numerics as string in `read_html` #10534

Open adamist521 opened 9 years ago

adamist521 commented 9 years ago

When an HTML table shows 01 in a cell, read_html interprets it as a float and drops the leading 0. Is there an option to read such cells as strings?

sinhrks commented 9 years ago

Thanks. It would be nice to add a dtype option like the one in read_csv. I only took a quick look, but it looks like this can be achieved by passing dtype through to TextParser -> TextFileReader.

A PR is welcome :)

stevenmanton commented 7 years ago

Just stumbled across this page with the same issue. @gte620v can you explain how to accomplish the raw html parsing given your PR? Thanks!

gte620v commented 7 years ago

Should be something like this: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

Just use a converter to convert to str.

@stevenmanton ^

stevenmanton commented 7 years ago

@gte620v thanks for the info. It sounds like you can easily convert back to string, but can't prevent the automatic parsing in the first place. For example, keeping the leading zeros in an integer. Thanks again!

jorisvandenbossche commented 7 years ago

@stevenmanton No, it does not convert back to string; it prevents the value from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use converters={'col': str}. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit to

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})

If you try that example, you will see that the leading zeros are preserved.
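
For anyone who wants to check this without fetching the Wikipedia page, here is a minimal offline sketch of the same idea. The inline table and the variable names are made up for illustration, and it assumes an HTML parser backend such as lxml is installed for read_html.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia table: MNC values carry leading zeros.
html = """
<table>
  <tr><th>MCC</th><th>MNC</th></tr>
  <tr><td>276</td><td>01</td></tr>
  <tr><td>276</td><td>02</td></tr>
</table>
"""

# Default behaviour: the MNC column is inferred as numeric, so "01" becomes 1.
default = pd.read_html(StringIO(html), header=0)[0]

# With a converter, the raw cell text is handed to str() and kept as-is.
as_text = pd.read_html(StringIO(html), header=0, converters={"MNC": str})[0]

print(default["MNC"].tolist())
print(as_text["MNC"].tolist())
```

The converter is keyed by the column label, so header=0 (or an inferred header row) is needed for 'MNC' to match.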

gte620v commented 7 years ago

As @jorisvandenbossche said, the converter does what you want. I made the PR to solve this exact problem.

stevenmanton commented 7 years ago

Thanks for the clarification, guys. When I saw "converter", I assumed it was converting back to a string from the inferred type. I'll use this fix :-)

adrivsh commented 7 years ago

Should we have "dtypes" be an alias for "converters", to match pd.read_csv argument ?

jorisvandenbossche commented 7 years ago

Yes, I think we should add a dtype argument. I'm not sure it should be an alias; it might be possible to just pass dtype through to the underlying parser, now that the python parser supports it: https://github.com/pandas-dev/pandas/pull/14295. @adrivsh Want to do a PR for this?
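
For comparison, this is roughly how the existing dtype argument of read_csv behaves. The CSV content is invented for illustration, and the sketch only shows read_csv; it does not assume that read_html accepts the same argument.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV with the same kind of data as the HTML example.
csv_text = "MCC,MNC\n276,01\n276,02\n"

# dtype tells the parser not to infer a numeric type for this column,
# so the leading zeros survive.
df = pd.read_csv(StringIO(csv_text), dtype={"MNC": str})

print(df["MNC"].tolist())
```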

tuhinsharma121 commented 4 years ago

I tried using your solution:

import pandas as pd
pd.read_html('https://www.gpw.pl/wskazniki', converters={'C/WK': str}, header=0)[1]

But it removes the "," from the column values.

jorisvandenbossche commented 4 years ago

@tuhinsharma121 That seems like a bug (the returned values are strings, but indeed should not remove the ","). Could you open a new issue for that?

jbsilva commented 4 years ago

Same problem here. Looks like it tries to parse the numbers before converting them to strings. A workaround is to pass thousands="ª", decimal="ª" (or any other character not in text).
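
A small sketch of that workaround on an inline table. The table contents are made up for illustration, and only thousands is overridden here, which is enough when the unwanted character is the default thousands separator ",". It assumes an HTML parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table using "," as a decimal mark, as on the page above.
html = """
<table>
  <tr><th>Name</th><th>C/WK</th></tr>
  <tr><td>AAA</td><td>1,23</td></tr>
  <tr><td>BBB</td><td>0,45</td></tr>
</table>
"""

# Point the thousands separator at a character that never occurs in the data,
# so the parser has no reason to strip "," before the converter runs.
df = pd.read_html(
    StringIO(html),
    header=0,
    converters={"C/WK": str},
    thousands="ª",
)[0]

print(df["C/WK"].tolist())
```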

mominali12 commented 3 years ago

Any solution to the "," problem?

clehene commented 1 year ago

Use converters:

converters = {
    'col1': str,
    'col2': str,
}
# read_html returns a list of DataFrames, one per matching table
dfs = pd.read_html(str(table), converters=converters)

tuhinsharma121 commented 6 months ago

Can I work on a PR for this?