pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add parameter to read_html() that disables the _remove_whitespace() function #59827

Open dstone42 opened 3 weeks ago

dstone42 commented 3 weeks ago

Feature Type

Problem Description

I have an HTML file that lists question names from a surveillance survey and shows how they have changed over the years; I use it to merge multiple files whose question names differ slightly from year to year. I need to read this file with the text exactly as it appears in the HTML so that the question names map exactly to the ones I extract from a PDF. The _remove_whitespace() function replaces every run of extra whitespace with a single space, but some of these names contain mistakes such as an accidental double space, and I need that text to match exactly so that I can properly clean the other files in the dataset.

Feature Description

Add a new parameter to the read_html() function that can disable the _remove_whitespace() function.
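
As a rough illustration of the request, here is a minimal sketch of an opt-out switch around a whitespace-collapsing step. The helper name clean_cell_text, the parameter name collapse_whitespace, and the regex are all assumptions made for this example; they are not the pandas implementation or a proposed signature.

```python
import re

# Assumption: a stand-in for the whitespace normalization discussed in this issue.
_WHITESPACE_RUN = re.compile(r"[\r\n]+|\s{2,}")


def clean_cell_text(text, collapse_whitespace=True):
    """Hypothetical helper: optionally collapse whitespace runs in cell text."""
    if not collapse_whitespace:
        # Opt-out path: return the text exactly as it was parsed.
        return text
    # Default path: trim the ends and squeeze each run down to one space.
    return _WHITESPACE_RUN.sub(" ", text.strip())


raw = "eCigUse:  Electronic Cigarette Use"  # two spaces after the colon
default_result = clean_cell_text(raw)
verbatim_result = clean_cell_text(raw, collapse_whitespace=False)
print(repr(default_result))
print(repr(verbatim_result))
```

With the flag left at its default the double space is reduced to one; with it switched off the string comes back unchanged.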

Alternative Solutions

The file I am using was originally Markdown, which I converted to HTML because I didn't think there was a way to read Markdown into pandas. I have since found a way, so instead of converting to HTML and reading that, I now read straight from the Markdown file. However, if my original file were HTML, I would probably build a similar solution: drop down to read_table() and apply the cleaning I want manually.

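import pandas as pd

def read_md(filename):
    # Parse the pipe-delimited Markdown table; drop the all-empty edge
    # columns and the |---| separator row that follows the header.
    table = pd.read_table(filename, sep='|').dropna(axis=1, how='all').iloc[1:]
    # Trim only leading/trailing padding; whitespace inside the text is kept.
    table.columns = table.columns.str.strip()
    for col in table.columns:
        table[col] = table[col].str.strip()
    return table.reset_index(drop=True)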
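
For the HTML case mentioned above, a comparable workaround can be sketched with only the Python standard library. This is a minimal sketch, assuming a single, non-nested table whose cells are all properly closed; the names VerbatimTableParser and read_html_verbatim are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class VerbatimTableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell without altering whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep the text exactly as written, including repeated spaces.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_html_verbatim(markup):
    """Hypothetical helper: return table rows as lists of unmodified strings."""
    parser = VerbatimTableParser()
    parser.feed(markup)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Question</th><th>6</th></tr>"
    "<tr><td>Electronic Cigarette Use</td>"
    "<td>eCigUse:  Electronic Cigarette Use</td></tr></table>"
)
rows = read_html_verbatim(sample)
print(repr(rows[1][1]))
```

The resulting rows can then be handed to pd.DataFrame(rows[1:], columns=rows[0]) if a data frame is needed.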

Additional Context

No response

bruhnugget-nice commented 3 weeks ago

Where can we find the read_html function? Also, may I take this?

rhshadrach commented 3 weeks ago

Thanks for the report. Can you supply an HTML table that reproduces the issue you are experiencing?

dstone42 commented 3 weeks ago

@rhshadrach Yes, here is an example. Unfortunately, HTML files are apparently not supported for upload, so I changed the file extension to txt; you should be able to change it back to html without any issues: HINTS Changes.txt

For some reason, when I read this in using read_html() it reads each column as its own dataframe and puts them in a list, but they are easily merged using pd.concat(). Once they're merged, see row 124 and column 6 (label, not index). In the html file it is 'eCigUse:  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)', but it is read into the data frame as 'eCigUse: Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)'. You can see immediately after the colon, there are two spaces in the html document which are reduced to one space in the data frame.

I see how this is normally beneficial, but I just would like the option to turn it off in this specific case.

@bruhnugget-nice I'm not sure what your second question is, but here is the link to the read_html function: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

bruhnugget-nice commented 2 weeks ago

Hey, sorry for replying so late. I have been looking through the read_html function, and from my perspective it seems quite complex; I just don't understand what you are trying to do. Look at the file: https://github.com/pandas-dev/pandas/blob/2419343bfea5dba678146139ca9663d831c47b22/pandas/io/html.py#L1010

rhshadrach commented 2 weeks ago

Yes, here is an example.

Thanks for the example. When writing bug reports, minimal examples are appreciated. The file you've provided has lots of unnecessary JavaScript that can be stripped out. See https://matthewrocklin.com/minimal-bug-reports for more.

For some reason, when I read this in using read_html() it reads each column as its own dataframe and puts them in a list, but they are easily merged using pd.concat().

I cannot reproduce:

result = pd.read_html("HINTS.Changes.html")
print(result[0].columns)
# Index(['Question', '6', '5 Cycle 4', '5 Cycle 3', '5 Cycle 2', '5 Cycle 1'], dtype='object')

You can see immediately after the colon, there are two spaces in the html document which are reduced to one space in the data frame.

I believe this is standard for HTML: multiple spaces are treated as a single space. All modern browsers will take two or more spaces and display them as one. When spaces are important, there are various options such as &nbsp; or <pre>.

It seems to me having a browser display the table one way and having pandas read it another way is undesirable.
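
To make the &nbsp; option concrete, here is a minimal sketch showing that the entity decodes to U+00A0, a character distinct from an ordinary space. The collapse step is an assumption for illustration, a generic regex built on \s and not code taken from pandas. Because Python's \s also matches U+00A0, whether a non-breaking space survives depends on how the normalization is written.

```python
import html
import re

# "&nbsp;" decodes to U+00A0 (non-breaking space), not to an ordinary " ".
cell = html.unescape("eCigUse:&nbsp; Electronic Cigarette Use")
has_nbsp = "\u00a0" in cell

# Assumption: a generic collapse step built on \s, for illustration only.
# In Python str patterns, \s matches U+00A0 as well as the ASCII space.
collapsed = re.sub(r"\s{2,}", " ", cell)

print(has_nbsp)
print(repr(collapsed))
```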

dstone42 commented 2 weeks ago

@rhshadrach Sorry for my more complex example. I didn't think of that. I will attach a minimal example with this comment.

Test.txt

I'm not a web programmer, so I have not had much interaction with HTML. However, in the experience I do have, browsers do not treat double spaces as single spaces. Try opening the test file I have provided in a browser. You will see that it preserves the double spaces in the text. Sorry if it's hard to tell. For me, the spaces render at the end of each line, but you can see they're both there if you highlight the text. If you get a different result, note that I am testing this on Windows with Chrome.

Regardless of how browsers normally interpret it, I have a specific case where I would like a parameter that lets me turn this functionality off. I don't think you need to change the default, because the current behavior definitely seems like the right default. On the other hand, it seems unlikely that this edge case will ever come up again for another user. I do think it would be very easy to add this parameter, but I don't know what it's like to maintain a huge, very popular Python package, so I would not be upset if this is not important enough to be changed. I already have a functional workaround, so it truly won't be a big deal if this doesn't happen.

bruhnugget-nice commented 1 week ago

Alright, I'll try to help but am a bit confused.

[image] Is this what you are supposed to see?

rhshadrach commented 1 week ago

@dstone42: If you strip out all the Javascript / CSS, you do not get multiple spaces.

<!DOCTYPE html> 
<html>
<head><title>Test</title></head>
<body>
<table>
<thead>
<tr>
<th dir="ltr">Question</th>
<th dir="auto">6</th>
<th dir="ltr">5 Cycle 4</th>
<th dir="ltr">5 Cycle 3</th>
<th dir="ltr">5 Cycle 2</th>
<th dir="ltr">5 Cycle 1</th>
</tr>
</thead>
<tbody>
<tr>
<td dir="ltr">Electronic Cigarette Use (Composite of UsedECigEver and UseECigNow)</td>
<td dir="ltr">eCigUse:  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse.  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse.  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse.  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse.  Electronic Cigarette Use (Derived from ElectCigLessHarm, UsedECigEver, and UseECigNow; see History Document for more information)</td>
</tr>
</tbody>
</table>
</body>
</html>

From the first column:

eCigUse: Electronic Cigarette Use

That said, such an argument does seem like it could be generally beneficial for certain use cases. I would be supportive of such an argument if the implementation is light-weight.

Further investigations are welcome!