pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: read_xml handling of bad lines #59384

Open davetapley opened 3 months ago

davetapley commented 3 months ago

Feature Type

Problem Description

I'd like read_xml to be able to skip non-parseable lines.

E.g.

With:

<gage_rain id="" last_rpt="-999 -999" min_10="-999" min_30="-999" hour_1="-999" hour_3="-999" hour_6="-999" day_1="-999" day_3="-999" day_7="-999" day_30="-999" ytd="-999" null="-999" name="" lat=" -999" long="--999 " updated="2024-07-31 19:40:00" m1="-999" m2="-999" m3="-999" m4="-999" m5="-999" m6="-999" m7="-999" m8="-999" m9="-999" m10="-999" m11="-999" m12="-999"/>
<gage_rain id="470" last_rpt="2024-07-31 11:58:03" min_10="0.00" min_30="0.00" hour_1="0.00" hour_3="0.00" hour_6="0.00" day_1="0.00" day_3="0.00" day_7="0.67" day_30="1.93" ytd="12.25" null="-999" name="Lee Butte Precipitation" lat="34.83403" long="-111.53714" updated="2024-07-31 19:40:00" m1="1.93" m2="0.00" m3="1.45" m4="2.95" m5="1.54" m6="1.97" m7="0.86" m8="0.87" m9="0.00" m10="0.71" m11="2.87" m12="2.44"/>

If I:

import pandas as pd

dtype = {'id': str, 'lat': pd.Float32Dtype, 'long': pd.Float32Dtype}
df = pd.read_xml('fcdyc_alert_rain.xml', dtype=dtype)

I get:

  File "lib.pyx", line 2391, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "--999 

Feature Description

https://github.com/pandas-dev/pandas/issues/15122 but for read_xml

Alternative Solutions

read_xml with no dtype kwarg, and manually manipulate the DataFrame afterwards.
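
A minimal sketch of that workaround, under a few assumptions: the XML is a trimmed-down stand-in for the real file, the `<data>` root element is made up, and `parser="etree"` is used only to avoid depending on lxml. Values that cannot be parsed become missing rather than raising.

```python
from io import StringIO

import pandas as pd

# Trimmed-down stand-in for fcdyc_alert_rain.xml; the <data> root is an assumption.
xml = """<data>
  <gage_rain id="" lat=" -999" long="--999 "/>
  <gage_rain id="470" lat="34.83403" long="-111.53714"/>
</data>"""

# No dtype kwarg: a column that cannot be inferred as numeric stays as strings.
df = pd.read_xml(StringIO(xml), parser="etree")

# Convert afterwards; unparseable values such as "--999 " are coerced to missing.
for col in ["lat", "long"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Float32")

print(df[["lat", "long"]])
```

This keeps every row and marks the bad value as missing, which is not quite the same as skipping the bad line outright.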

Additional Context

No response

rhshadrach commented 3 months ago

Thanks for the request. I'm open to the addition of an errors argument as in read_csv, provided the implementation is straightforward (I haven't checked). If this adds anything more than negligible complexity to the algorithm, however, I think we should cautiously reevaluate it.

jahn96 commented 3 months ago

take

jahn96 commented 2 months ago

@davetapley One clarifying question: read_csv has an option to specify what to do when it encounters a bad line, but there a bad line means a line with too many fields, not a line with a non-parseable value (see the documentation). Could you clarify what your expectation is? Also, could you try your example again? I couldn't reproduce your issue with the same error. Thanks!
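
For reference, a small sketch of the read_csv behaviour in question, assuming pandas 1.3 or later (where `on_bad_lines` is available). The inline CSV is made up for illustration: the line with an extra field is skipped, while the line with a non-numeric value is kept.

```python
from io import StringIO

import pandas as pd

# Made-up data: line "3,4,5" has too many fields, line "6,x" has a non-numeric value.
csv = "a,b\n1,2\n3,4,5\n6,x\n"

# on_bad_lines="skip" drops only the line with too many fields.
df = pd.read_csv(StringIO(csv), on_bad_lines="skip")

print(df)
```

The "6,x" row survives, and column b is read as strings, so `on_bad_lines` by itself does not cover value conversion errors.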

jahn96 commented 2 months ago

@davetapley Also, this seems more like an issue with the data than with the XML parser itself, since --999 can't be parsed as a float. Is your request to have custom error handling for these data conversion errors?