df.convert_objects removed from pandas

philipstarkey commented 5 years ago

Original report (archived issue) by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

pandas 0.25 has dropped DataFrame.convert_objects(), resulting in an exception from the server when getting the dataframe using lyse.data().

AttributeError: 'DataFrame' object has no attribute 'convert_objects'

Discussion about the deprecation and removal here:

https://github.com/pandas-dev/pandas/issues/11221

As a reminder, we're using this function to convert columns of the dataframe from Python objects into numpy/pandas dtypes where possible, which makes the dataframe faster to pickle and send over the wire.

We'll need to decide what to do. The performance reason for calling convert_objects() may no longer matter as much, since other relevant components may have become faster. It is still a semantic change, though, to return dataframes where the numpy arrays pulled out of them are of dtype object containing Python floats instead of dtype float as expected.
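To make the semantic difference concrete, here is a minimal sketch. It does not use lyse at all; the values and variable names are made up for illustration. It only shows what user code sees when it pulls an array out of an object-dtype column versus a float column.

```python
import numpy as np
import pandas as pd

values = [0.1, 0.2, 0.3]  # made-up data, standing in for analysis results

# The same numbers stored two ways: as generic Python objects, and as floats.
as_objects = pd.Series(values, dtype=object)
as_floats = pd.Series(values, dtype=float)

obj_array = as_objects.to_numpy()
float_array = as_floats.to_numpy()

# The object array holds references to plain Python floats,
# while the float array holds packed 64-bit values.
print(obj_array.dtype, type(obj_array[0]))
print(float_array.dtype, type(float_array[0]))
```

Analysis code that assumes a float array would get the first kind of array if no conversion were applied to object columns.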

It seems like the alternatives to convert_objects may require explicitly saying the type of each column, which would be super annoying. But I'll look into it and see if we can replicate the current behaviour using the alternatives.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

After simply removing the convert_objects call and looking at some dataframes, it looks like the dtypes of the columns are pretty much what you would expect: they are specific datatypes and not all of type object. So convert_objects is not doing much work. The columns are already floats, ints and datetimes when appropriate, and are only object when there is genuinely mixed data.

Behaviour is slightly imperfect since you can accidentally make a column a mixed dtype by saving an analysis result that is a different datatype than all other shots, and then changing and re-running analysis (or removing the shot) such that the column contains all the same datatype again, but the column will still remain dtype object as if the data were mixed.
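The sticky dtype can be reproduced outside lyse with plain pandas. This is a sketch only: the column name and values are invented, and the explicit astype(object) stands in for whatever lyse does when it meets an incompatible value.

```python
import pandas as pd

# A column of analysis results that starts out as ordinary floats.
df = pd.DataFrame({"result": [1.0, 2.0, 3.0]})

# One shot saves a string instead of a float, so the column is widened
# to dtype object to hold it (done explicitly here).
df["result"] = df["result"].astype(object)
df.loc[1, "result"] = "oops"

# Analysis is fixed and re-run: every value is a float again...
df.loc[1, "result"] = 2.0

# ...but the column still reports dtype object.
stuck = df["result"].dtype

# Re-inferring the column recovers the specific dtype.
recovered = df["result"].infer_objects().dtype
print(stuck, recovered)
```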

The lyse update_row() method is already converting a column to dtype object when it gets a datatype incompatible with the current datatype of that column. So some code could be added to check when the dtype of an element changes and check whether the column can be converted back to a specific datatype. But this would involve looping over the whole dataframe or at least whole columns, and I'm hesitant to do it.

I think we should just remove the call to convert_objects and see how it goes.

See PR #70

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

Ah, actually looks like there is a better way, infer_objects:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.infer_objects.html

This is not deprecated and does what we want. Pull request updated.
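For reference, a small sketch of what infer_objects does on a dataframe whose columns are all dtype object. The column names and contents are hypothetical and are not taken from lyse; the actual change is in the pull request.

```python
import pandas as pd

# Hypothetical dataframe built entirely from Python objects.
df = pd.DataFrame(
    {
        "n_atoms": [1.5, 2.5, 3.5],
        "shot_index": [1, 2, 3],
        "mixed": [1.0, "failed", None],
    },
    dtype=object,
)

# infer_objects() takes no per-column type arguments: it soft-converts each
# object column to a more specific dtype where the contents allow it.
converted = df.infer_objects()

print(df.dtypes)
print(converted.dtypes)
```

The float and integer columns come back with numeric dtypes, while the genuinely mixed column stays dtype object, so there is no need to spell out the type of each column.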

labscript-suite-temp-2 / lyse

df.convert_objects removed from pandas #52