pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Enhancement of xarray.Dataset.from_dataframe #9015

Open loco-philippe opened 5 months ago

loco-philippe commented 5 months ago

Is your feature request related to a problem?

The current xarray.Dataset.from_dataframe method converts the DataFrame columns that correspond to non-dimension (non-index) coordinates into data variables, as explained in the user guide.

This behaviour is not optimal: the round trip does not recover the structure of the initial Dataset, and broadcasting those columns across all dimensions inflates the result (88 B becomes 152 B in the example below).

The user-guide example is below:

In [1]: ds = xr.Dataset(
              {"foo": (("x", "y"), np.random.randn(2, 3))},
              coords={
                  "x": [10, 20],
                  "y": ["a", "b", "c"],
                  "along_x": ("x", np.random.randn(2)),
                  "scalar": 123,
              },
         )
         ds
Out[1]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

In [2]: df = ds.to_dataframe()
        xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
    along_x  (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
    scalar   (x, y) int32 24B 123 123 123 123 123 123


Describe the solution you'd like

If we analyse the relationships between the columns, we can distinguish between data variables, dimension coordinates and non-dimension coordinates.

In the example above, the round-trip conversion via the npd accessor also returns the original dataset:

In [3]: df.npd.to_xarray()
Out[3]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
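
To make the target concrete, here is a minimal sketch of what such a structure-preserving conversion amounts to with plain pandas/xarray, assuming the role of each column is already known. This is not the npd implementation; the helper name `from_dataframe_structured` and its arguments are invented for illustration.

```python
import xarray as xr


def from_dataframe_structured(df, dims, coords_1d, scalars):
    """Rebuild a Dataset from a flat DataFrame without broadcasting coordinates.

    dims:      index levels kept as dimension coordinates, e.g. ["x", "y"]
    coords_1d: non-dimension coordinates mapped to the single dim they vary
               along, e.g. {"along_x": "x"}
    scalars:   columns that are constant over the whole table, e.g. ["scalar"]
    """
    flat = df.reset_index()
    ds = xr.Dataset.from_dataframe(flat.set_index(dims))

    # Replace the broadcast versions of 1-D coordinates with true 1-D arrays.
    for name, dim in coords_1d.items():
        values = flat.groupby(dim)[name].first().reindex(ds[dim].values)
        ds[name] = (dim, values.to_numpy())
        ds = ds.set_coords(name)

    # Constant columns become scalar coordinates instead of broadcast variables.
    for name in scalars:
        ds[name] = flat[name].iloc[0]
        ds = ds.set_coords(name)

    return ds
```

With the example above, `from_dataframe_structured(df, ["x", "y"], {"along_x": "x"}, ["scalar"])` should reproduce the dataset of Out[1], which is what the npd accessor does automatically by inferring these roles.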


Describe alternatives you've considered

Three options are available for an efficient converter.

It seems to me that option 3 is complex; options 1 and 2 are both feasible.

Additional context

The analysis (from the tab_analysis package) applied to the example above gives the results below:

In [4]: analys = df.reset_index().npd.analysis(distr=True)
        analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found

In [5]: analys.field_partition() # use the first partition : ['x', 'y']
Out[5]: 
{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['foo']}

In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
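
For readers without tab_analysis, a rough plain-pandas approximation of this relationship analysis could look like the sketch below, which produces a mapping in the spirit of relation_partition() (column -> minimal set of index levels it depends on). The function name `column_dependencies` is invented and this is not the tab_analysis algorithm.

```python
import pandas as pd


def column_dependencies(df: pd.DataFrame) -> dict:
    """Map each column to the index levels it is functionally dependent on."""
    flat = df.reset_index()
    dims = list(df.index.names)
    deps = {d: [d] for d in dims}  # each index level depends on itself

    for col in df.columns:
        if flat[col].nunique(dropna=False) == 1:
            deps[col] = []  # constant over the whole table, like 'scalar'
            continue
        needed = []
        for d in dims:
            others = [x for x in dims if x != d]
            # d is needed if, after dropping it, the column is no longer
            # constant within each group of the remaining index levels.
            if not others or not (flat.groupby(others)[col].nunique() == 1).all():
                needed.append(d)
        deps[col] = needed
    return deps
```

Applied to the example DataFrame, this should give {'x': ['x'], 'y': ['y'], 'foo': ['x', 'y'], 'along_x': ['x'], 'scalar': []}, the same content as Out[6].
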
max-sixty commented 5 months ago

This looks very cool!

I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe method...

loco-philippe commented 5 months ago

@max-sixty

Thank you, Maximilian, for your quick response!

Yes, it's a good idea. Do you need any additional information for this?

By the way, I'm looking to see whether another theory of tabular structure analysis exists (see the presentation), but I can't find any references. Do you have any contacts or references on the subject?

max-sixty commented 5 months ago

Yes, it's a good idea. Do you need any additional information for this?

This would be a PR you could make to the docs!

loco-philippe commented 5 months ago

OK, that's perfect!

I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.

Can you confirm that it is not necessary to create a development environment?

max-sixty commented 5 months ago

Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!