pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Enhancement of xarray.Dataset.from_dataframe #9015

Open loco-philippe opened 5 months ago

loco-philippe commented 5 months ago

Is your feature request related to a problem?

The current xarray.Dataset.from_dataframe method converts the DataFrame columns that correspond to non-dimension (non-index) coordinates into data variables, as explained in the user guide.

This behaviour is not optimal: the round trip does not recover the structure of the initial Dataset, and broadcasting those columns across all dimensions inflates the result (88 B becomes 152 B in the example below).

The user-guide example is below:

In [1]: ds = xr.Dataset(
              {"foo": (("x", "y"), np.random.randn(2, 3))},
              coords={
                  "x": [10, 20],
                  "y": ["a", "b", "c"],
                  "along_x": ("x", np.random.randn(2)),
                  "scalar": 123,
              },
         )
         ds
Out[1]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

In [2]: df = ds.to_dataframe()
        xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
    along_x  (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
    scalar   (x, y) int32 24B 123 123 123 123 123 123


Describe the solution you'd like

If we analyse the relationships between the columns, we can distinguish between data variables, dimension coordinates and non-dimension coordinates.

In the example above, the round-trip conversion via the npd accessor also returns the original dataset:

In [3]: df.npd.to_xarray()
Out[3]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
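
To make the target concrete, here is a minimal sketch of what such a structure-preserving conversion amounts to with plain pandas/xarray, assuming the role of each column is already known. This is not the npd implementation; the helper name `from_dataframe_structured` and its arguments are invented for illustration.

```python
import xarray as xr


def from_dataframe_structured(df, dims, coords_1d, scalars):
    """Rebuild a Dataset from a flat DataFrame without broadcasting coordinates.

    dims:      index levels kept as dimension coordinates, e.g. ["x", "y"]
    coords_1d: non-dimension coordinates mapped to the single dim they vary
               along, e.g. {"along_x": "x"}
    scalars:   columns that are constant over the whole table, e.g. ["scalar"]
    """
    flat = df.reset_index()
    ds = xr.Dataset.from_dataframe(flat.set_index(dims))

    # Replace the broadcast versions of 1-D coordinates with true 1-D arrays.
    for name, dim in coords_1d.items():
        values = flat.groupby(dim)[name].first().reindex(ds[dim].values)
        ds[name] = (dim, values.to_numpy())
        ds = ds.set_coords(name)

    # Constant columns become scalar coordinates instead of broadcast variables.
    for name in scalars:
        ds[name] = flat[name].iloc[0]
        ds = ds.set_coords(name)

    return ds
```

With the example above, `from_dataframe_structured(df, ["x", "y"], {"along_x": "x"}, ["scalar"])` should reproduce the dataset of Out[1], which is what the npd accessor does automatically by inferring these roles.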


Describe alternatives you've considered

Three options are available for an efficient converter.

It seems to me that option 3 is complex; options 1 and 2 are both feasible.

Additional context

The analysis (from the tab_analysis package) applied to the example above gives the results below:

In [4]: analys = df.reset_index().npd.analysis(distr=True)
        analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found

In [5]: analys.field_partition() # use the first partition : ['x', 'y']
Out[5]: 
{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['foo']}

In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
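
For readers without tab_analysis, a rough plain-pandas approximation of this relationship analysis could look like the sketch below, which produces a mapping in the spirit of relation_partition() (column -> minimal set of index levels it depends on). The function name `column_dependencies` is invented and this is not the tab_analysis algorithm.

```python
import pandas as pd


def column_dependencies(df: pd.DataFrame) -> dict:
    """Map each column to the index levels it is functionally dependent on."""
    flat = df.reset_index()
    dims = list(df.index.names)
    deps = {d: [d] for d in dims}  # each index level depends on itself

    for col in df.columns:
        if flat[col].nunique(dropna=False) == 1:
            deps[col] = []  # constant over the whole table, like 'scalar'
            continue
        needed = []
        for d in dims:
            others = [x for x in dims if x != d]
            # d is needed if, after dropping it, the column is no longer
            # constant within each group of the remaining index levels.
            if not others or not (flat.groupby(others)[col].nunique() == 1).all():
                needed.append(d)
        deps[col] = needed
    return deps
```

Applied to the example DataFrame, this should give {'x': ['x'], 'y': ['y'], 'foo': ['x', 'y'], 'along_x': ['x'], 'scalar': []}, the same content as Out[6].
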
max-sixty commented 5 months ago

This looks very cool!

I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe method...

loco-philippe commented 5 months ago

@max-sixty

Thank you, Maximilian, for your quick response!

Yes, it's a good idea. Do you need any additional information for this?

By the way, I'm looking to see whether another theory of tabular structure analysis exists (see the presentation), but I can't find any references. Do you have any contacts or references on the subject?

max-sixty commented 5 months ago

Yes, it's a good idea. Do you need any additional information for this?

This would be a PR you could make to the docs!

loco-philippe commented 5 months ago

OK, that's perfect!

I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.

Can you confirm that it is not necessary to create a development environment?

max-sixty commented 5 months ago

Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!