Open loco-philippe opened 5 months ago
This looks very cool!
I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe
method...
@max-sixty
Thank you, Maximilian, for your quick response!
Yes, it's a good idea. Do you need any additional information for this?
By the way, I'm trying to find out whether another theory of tabular structure analysis (see presentation) exists, but I can't find any references. Do you have any contacts or references on that?
> Yes, it's a good idea. Do you need any additional information for this?
This would be a PR you could make to the docs!
OK, that's perfect!
I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.
Can you confirm that it is not necessary to create a development environment?
> Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!
Is your feature request related to a problem?
The current `xarray.Dataset.from_dataframe` method converts DataFrame columns corresponding to non-index coordinates into variables, as explained in the user-guide. This solution is not optimal because it does not recover the structure of the initial data. It also creates large datasets.
The user-guide example is below:
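The original code block is not reproduced in this thread. As a stand-in, here is a minimal sketch of the behaviour described above, loosely modelled on the user-guide dataset. The values are made up for illustration, and it assumes `numpy` and `xarray` are installed.

```python
import numpy as np
import xarray as xr

# A small dataset with one 2-D variable, two dimension coordinates,
# one non-dimension coordinate along x, and one scalar coordinate.
# All values are illustrative, not taken from the user guide.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.1, 0.2]),
        "scalar": 123,
    },
)

# Round trip through pandas: every non-index column comes back
# as a data variable defined on all dimensions.
df = ds.to_dataframe()
roundtrip = xr.Dataset.from_dataframe(df)

print(ds["along_x"].dims)          # dims of the coordinate before the round trip
print(roundtrip["along_x"].dims)   # dims of the same column after the round trip
print(sorted(roundtrip.data_vars))
```

After the round trip, `along_x` and `scalar` are no longer coordinates: they are data variables broadcast over both `x` and `y`, which is why the structure is lost and the dataset grows.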
Describe the solution you'd like
If we analyse the relationships between columns, we can distinguish between variables, dimension coordinates and non-dimension coordinates.
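To make the idea concrete, here is a simplified sketch of such a relationship analysis in plain Python. It is not the `tab_analysis` API: the helper names (`is_unique`, `is_derived`, `is_crossed`) and the toy table are assumptions made for illustration only.

```python
from itertools import product

# Toy table in long form, one row per (x, y) pair. Values are made up.
rows = [
    {"x": 10, "y": "a", "along_x": 0.1, "scalar": 123, "foo": 1.0},
    {"x": 10, "y": "b", "along_x": 0.1, "scalar": 123, "foo": 2.0},
    {"x": 10, "y": "c", "along_x": 0.1, "scalar": 123, "foo": 3.0},
    {"x": 20, "y": "a", "along_x": 0.2, "scalar": 123, "foo": 4.0},
    {"x": 20, "y": "b", "along_x": 0.2, "scalar": 123, "foo": 5.0},
    {"x": 20, "y": "c", "along_x": 0.2, "scalar": 123, "foo": 6.0},
]


def column(name):
    """Return the values of one column as a list."""
    return [row[name] for row in rows]


def is_unique(name):
    """True if the column holds a single repeated value."""
    return len(set(column(name))) == 1


def is_derived(child, parent):
    """True if each value of `parent` maps to exactly one value of `child`."""
    mapping = {}
    for p, c in zip(column(parent), column(child)):
        if mapping.setdefault(p, c) != c:
            return False
    return True


def is_crossed(a, b):
    """True if every combination of values of `a` and `b` occurs exactly once."""
    pairs = list(zip(column(a), column(b)))
    expected = set(product(set(column(a)), set(column(b))))
    return len(pairs) == len(expected) and set(pairs) == expected


print("x and y crossed:", is_crossed("x", "y"))
print("along_x derived from x:", is_derived("along_x", "x"))
print("along_x derived from y:", is_derived("along_x", "y"))
print("foo derived from x:", is_derived("foo", "x"))
print("scalar unique:", is_unique("scalar"))
```

In this toy table, `x` and `y` are crossed, so they are candidates for dimensions; `along_x` depends only on `x`, so it can be a non-dimension coordinate on `x`; `scalar` is constant; and `foo` depends on both dimensions, so it is a variable.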
In the example above, the round-trip conversion with `npd` also returns the same dataset:

Note: `npd` is the ntv_pandas package (present in the pandas ecosystem). This package is capable of converting complex DataFrames (see examples).

Describe alternatives you've considered
Three options are available to get an efficient converter:

1. keep `xarray.Dataset.from_dataframe` as it is and use the `npd` third-party solution to have an optimized converter,
2. use the `analysis` package to find dims, coordinates and variables, then modify the `xarray.Dataset.from_dataframe` method to generate a dataset,
3. integrate the `analysis` functions in the `xarray.Dataset.from_dataframe` method.

It seems to me that option 3 is complex. Options 1 and 2 are both feasible.
Additional context
The `analysis` (package tab_analysis) applied to the example above gives the results below: