sustainableaviation / EcoPyLot

🍃🛩️ Prospective environmental and economic life cycle assessment of aircraft made blazing fast
http://ecopylot.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Investigate (Sparse) XArray Sizes depending on Dimensions #8

Closed: michaelweinold closed this issue 8 months ago

michaelweinold commented 8 months ago

Try:

- 4 dimensions (e.g. parameter, time, size, prop)
- 8 dimensions (e.g. parameter, time, size, prop1, prop2, prop3, prop4, prop5)

and determine the size (in memory) and the performance when e.g. slicing the array.
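A minimal sketch of such a comparison, assuming the `sparse` package backs the xarray.DataArray (dimension names and sizes here are illustrative, not the model's actual ones):

```python
import time

import numpy as np
import sparse
import xarray as xr

def build_sparse_array(dim_sizes: dict) -> xr.DataArray:
    # Randomly filled, sparse-backed DataArray with the given dimensions.
    data = sparse.random(tuple(dim_sizes.values()), density=0.001)
    coords = {name: np.arange(n) for name, n in dim_sizes.items()}
    return xr.DataArray(data, coords=coords, dims=list(dim_sizes))

for dims in (
    {"parameter": 100, "time": 6, "size": 9, "prop": 8},             # 4 dimensions
    {"parameter": 100, "time": 6, "size": 9, "prop1": 8, "prop2": 8,
     "prop3": 8, "prop4": 8, "prop5": 8},                            # 8 dimensions
):
    da = build_sparse_array(dims)
    start = time.perf_counter()
    da.sel(time=3)  # example slicing operation
    elapsed = time.perf_counter() - start
    print(f"{len(dims)} dims: {da.data.nbytes / 1e6:.2f} MB in memory, "
          f"slicing took {elapsed * 1e3:.3f} ms")
```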

See also:

michaelweinold commented 8 months ago

@iamsiddhantsahu, please investigate the following by tomorrow (23.01.2024):

Currently, carculator ingests data like this:

The input data is formatted in JSON (cf. the default_parameters.json):

"7-2000-converter mass": {
        "name": "converter mass",
        "year": 2000,
        "powertrain": [
            "PHEV-c-d",
            "PHEV-e",
            "BEV",
            "FCEV",
            "PHEV-c-p"
        ],
        "sizes": [
            "Mini",
            "Small",
            "Lower medium",
            "Medium",
            "Large",
            "Van",
            "Medium SUV",
            "Large SUV",
            "Micro"
        ],
        "amount": 4.5,
        "loc": 4.5,
        "minimum": 4,
        "maximum": 6,
        "kind": "distribution",
        "uncertainty_type": 5,
        "category": "Powertrain",
        "source": "Del Duce et al (2016)",
        "comment": ""
    },

Of course, we could also represent this in tabular format:

| name | year | powertrain | sizes | amount | ... |
|------|------|------------|-------|--------|-----|
| converter mass | 2000 | PHEV-c-d | Mini | 4.5 | ... |
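For illustration, a rough sketch (not carculator's actual ingestion code) of how the JSON parameters could be flattened into that tabular form, with one row per (powertrain, size) combination:

```python
import json

import pandas as pd

with open("default_parameters.json") as f:
    params = json.load(f)

rows = [
    {
        "name": entry["name"],
        "year": entry["year"],
        "powertrain": powertrain,
        "sizes": size,
        "amount": entry["amount"],
    }
    for entry in params.values()
    for powertrain in entry["powertrain"]
    for size in entry["sizes"]
]

df = pd.DataFrame(rows)
print(df.head())
```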

When the class VehicleInputParameters(NamedParameters) is instantiated, this data is converted into an xarray.DataArray with the dimensions:

  * size        (size) <U12 'Large' 'Large SUV' 'Lower medium' ... 'Small' 'Van'
  * powertrain  (powertrain) <U8 'BEV' 'FCEV' 'HEV-d' ... 'PHEV-e' 'PHEV-p'
  * parameter   (parameter) <U64 '1-Pentene direct emissions, rural' ... 'tra...
  * year        (year) int64 2000 2010 2020 2030 2040 2050
  * value       (value) int64 0

(where value is relevant to stochastic calculations only).

This array is then used further on in the model library to perform calculations.

From what I can see, data is extracted from the DataArray using slicing methods.

...so why can't we do this with a Pandas DataFrame?
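As a minimal sketch of that question (column names and values are illustrative), the same selection can be expressed either with xarray's label-based .sel() or with a pandas MultiIndex cross-section:

```python
import pandas as pd

df = pd.DataFrame({
    "parameter":  ["converter mass"] * 4,
    "year":       [2000, 2000, 2010, 2010],
    "powertrain": ["BEV", "FCEV", "BEV", "FCEV"],
    "size":       ["Mini"] * 4,
    "value":      [4.5, 4.7, 4.2, 4.4],
})
indexed = df.set_index(["parameter", "year", "powertrain", "size"])

# pandas: all rows for BEV in year 2000, via a cross-section on the MultiIndex
pd_slice = indexed.xs(("BEV", 2000), level=("powertrain", "year"))

# xarray: the same selection with .sel() on named dimensions
# (pandas delegates to_xarray() to the installed xarray package)
da = indexed["value"].to_xarray()
xr_slice = da.sel(powertrain="BEV", year=2000)
```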

iamsiddhantsahu commented 8 months ago

I wrote a script b218c47 to test the memory usage and slicing times comparing Pandas and XArray with sparse arrays.
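For context, a simplified sketch of the kind of measurement such a script might make (the actual code is in the commit referenced above); it assumes pandas' sparse column dtype on one side and a sparse-backed DataArray on the other:

```python
import numpy as np
import pandas as pd
import sparse
import xarray as xr

rng = np.random.default_rng(0)
dense = rng.choice([0.0, 1.0], size=(1000, 100), p=[0.99, 0.01])  # ~1% non-zero

# pandas with sparse column dtypes
sdf = pd.DataFrame(dense).astype(pd.SparseDtype("float", fill_value=0.0))

# xarray backed by a sparse COO array
da = xr.DataArray(sparse.COO.from_numpy(dense))

print("pandas:", sdf.memory_usage(deep=True).sum() / 1e6, "MB")
print("xarray:", da.data.nbytes / 1e6, "MB")
```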

Here is the plot:

[pandas-vs-xarray: memory usage and slicing time, Pandas vs. XArray]

Findings:

  1. Memory -- Pandas is more memory-efficient for sparse arrays than XArray, possibly because of the built-in SparseDataFrame class.
  2. Slicing time -- XArray and Pandas seem to have roughly the same slicing time.

michaelweinold commented 8 months ago

Dataset size is a poor metric. The question was primarily about the number of dimensions, so you should have made that explicit in your plot. Comparing dataset size alone is not helpful.

Also, you still have to think about the utility of using an xr.DataArray vs. a pd.DataFrame. Is there any operation (e.g. slicing) that we could not do with a DataFrame?

iamsiddhantsahu commented 8 months ago

The pandas.DataFrame.to_xarray() function that you are using here converts a pandas.DataFrame object to an xarray.Dataset object. The xarray.Dataset object is a multi-dimensional, in-memory array database: a dict-like container of labeled arrays (DataArray objects) with aligned dimensions.

:warning: The important point here is that it is designed as an in-memory representation of the data model -- this might be what is causing the memory issues.
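A minimal illustration of that conversion:

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2], "y": [1.5, 2.5, 3.5]})

ds = df.to_xarray()  # -> xarray.Dataset, one DataArray per column
print(type(ds))      # <class 'xarray.core.dataset.Dataset'>
print(ds.nbytes)     # the whole thing is materialised in memory
```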

iamsiddhantsahu commented 8 months ago

Here is a bar plot comparing the memory usage of Pandas's pandas.DataFrame object and XArray's xarray.Dataset object.

Dataset 1 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 1)
Dataset 2 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 2)
Dataset 3 = create_sample_dataframe(df_size = 500, num_propulsion_classifications = 2)
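Purely as an illustration of what such a helper could look like (the actual create_sample_dataframe is defined in the script from the commit above and may differ):

```python
import numpy as np
import pandas as pd

def create_sample_dataframe(df_size: int, num_propulsion_classifications: int) -> pd.DataFrame:
    # Hypothetical sketch: a random table with a configurable number
    # of propulsion classification columns.
    rng = np.random.default_rng(42)
    data = {
        "parameter": rng.integers(0, 100, size=df_size),
        "year": rng.choice([2000, 2010, 2020], size=df_size),
        "amount": rng.random(df_size),
    }
    for i in range(num_propulsion_classifications):
        data[f"prop{i + 1}"] = rng.choice(["BEV", "FCEV", "PHEV"], size=df_size)
    return pd.DataFrame(data)
```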

Findings: