**Closed** — issue closed by michaelweinold 8 months ago
@iamsiddhantsahu, please investigate the following by tomorrow (23.01.2024):
Currently, `carculator` ingests data like this. The input data is formatted as JSON (cf. `default_parameters.json`):
```json
"7-2000-converter mass": {
    "name": "converter mass",
    "year": 2000,
    "powertrain": [
        "PHEV-c-d",
        "PHEV-e",
        "BEV",
        "FCEV",
        "PHEV-c-p"
    ],
    "sizes": [
        "Mini",
        "Small",
        "Lower medium",
        "Medium",
        "Large",
        "Van",
        "Medium SUV",
        "Large SUV",
        "Micro"
    ],
    "amount": 4.5,
    "loc": 4.5,
    "minimum": 4,
    "maximum": 6,
    "kind": "distribution",
    "uncertainty_type": 5,
    "category": "Powertrain",
    "source": "Del Duce et al (2016)",
    "comment": ""
},
```
Of course, we could also represent this in tabular format:

| name | year | powertrain | sizes | amount | ... |
|---|---|---|---|---|---|
| converter mass | 2000 | PHEV-c-d | Mini | 4.5 | ... |
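To make the correspondence concrete, the nested JSON entry can be flattened into such a table by expanding every `(powertrain, size)` combination into its own row. A minimal sketch (the `raw` dict below is a shortened, hypothetical excerpt of `default_parameters.json`, not the real file):

```python
import itertools
import pandas as pd

# Hypothetical excerpt mirroring one entry of default_parameters.json
raw = {
    "7-2000-converter mass": {
        "name": "converter mass",
        "year": 2000,
        "powertrain": ["PHEV-c-d", "PHEV-e", "BEV", "FCEV", "PHEV-c-p"],
        "sizes": ["Mini", "Small"],
        "amount": 4.5,
    }
}

# Expand each (powertrain, size) combination into one row of a long table
rows = [
    {"name": p["name"], "year": p["year"], "powertrain": pt,
     "size": sz, "amount": p["amount"]}
    for p in raw.values()
    for pt, sz in itertools.product(p["powertrain"], p["sizes"])
]
df = pd.DataFrame(rows)
print(df.head())
```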
During instantiation of the class `VehicleInputParameters(NamedParameters)`, this data is converted into an `xarray.DataArray` with dimensions:
```
* size        (size)        <U12  'Large' 'Large SUV' 'Lower medium' ... 'Small' 'Van'
* powertrain  (powertrain)  <U8   'BEV' 'FCEV' 'HEV-d' ... 'PHEV-e' 'PHEV-p'
* parameter   (parameter)   <U64  '1-Pentene direct emissions, rural' ... 'tra...
* year        (year)        int64 2000 2010 2020 2030 2040 2050
* value       (value)       int64 0
```
(where `value` is relevant only to stochastic calculations).
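A toy array with the same dimension names illustrates the structure. This is a sketch with made-up coordinate values, not the array `carculator` actually builds:

```python
import numpy as np
import xarray as xr

# Dummy coordinates reusing the dimension names described above
sizes = ["Mini", "Small"]
powertrains = ["BEV", "FCEV"]
parameters = ["converter mass"]
years = [2000, 2010]

arr = xr.DataArray(
    np.zeros((len(sizes), len(powertrains), len(parameters), len(years), 1)),
    dims=["size", "powertrain", "parameter", "year", "value"],
    coords={"size": sizes, "powertrain": powertrains,
            "parameter": parameters, "year": years, "value": [0]},
)

# Label-based assignment and selection
arr.loc[dict(size="Mini", powertrain="BEV",
             parameter="converter mass", year=2000, value=0)] = 4.5
print(arr.sel(size="Mini", powertrain="BEV", year=2000).values)
```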
This array is then used further on in the model library to perform calculations. From what I can see, data is extracted from the `DataArray` using slicing methods.
...so why can't we do this with a Pandas DataFrame?
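For illustration: a `pandas.DataFrame` with a `MultiIndex` supports comparable label-based selection. This is a minimal sketch with made-up index levels and values, not `carculator`'s actual data:

```python
import pandas as pd

# Long-format DataFrame indexed the same way the DataArray is labelled
df = pd.DataFrame(
    {"amount": [4.5, 5.0, 4.8, 5.2]},
    index=pd.MultiIndex.from_tuples(
        [("BEV", "Mini", 2000), ("BEV", "Mini", 2010),
         ("FCEV", "Small", 2000), ("FCEV", "Small", 2010)],
        names=["powertrain", "size", "year"],
    ),
)

# Select one level's label, analogous to DataArray.sel(powertrain="BEV")
bev = df.xs("BEV", level="powertrain")
print(bev)

# Selection across multiple levels with pd.IndexSlice
idx = pd.IndexSlice
subset = df.loc[idx["BEV", :, 2000], :]
```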
I wrote a script (b218c47) to test the memory usage and slicing times, comparing pandas and xarray with sparse arrays.
Here is the plot.
Findings: `dataset size` is a poor metric. The question was primarily about the number of dimensions, so you should have made that explicit in your plot. Comparing dataset size alone is not helpful.
Also, you still have to think about the utility of an `xr.DataArray` vs. a `pd.DataFrame`. Is there any operation (e.g. slicing) that we could not do with a DataFrame?
The `pandas.DataFrame.to_xarray()` function you are using here converts a `pandas.DataFrame` object into an `xarray.Dataset` object. The `xarray.Dataset` object is a multi-dimensional, in-memory array database: a dict-like container of labeled arrays (`DataArray` objects) with aligned dimensions.

:warning: The important note here is that it is designed as an in-memory representation of the data model, and this might be causing the memory issues.
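A small demo of why the conversion can blow up memory: `to_xarray()` densifies the index into a full cube, so every combination of coordinates gets a cell even if the original long-format table was sparse. A sketch with made-up values:

```python
import pandas as pd

# Sparse long-format table: 3 rows over a 2x2 (powertrain, year) grid
df = pd.DataFrame(
    {"amount": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [("BEV", 2000), ("BEV", 2010), ("FCEV", 2000)],
        names=["powertrain", "year"],
    ),
)

# to_xarray() fills the missing (FCEV, 2010) cell with NaN
ds = df.to_xarray()

print(df.memory_usage(deep=True).sum(), "bytes in pandas")
print(ds["amount"].nbytes, "bytes in xarray")
```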
Here is a bar plot comparing the memory usage of the `pandas.DataFrame` object and the `xarray.Dataset` object:

- Dataset 1 = `create_sample_dataframe(df_size=100, num_propulsion_classifications=1)`
- Dataset 2 = `create_sample_dataframe(df_size=100, num_propulsion_classifications=2)`
- Dataset 3 = `create_sample_dataframe(df_size=500, num_propulsion_classifications=2)`
Findings:
Try:

- 4 dimensions (e.g. `parameter`, `time`, `size`, `prop`)
- 8 dimensions (e.g. `parameter`, `time`, `size`, `prop1`, `prop2`, `prop3`, `prop4`, `prop5`)

and determine size (in memory) and performance when e.g. slicing the array.
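A rough benchmark along these lines could look like the sketch below. The `benchmark` helper is hypothetical (not part of the repository) and only measures the pandas side; the xarray side would be measured analogously via `df.to_xarray()` and `nbytes`:

```python
import time
import numpy as np
import pandas as pd

def benchmark(n_dims, n_per_dim=4):
    """Return (memory in bytes, seconds for one label-based slice)
    for a dense MultiIndex cube with n_dims index levels."""
    levels = [[f"d{i}_{j}" for j in range(n_per_dim)] for i in range(n_dims)]
    index = pd.MultiIndex.from_product(
        levels, names=[f"dim{i}" for i in range(n_dims)]
    )
    df = pd.DataFrame({"value": np.random.rand(len(index))}, index=index)

    t0 = time.perf_counter()
    df.xs(levels[0][0], level="dim0")  # slice on the first dimension
    elapsed = time.perf_counter() - t0
    return df.memory_usage(deep=True).sum(), elapsed

for dims in (4, 8):
    mem, t = benchmark(dims)
    print(f"{dims} dims: {mem} bytes, slice in {t:.6f} s")
```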
See also: