signavio / sap-sam

Example source code for SAP Signavio Academic Models (SAP-SAM)
Apache License 2.0

Running analysis on data subsets #66

Open dubmix opened 4 months ago

dubmix commented 4 months ago

Noticed some unexpected behaviour in the code after trying to run the analysis on a very small subset of the data (~25 models).

dubmix commented 4 months ago

In the modelling notation part, this line was causing an issue: df_meta_selected = df_meta_selected.groupby('namespace').resample('Y').sum(numeric_only=True).reset_index()

[Image: Screenshot 2024-02-16 at 15 58 04 (2)]

As the screenshot above shows, in the particular case where the count is 0 for a specific date, a duplicate namespace column filled with NaN values is added to the dataframe.

To work around this issue, I added the min_count option to the sum() call and then filled the NaNs with 0.
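A minimal sketch of the min_count workaround described above, not the exact change made in the repository. The toy dataframe, the `count` column and the `date` index name are hypothetical stand-ins for df_meta_selected. Selecting the single numeric column before resampling is my own simplification. The frequency alias `"YS"` (year start) replaces `'Y'` only because newer pandas releases deprecate `'Y'`; the yearly sums are the same, only the bin labels differ.

```python
import pandas as pd

# Hypothetical toy data standing in for df_meta_selected.
df_meta_selected = pd.DataFrame(
    {
        "namespace": ["bpmn", "bpmn", "dmn"],
        "count": [1, 2, 1],
    },
    index=pd.DatetimeIndex(
        ["2020-03-01", "2022-05-01", "2021-07-01"], name="date"
    ),
)

yearly = (
    df_meta_selected.groupby("namespace")["count"]
    .resample("YS")      # one bin per calendar year, per namespace
    .sum(min_count=1)    # years with no rows become NaN instead of 0
    .fillna(0)           # then turn those NaNs into explicit zeros
    .reset_index()
)

print(yearly)
```

Here the year 2021 has no rows for the `bpmn` namespace, so it first becomes NaN and is then filled with 0.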

dubmix commented 4 months ago

In the element types section, the small number of models used for the analysis raised another issue. After the data processing, Seaborn appears to interpret the dataframe as having its original number of rows rather than the number actually remaining. This raises an error of the form: AttributeError: 'NoneType' object has no attribute 'get_bbox'

This part of the output gives us a hint: The palette list has fewer values (18) than needed (27) and will cycle, which may produce an uninterpretable plot.

18 containers are expected but the list has 27 containers, hence the error.

The current workaround is to drop hue="category" when the number of models is below an arbitrary threshold. This particular issue seems to apply only to small data subsets. A potential fix could be updating pandas to v2.
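A rough sketch of the threshold workaround mentioned above, assuming the plot is drawn by passing keyword arguments to a Seaborn function. The helper name `barplot_kwargs`, the threshold value `MIN_MODELS_FOR_HUE`, and the `element_type` and `count` column names are all hypothetical; only hue="category" comes from the thread.

```python
import pandas as pd

# Hypothetical cut-off; the thread only says the threshold is arbitrary.
MIN_MODELS_FOR_HUE = 50


def barplot_kwargs(df: pd.DataFrame, n_models: int) -> dict:
    """Build plotting keyword arguments, adding hue only for large enough subsets."""
    kwargs = {"data": df, "x": "element_type", "y": "count"}
    if n_models >= MIN_MODELS_FOR_HUE:
        kwargs["hue"] = "category"
    return kwargs


# Hypothetical toy data for illustration only.
df_elements = pd.DataFrame(
    {
        "element_type": ["Task", "Gateway"],
        "count": [12, 3],
        "category": ["activity", "gateway"],
    }
)

small_kwargs = barplot_kwargs(df_elements, n_models=25)   # hue omitted
large_kwargs = barplot_kwargs(df_elements, n_models=500)  # hue kept
```

The resulting dictionary could then be unpacked into the plotting call, for example sns.barplot(**small_kwargs) with Seaborn imported as sns.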