manual benchmark to crunch datatypes

encrypted-soul commented 3 years ago

Brief Description

Contains a .ipynb notebook with a function to crunch datatypes. I was able to crunch a DataFrame from 565.28 MB down to 349.77 MB, around 61.875% of the original size. The compression is lossless.

Edit @erwanp : file visible here https://github.com/radis/radis-benchmark/blob/e512ecfcdd6d399f10d2fca19d9b0ab6ae475904/manual_benchmarks/test_datatypes_crunch.ipynb
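For readers who want to reproduce size figures like the ones above, here is a minimal sketch of how a DataFrame's footprint can be measured before and after a dtype change. It is not the notebook's code: the `size_mb` helper and the column `a` are hypothetical, and only standard pandas calls are used.

```python
import numpy as np
import pandas as pd


def size_mb(df: pd.DataFrame) -> float:
    """Memory footprint of a DataFrame in MB (index included)."""
    return df.memory_usage(deep=True).sum() / 1024**2


# Hypothetical data: one float64 column, downcast to float32.
df = pd.DataFrame({"a": np.arange(1000, dtype=np.float64)})
small = df.astype({"a": np.float32})

ratio = size_mb(small) / size_mb(df)
print(f"{size_mb(df):.4f} MB -> {size_mb(small):.4f} MB ({ratio:.1%} of original)")
```

Note that a float64 to float32 downcast like this one is generally not lossless; it is used here only to show the measurement.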

Proposed solution for implementation in Radis

Pass the pandas data frame through this function before returning it.

erwanp commented 3 years ago

Very promising !!

I'm surprised by this one :

******************************
Column:  int
dtype before:  float64
dtype after:  uint8
******************************

int is the (unscaled) linestrength. I think it was caught by your [min/max] condition:

if mn >= 0:
    if mx < 255:
        props[col] = props[col].astype(np.uint8)

Linestrength values are very low but definitely not integers.

I'm quite sure such a change breaks the spectrum.
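To illustrate the concern, here is a small sketch with made-up linestrength-like values (not taken from any database): casting tiny positive floats to uint8 truncates every one of them to zero.

```python
import numpy as np

# Made-up values of the same order of magnitude as typical linestrengths.
linestrengths = np.array([1.2e-25, 3.4e-22, 5.6e-20])

# Casting to an unsigned 8-bit integer drops the fractional part entirely.
as_uint8 = linestrengths.astype(np.uint8)

print(as_uint8.dtype, as_uint8)  # all values become 0
```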

We can also hardcode the type change based on the column name, given that the number of supported databases remains limited. This may be more robust eventually. Your function then becomes useful to get an idea of which type to use.
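A minimal sketch of what hardcoding by column name could look like. The `COLUMN_DTYPES` table and the `apply_dtypes` helper are hypothetical, and the column names other than `int` are assumptions for illustration, not a statement of what RADIS uses.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column name -> target dtype.
# "int" (linestrength) deliberately stays float64.
COLUMN_DTYPES = {
    "id": np.uint8,
    "iso": np.uint8,
    "int": np.float64,
}


def apply_dtypes(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast only the columns that are listed in `dtypes` and present in `df`."""
    known = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(known)


df = pd.DataFrame(
    {"id": [2.0, 2.0], "iso": [1.0, 2.0], "int": [1.2e-25, 3.4e-22], "extra": [1, 2]}
)
typed = apply_dtypes(df, COLUMN_DTYPES)
print(typed.dtypes)
```

Columns missing from the table (here `extra`) are left untouched, so an unknown database column cannot be silently downcast.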

Do you know how to check for accuracy? You may need to override the sf.df0 loaded after a SpectrumFactory.load_databank call. Usually overwriting sf.df0 is not recommended (because metadata may be lost in the process), but you can try. Look into the transfer_metadata function, or simply copy/paste the .attrs dict?
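As a rough sketch of the "copy the .attrs dict" route, using a plain pandas DataFrame as a stand-in for sf.df0 (no RADIS calls are shown, and the `molecule` metadata key and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-in for the loaded line database, with some metadata attached.
original = pd.DataFrame({"wav": [2000.1, 2000.2], "int": [1.2e-25, 3.4e-22]})
original.attrs["molecule"] = "CO"  # hypothetical metadata entry

# Change a dtype, then copy the metadata across explicitly.
crunched = original.astype({"wav": np.float32})
crunched.attrs = dict(original.attrs)

# Compare values only; dtypes are expected to differ after crunching.
assert_frame_equal(
    crunched, original, check_dtype=False, check_exact=False, rtol=1e-6
)
```

The tolerance decides what counts as "accurate": with `rtol=1e-6` the float32 rounding of `wav` passes, while a much tighter tolerance would flag it.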

encrypted-soul commented 3 years ago

@erwanp "int" was getting converted to uint8 before because of the lines

# test if column can be converted to an integer
asint = props[col].fillna(0).astype(np.int64)
result = (props[col] - asint)
result = result.sum()

# lines with the problem
if result > -0.01 and result < 0.01:
    IsInt = True

I just changed the bound on result to a much smaller value, and I no longer get any assertion errors when checking with assert_frame_equal:

assert_frame_equal(df1, df0, check_exact=False, rtol=0.1e-50, atol=0.1e-50)
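An alternative to tightening the bound on the summed difference would be an element-wise test, since a sum can stay small either because every value is tiny or because positive and negative differences cancel. The sketch below is an assumption about how that could look, not the notebook's code; `sum_based_check` mirrors the test quoted above and `is_integer_valued` is a hypothetical replacement.

```python
import numpy as np
import pandas as pd


def sum_based_check(series: pd.Series, bound: float = 0.01) -> bool:
    """The quoted test: summed difference to the int64 cast lies within +/- bound."""
    as_int = series.fillna(0).astype(np.int64)
    return bool(abs((series - as_int).sum()) < bound)


def is_integer_valued(series: pd.Series) -> bool:
    """True only if every non-NaN value is exactly a whole number."""
    values = series.dropna().to_numpy(dtype=np.float64)
    return bool(np.all(values == np.round(values)))


linestrengths = pd.Series([1.2e-25, 3.4e-22, 5.6e-20])  # made-up values
quantum_numbers = pd.Series([0.0, 1.0, 2.0, np.nan])    # made-up values

print(sum_based_check(linestrengths), is_integer_valued(linestrengths))
print(sum_based_check(quantum_numbers), is_integer_valued(quantum_numbers))
```

With the element-wise test there is no bound to tune: the linestrength-like column is rejected regardless of how small its values are.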

I would like to proceed with implementing this lossless crunching to the main codebase. The idea would be to crunch all the data types and also resolve the current issues we are facing with parsing quanta (like #280).

I tried to pass the reduce_mem_usage above through parse_global_quanta and parse_local_quanta but I am facing errors at a few places.

These errors aside, how would you like me to implement this in the codebase? Would it be good to hardcode the datatypes, set the NaN values to -1, and avoid dropping the lines themselves, as you were suggesting in the discussion of #280?
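A minimal sketch of the "NaN becomes -1" idea, assuming -1 is never a valid value for the column in question. The helper name `to_int_with_sentinel` and the column `v1` are hypothetical.

```python
import numpy as np
import pandas as pd


def to_int_with_sentinel(series: pd.Series, dtype=np.int8, sentinel: int = -1) -> pd.Series:
    """Replace NaN by a sentinel, then cast to a small signed integer dtype."""
    filled = series.fillna(sentinel)
    info = np.iinfo(dtype)
    # Refuse the cast if any value would overflow the target dtype.
    if filled.min() < info.min or filled.max() > info.max:
        raise ValueError(f"values do not fit in {np.dtype(dtype).name}")
    return filled.astype(dtype)


# Hypothetical quantum-number column with one unparsed (NaN) entry.
v1 = pd.Series([0.0, 3.0, np.nan], name="v1")
v1_int = to_int_with_sentinel(v1)
print(v1_int.tolist(), v1_int.dtype)
```

The rows are kept rather than dropped, and downstream code can filter on `v1_int != -1` where a real quantum number is required. The sketch does not check that the values are whole numbers before casting.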

erwanp commented 3 years ago

Hello ! Unfortunately I'll have very limited time this week to provide any insightful answer ; maybe @dcmvdbekerom can help ?

erwanp commented 3 years ago

Do you have more information on the errors you're facing ?

encrypted-soul commented 3 years ago

@erwanp A few are just assertions that check the datatype of columns. I don't think this would be an issue.

Other than that, I could conclude from the errors that they are mainly due to NaN values generated during the calculation of Intensity. I am currently working on resolving this.

erwanp commented 3 years ago

Related : see excellent benchmark f32/f64 in Exojax : https://github.com/HajimeKawahara/exojax/issues/106

radis / radis-benchmark

manual benchmark to crunch datatypes #11