Open encrypted-soul opened 3 years ago
Very promising!!
I'm surprised by this one:
```
******************************
Column: int
dtype before: float64
dtype after: uint8
******************************
```

`int` is the (unscaled) linestrength.
I think it was caught by your [min/max] condition:

```python
if mn >= 0:
    if mx < 255:
        props[col] = props[col].astype(np.uint8)
```
Linestrength values are very low but definitely not integers.
I'm quite sure such a change breaks the spectrum.
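To illustrate the failure mode, here is a minimal standalone sketch (not RADIS code; the values are made up): tiny non-integer linestrengths in `[0, 255)` pass the min/max test above, yet a `uint8` cast truncates them all to zero.

```python
import numpy as np
import pandas as pd

# Hypothetical unscaled linestrengths: all in [0, 255), none integers
s = pd.Series([1.3e-22, 4.7e-21, 9.9e-20], name="int")

print(s.min() >= 0 and s.max() < 255)  # True -> column wrongly flagged for uint8
crunched = s.astype(np.uint8)
print(crunched.tolist())  # [0, 0, 0] -> all information lost, spectrum broken
```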
We could also hardcode the type change based on the column name, given that the number of supported databases remains limited. This may eventually be more robust. Your function then becomes useful for getting an idea of which type to use.
Do you know how to check for accuracy? You may need to override the `sf.df0` loaded after a `SpectrumFactory.load_databank`. Overwriting `sf.df0` is usually not recommended (metadata may be lost in the process), but you can try. Look into the `transfer_metadata` function, or simply copy the `.attrs` dict?
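A minimal sketch of the `.attrs`-copy idea (the column names and metadata keys here are hypothetical, not the actual RADIS schema): copy the metadata dict across after replacing the dataframe, so downstream code still finds it.

```python
import pandas as pd

# stand-in for the dataframe produced by load_databank, with its metadata
df0 = pd.DataFrame({"wav": [2150.1, 2150.2], "int": [1e-21, 3e-21]})
df0.attrs = {"molecule": "CO", "iso": 1}  # hypothetical metadata keys

# crunch a column's dtype on a copy (example dtype change)
df_crunched = df0.copy()
df_crunched["int"] = df_crunched["int"].astype("float32")

# copy metadata across so nothing is lost when df0 is overwritten
df_crunched.attrs = dict(df0.attrs)
print(df_crunched.attrs)  # {'molecule': 'CO', 'iso': 1}
```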
@erwanp "int" was getting converted to `uint8` before because of these lines:

```python
# test if column can be converted to an integer
asint = props[col].fillna(0).astype(np.int64)
result = (props[col] - asint)
result = result.sum()
# lines with the problem
if result > -0.01 and result < 0.01:
    IsInt = True
```
I just changed the bound on `result` to a much smaller value and got no assertion errors when checking with `assert_frame_equal`:

```python
assert_frame_equal(df1, df0, check_exact=False, rtol=0.1e-50, atol=0.1e-50)
```
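As a side note, summing the residuals can let positive and negative deviations cancel out. A stricter variant (a sketch with an assumed helper name, not the RADIS API) compares the per-element residual against a tight tolerance instead:

```python
import numpy as np
import pandas as pd

def is_integer_column(col: pd.Series, atol: float = 1e-50) -> bool:
    """Return True only if every value is within atol of an integer."""
    filled = col.fillna(0)
    asint = filled.astype(np.int64)
    # element-wise residual: cancellation between lines cannot hide errors
    return bool((filled - asint).abs().max() <= atol)

print(is_integer_column(pd.Series([1.0, 2.0, 3.0])))     # True
print(is_integer_column(pd.Series([1.3e-22, 4.7e-21])))  # False
```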
I would like to proceed with implementing this lossless crunching to the main codebase. The idea would be to crunch all the data types and also resolve the current issues we are facing with parsing quanta (like #280).
I tried to pass the `reduce_mem_usage` function above through `parse_global_quanta` and `parse_local_quanta`, but I am facing errors in a few places.
These errors aside, how would you like me to implement this in the codebase? Would it be good to hardcode the datatype for each column and set NaN values to -1, rather than dropping the lines themselves, as you suggested in the discussion of #280?
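A minimal sketch of that hardcoded-dtype idea (the column names and dtypes below are illustrative, not the actual RADIS schema): map each known column to a fixed dtype and encode unparsed quanta as a -1 sentinel so no line is dropped.

```python
import numpy as np
import pandas as pd

# hypothetical per-column dtype table for a supported database format
COLUMN_DTYPES = {"v1u": np.int16, "v1l": np.int16, "jl": np.int16}

def apply_known_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    for col, dtype in COLUMN_DTYPES.items():
        if col in df:
            # -1 marks "quantum number not parsed", instead of dropping the line
            df[col] = df[col].fillna(-1).astype(dtype)
    return df

df = pd.DataFrame({"v1u": [1.0, np.nan, 2.0], "jl": [0.0, 5.0, np.nan]})
df = apply_known_dtypes(df)
print(df["v1u"].tolist())  # [1, -1, 2]
```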
Hello! Unfortunately I'll have very limited time this week to provide any insightful answer; maybe @dcmvdbekerom can help?
Do you have more information on the errors you're facing ?
@erwanp a few are just assertions that check the datatype of columns; I don't think those will be an issue.
Other than that, I could conclude from the errors that they are mainly due to NaN values generated during the calculation of `Intensity`. I am working on resolving this currently.
Related: see the excellent f32/f64 benchmark in ExoJAX: https://github.com/HajimeKawahara/exojax/issues/106
Brief Description
Contains a .ipynb with a function to crunch datatypes. It was able to crunch a dataframe from 565.28 MB down to 349.77 MB, around 61.875% of the original size. The compression is lossless.

Edit @erwanp: file visible here https://github.com/radis/radis-benchmark/blob/e512ecfcdd6d399f10d2fca19d9b0ab6ae475904/manual_benchmarks/test_datatypes_crunch.ipynb
Proposed solution for implementation in Radis
Pass the pandas DataFrame through this function before returning it.
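A minimal sketch of that integration point (the function name `reduce_mem_usage` follows the notebook, but the downcasting rule shown here is illustrative, not the full implementation, and `load_databank` below is a stand-in loader): downcast only when the round-trip back to `float64` is bit-exact, so the crunch stays lossless.

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:
        if df[col].dtype == np.float64:
            as32 = df[col].astype(np.float32)
            # only downcast when the float64 -> float32 -> float64 trip is lossless
            if np.array_equal(as32.astype(np.float64), df[col], equal_nan=True):
                df[col] = as32
    return df

def load_databank(path: str) -> pd.DataFrame:
    # stand-in loader: wavenumbers fit in float32, linestrengths do not
    df = pd.DataFrame({"wav": [2150.0, 2151.0], "int": [1e-300, 2e-300]})
    return reduce_mem_usage(df)  # crunch dtypes before returning

df = load_databank("CO.par")
print(df.dtypes.tolist())  # [dtype('float32'), dtype('float64')]
```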