glamod / cdm_reader_mapper

Common Data Model reader and mapper toolbox
https://cdm-reader-mapper.readthedocs.io/en/latest/?version=latest
Apache License 2.0
1 stars 5 forks source link

Get rid of data conversion between both integer and float types #60

Closed ludwiglierhammer closed 4 months ago

github-actions[bot] commented 4 months ago

Warning This Pull Request is coming from a fork and must be manually tagged approved in order to perform additional testing.

ludwiglierhammer commented 4 months ago

@aanderss: Could you please review this PR. Tests for deck 714 are passing now. I had to make some changes to the expected outputs. The inputs remain the same. Please takea look at the changes and confirm if they are correct. I have based the changes on the inputs. After that I can adjust the other test data.

ludwiglierhammer commented 4 months ago

Hi @jtsiddons, may you test your data with this forked branch? This is the branch that belongs to this PR. Is everything working?

jtsiddons commented 4 months ago

Hi @ludwiglierhammer - I have tried to re-run the example_notebooks/create_data_model.ipynb notebook. It fails when trying to read with the schema created in the notebook:

The notebook takes an existing schema, adds a few fields to the C99 section, then saves the schema (and a copy of the original code tables) to a temporary directory (the my_model_path value passed to data_model_path).

data_new = mdf_reader.read(data.get("source"), data_model_path=my_model_path)

with output:

2024-06-28 13:44:46,381 - root - INFO - READING DATA MODEL SCHEMA FILE...
2024-06-28 13:44:46,383 - root - INFO - EXTRACTING DATA FROM MODEL: /tmp/tmpcl79bgrn/imma1_d701
2024-06-28 13:44:46,384 - root - INFO - Getting data string from source...

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 data_new = mdf_reader.read(data.get("source"), data_model_path=my_model_path)

File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/read.py:317, in read(source, data_model, data_model_path, **kwargs)
    283 """Read data files compliant with a user specific data model.
    284 
    285 Reads a data file to a pandas DataFrame using a pre-defined data model.
   (...)
    309     from ``source``.
    310 """
    311 logging.basicConfig(
    312     format="%(levelname)s\t[%(asctime)s](%(filename)s)\t%(message)s",
    313     level=logging.INFO,
    314     datefmt="%Y%m%d %H:%M:%S",
    315     filename=None,
    316 )
--> 317 return MDFFileReader(
    318     source=source, data_model=data_model, data_model_path=data_model_path
    319 ).read(**kwargs)

File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/read.py:236, in MDFFileReader.read(self, chunksize, sections, skiprows, out_path, convert, decode, converter_dict, converter_kwargs, validate, **kwargs)
    233 # 2.2 Homogenize input data to an iterable with dataframes:
    234 # a list with a single dataframe or a pd.io.parsers.TextFileReader
    235 logging.info("Getting data string from source...")
--> 236 self.configurations = self._get_configurations(read_sections_list, sections)
    237 self.data, self.isna = self._open_data(
    238     read_sections_list,
    239     sections,
   (...)
    242     chunksize=chunksize,
    243 )
    244 ## 2.3. Extract, read and validate data in same loop
    245 # logging.info("Extracting and reading sections")

File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:421, in _FileReader._get_configurations(self, order, valid)
    418 def _get_configurations(self, order, valid):
    419     config_dict = Configurator(
    420         schema=self.schema, order=order, valid=valid
--> 421     ).get_configuration()
    422     for attr, val in config_dict["self"].items():
    423         setattr(self, attr, val)

File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:232, in Configurator.get_configuration(self)
    230 if converters:
    231     convert[index] = converters
--> 232 conv_kwargs = self._get_conv_kwargs()
    233 if conv_kwargs:
    234     kwargs[index] = conv_kwargs

File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:189, in Configurator._get_conv_kwargs(self)
    187 if column_type is None:
    188     return {}
--> 189 return {
    190     converter_arg: self.sections_dict.get(converter_arg)
    191     for converter_arg in properties.data_type_conversion_args.get(column_type)
    192 }

TypeError: 'NoneType' object is not iterable

I don't have access to my other testing files at present - I am on leave until Tuesday morning and can test them when I am back in the office.

jtsiddons commented 4 months ago

@ludwiglierhammer - The notebook ran correctly when I changed the data-types that I had set in the custom component of the schema, following your commits.

ludwiglierhammer commented 4 months ago

@ludwiglierhammer - The notebook ran correctly when I changed the data-types that I had set in the custom component of the schema, following your commits.

@jtsiddons: Pretty good 😃. For testing reasons you can make a fork of my PR branch. Then you don't have to copy my commits manually into your main fork.

ludwiglierhammer commented 4 months ago

@jtsiddons: I am really happy that you keep taking a look at the notebooks. Thank you for your help!

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (2cfe674) to head (2ee13bb). Report is 1 commits behind head on main.

Files Patch % Lines
cdm_reader_mapper/cdm_mapper/table_writer.py 92.85% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #60 +/- ## ========================================== - Coverage 83.97% 83.86% -0.12% ========================================== Files 61 61 Lines 2578 2553 -25 ========================================== - Hits 2165 2141 -24 + Misses 413 412 -1 ``` | [Flag](https://app.codecov.io/gh/glamod/cdm_reader_mapper/pull/60/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=glamod) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/glamod/cdm_reader_mapper/pull/60/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=glamod) | `83.86% <96.77%> (-0.12%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=glamod#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ludwiglierhammer commented 4 months ago

@aanderss: the testing suite is passing again with some adjustements in the testdata 🎉 .

jtsiddons commented 4 months ago

@ludwiglierhammer: Is it worth aliasing the original dtypes (uint8, float32, etc) to the default types when reading the schema.

It would allow for parsing of the original schema. Something along the lines of:

if "int" in schema["sections"][section]["elements"][element].get("column_type").lower():
    schema["sections"][section]["elements"][element].update(
        {"column_type": "int"}
    )

Similar for float?

If a user had an older schema that contains the original column_type values then they would get the TypeError: 'NoneType' object is not iterable error, which may not be helpful to them. Could also show a warning that use of the original dtypes will be deprecated and state that int or float is used instead.

ludwiglierhammer commented 4 months ago

@ludwiglierhammer: Is it worth aliasing the original dtypes (uint8, float32, etc) to the default types when reading the schema.

It would allow for parsing of the original schema. Something along the lines of:

if "int" in schema["sections"][section]["elements"][element].get("column_type").lower():
    schema["sections"][section]["elements"][element].update(
        {"column_type": "int"}
    )

Similar for float?

If a user had an older schema that contains the original column_type values then they would get the TypeError: 'NoneType' object is not iterable error, which may not be helpful to them. Could also show a warning that use of the original dtypes will be deprecated and state that int or float is used instead.

@jtsiddons: When reading the schema I added a function to convert the data type to the default data types

def convert_dtype_to_default(dtype, section, element):
    """Convert data type to defaults (int, float)."""
    if dtype is None:
        return
    elif dtype == "float":
        return dtype
    elif dtype == "int":
        return properties.pandas_int
    elif "float" in dtype.lower():
        logging.warning(
            f"Set column type of ({section}, {element}) from deprecated {dtype} to float."
        )
        return "float"
    elif "int" in dtype.lower():
        logging.warning(
            f"Set column type of ({section}, {element}) from deprecated {dtype} to int."
        )
        return properties.pandas_int
    return dtype