Closed: ludwiglierhammer closed this pull request 4 months ago
@aanderss: Could you please review this PR? Tests for deck 714 are passing now. I had to make some changes to the expected outputs; the inputs remain the same. Please take a look at the changes and confirm that they are correct. I based the changes on the inputs. After that I can adjust the other test data.
Hi @jtsiddons, could you test your data against this forked branch? It is the branch that belongs to this PR. Is everything working?
Hi @ludwiglierhammer - I have tried to re-run the example_notebooks/create_data_model.ipynb notebook. It fails when trying to read with the schema created in the notebook. The notebook takes an existing schema, adds a few fields to the C99 section, and then saves the schema (together with a copy of the original code tables) to a temporary directory (the my_model_path value passed to data_model_path).
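For context, a minimal sketch of that schema-modification step using only the standard library; the directory layout, file names, and the added field are illustrative assumptions, not the notebook's actual values:

import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical paths; the notebook works with its own data model directory.
original_model_dir = Path("imma1_d701")
my_model_path = Path(tempfile.mkdtemp())

# Copy the original code tables alongside the modified schema.
shutil.copytree(original_model_dir / "code_tables", my_model_path / "code_tables")

# Load the existing schema, add a field to the C99 section, and save the result.
with open(original_model_dir / "imma1_d701.json") as f:
    schema = json.load(f)

schema["sections"]["c99"]["elements"]["my_new_field"] = {
    # A numpy-style dtype like this is what later triggers the TypeError below.
    "column_type": "uint8",
}

with open(my_model_path / "imma1_d701.json", "w") as f:
    json.dump(schema, f, indent=2)

The failing call then points mdf_reader.read at that temporary directory: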
data_new = mdf_reader.read(data.get("source"), data_model_path=my_model_path)
with output:
2024-06-28 13:44:46,381 - root - INFO - READING DATA MODEL SCHEMA FILE...
2024-06-28 13:44:46,383 - root - INFO - EXTRACTING DATA FROM MODEL: /tmp/tmpcl79bgrn/imma1_d701
2024-06-28 13:44:46,384 - root - INFO - Getting data string from source...
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 data_new = mdf_reader.read(data.get("source"), data_model_path=my_model_path)
File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/read.py:317, in read(source, data_model, data_model_path, **kwargs)
283 """Read data files compliant with a user specific data model.
284
285 Reads a data file to a pandas DataFrame using a pre-defined data model.
(...)
309 from ``source``.
310 """
311 logging.basicConfig(
312 format="%(levelname)s\t[%(asctime)s](%(filename)s)\t%(message)s",
313 level=logging.INFO,
314 datefmt="%Y%m%d %H:%M:%S",
315 filename=None,
316 )
--> 317 return MDFFileReader(
318 source=source, data_model=data_model, data_model_path=data_model_path
319 ).read(**kwargs)
File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/read.py:236, in MDFFileReader.read(self, chunksize, sections, skiprows, out_path, convert, decode, converter_dict, converter_kwargs, validate, **kwargs)
233 # 2.2 Homogenize input data to an iterable with dataframes:
234 # a list with a single dataframe or a pd.io.parsers.TextFileReader
235 logging.info("Getting data string from source...")
--> 236 self.configurations = self._get_configurations(read_sections_list, sections)
237 self.data, self.isna = self._open_data(
238 read_sections_list,
239 sections,
(...)
242 chunksize=chunksize,
243 )
244 ## 2.3. Extract, read and validate data in same loop
245 # logging.info("Extracting and reading sections")
File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:421, in _FileReader._get_configurations(self, order, valid)
418 def _get_configurations(self, order, valid):
419 config_dict = Configurator(
420 schema=self.schema, order=order, valid=valid
--> 421 ).get_configuration()
422 for attr, val in config_dict["self"].items():
423 setattr(self, attr, val)
File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:232, in Configurator.get_configuration(self)
230 if converters:
231 convert[index] = converters
--> 232 conv_kwargs = self._get_conv_kwargs()
233 if conv_kwargs:
234 kwargs[index] = conv_kwargs
File /tmp/cdm_reader_mapper/cdm_reader_mapper/mdf_reader/utils/auxiliary.py:189, in Configurator._get_conv_kwargs(self)
187 if column_type is None:
188 return {}
--> 189 return {
190 converter_arg: self.sections_dict.get(converter_arg)
191 for converter_arg in properties.data_type_conversion_args.get(column_type)
192 }
TypeError: 'NoneType' object is not iterable
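The TypeError comes from the dict comprehension in _get_conv_kwargs: properties.data_type_conversion_args.get(column_type) returns None for a column_type it does not know about, and the comprehension then tries to iterate over that None. A minimal reproduction of the failure mode (the dictionary contents here are illustrative stand-ins, not the library's actual values):

# Stand-in for properties.data_type_conversion_args; only the fact that it has
# no entry for the custom dtype matters here.
data_type_conversion_args = {"int": ["scale", "offset"], "float": ["scale", "offset"]}

column_type = "uint8"  # a dtype the mapping does not know about
sections_dict = {}

# Same pattern as in _get_conv_kwargs: .get() returns None, and iterating over
# None raises TypeError: 'NoneType' object is not iterable.
conv_kwargs = {
    arg: sections_dict.get(arg)
    for arg in data_type_conversion_args.get(column_type)
}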
I don't have access to my other testing files at present - I am on leave until Tuesday morning and can test them when I am back in the office.
@ludwiglierhammer - The notebook ran correctly when I changed the data types that I had set in the custom component of the schema, following your commits.
@jtsiddons: Pretty good 😃. For testing purposes you can fork my PR branch; then you don't have to copy my commits manually into your main fork.
@jtsiddons: I am really happy that you keep taking a look at the notebooks. Thank you for your help!
Attention: Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
Project coverage is 83.86%. Comparing base (2cfe674) to head (2ee13bb). Report is 1 commit behind head on main.
Files | Patch % | Lines |
---|---|---|
cdm_reader_mapper/cdm_mapper/table_writer.py | 92.85% | 1 Missing :warning: |
@aanderss: the testing suite is passing again after some adjustments to the test data 🎉.
@ludwiglierhammer: Is it worth aliasing the original dtypes (uint8, float32, etc.) to the default types when reading the schema? It would allow parsing of the original schema. Something along the lines of:

if "int" in (schema["sections"][section]["elements"][element].get("column_type") or "").lower():
    schema["sections"][section]["elements"][element].update({"column_type": "int"})

and similarly for float.
If a user has an older schema that contains the original column_type values, they would currently get the TypeError: 'NoneType' object is not iterable error, which may not be helpful to them. You could also show a warning that use of the original dtypes will be deprecated and state that int or float is used instead.
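A minimal sketch of such an aliasing pass over a loaded schema dict, assuming the nested sections/elements layout from the snippet above (the helper name and warning wording are illustrative, not part of the library):

import logging

def alias_legacy_column_types(schema):
    """Map legacy numpy-style dtypes (uint8, float32, ...) onto the default int/float."""
    for section, sec_body in schema.get("sections", {}).items():
        for element, elem_body in sec_body.get("elements", {}).items():
            column_type = (elem_body.get("column_type") or "").lower()
            if not column_type or column_type in ("int", "float"):
                continue  # nothing to alias
            if "int" in column_type:
                new_type = "int"
            elif "float" in column_type:
                new_type = "float"
            else:
                continue  # non-numeric types stay untouched
            logging.warning(
                "Use of dtype %r for (%s, %s) is deprecated; using %r instead.",
                elem_body["column_type"], section, element, new_type,
            )
            elem_body["column_type"] = new_type
    return schema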
@jtsiddons: When reading the schema, I added a function that converts the data types to the defaults:
def convert_dtype_to_default(dtype, section, element):
    """Convert data type to defaults (int, float)."""
    if dtype is None:
        return
    elif dtype == "float":
        return dtype
    elif dtype == "int":
        return properties.pandas_int
    elif "float" in dtype.lower():
        logging.warning(
            f"Set column type of ({section}, {element}) from deprecated {dtype} to float."
        )
        return "float"
    elif "int" in dtype.lower():
        logging.warning(
            f"Set column type of ({section}, {element}) from deprecated {dtype} to int."
        )
        return properties.pandas_int
    return dtype
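For reference, a sketch of how such a conversion could be applied while the schema elements are read; the loop and the nested schema layout are assumptions taken from the earlier snippets, not the library's actual reading code:

# Illustrative only: apply the conversion to every element's column_type.
for section, sec_body in schema["sections"].items():
    for element, elem_body in sec_body["elements"].items():
        elem_body["column_type"] = convert_dtype_to_default(
            elem_body.get("column_type"), section, element
        )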