refactor(datafile): use len(obj) rather than obj.get_nrecords()

mwtoews commented 2 months ago

This PR has a few aims related to data files, including FormattedHeadFile, HeadFile and CellBudgetFile.

These files have a "number of records" property that was implemented with get_nrecords(). This "length of the object" measure is more naturally done with __len__, i.e. len(headsfile).

It is advised to prefer the len() approach, so instances of get_nrecords() for these files show a DeprecationWarning.

The CellBudgetFile also has a .nrecords property. It is also advised to show a DeprecationWarning with this property.

This PR also fixes a bug with get_nrecords() shown here: https://github.com/modflowpy/flopy/blob/ea3e475c2a566872670d067b7c611861d4be8c54/flopy/utils/datafile.py#L427-L430

the bug is that recordarray is a structured array (np.ndarray), not a record array, so this always silently returns 0. This bug does not apply to CellBudgetFile, which worked fine and matches obj.nrecords.

Fixing this bug caused this test to fail, since the for-loop was never activated. A "todo" note is added since the reversed header is not the same as the original header.

Another aim of this PR is to re-organize a few CellBudgetFile tests from test_binaryfile.py to test_cellbudgetfile.py. Most of this is copied with perhaps minor simplifications.

Note that none of these changes apply to flopy.utils.swroutputfile.SwrBudget.get_nrecords(), which returns a tuple.

There is also still room for discussion if get_nrecords() or a dynamic .nrecords property should be preferred over len(obj). This PR can be adjusted accordingly. Opinions welcome!

langevin-usgs commented 2 months ago

I'm in favor of len(obj). Sadly, len doesn't correspond to the number of times the data is available, but rather the number of records, which are written by layer with DIS/DISV. Whatever we can do to make our objects as pythonic as possible seems to be a step in the right direction. The recordarray misnomer is something that has been propagated for years due to a misunderstanding on our (my) part regarding the differences between structured arrays and record arrays -- something to keep in mind with a needed refactoring.

wpbonelli commented 2 months ago

Sadly, len doesn't correspond to the number of times the data is available, but rather the number of records, which are written by layer with DIS/DISV.

Agreed len() suggests to me the number of rows in tabular data, as in numpy and pandas. A couple thoughts on this (hopefully not too off-base — maybe I have misunderstood something)

Flopy's approach to model outputs is to provide classes wrapping MODFLOW/etc output files, which may contain several tables of different shapes. This could be seen as a leaky abstraction. Does flopy really need to reproduce MODFLOW output file structure in user-facing APIs?
Flopy uses "record" to mean "table of some kind of data", where it commonly means a single row of tabular data, so "number of records"/"length of the object" really means "number of separate tables in this output file"

Maybe a future flopy could consider static routines to read model outputs e.g.

flopy.read_binary_file("path/to/file.cbc", data="FLOW RIGHT FACE")

File classes could live on under the hood, e.g. for indexing/caching, precision detection, etc, but it seems ideal to try to hide this and return one table at a time? The user would still need to know which sort of data can be found in which output file, but even that seems like it could be hidden with suitable shortcuts on model.output.

In the meantime I like len(obj) as sugar for get_nrecords() though I wonder if there is some way to emphasize this is not a row (i.e. time) count

mwtoews commented 2 months ago

Reply to a few points of discussion...

Agreed len() suggests to me the number of rows in tabular data, as in numpy and pandas.

Xref to #2221 which adds a .headers data frame with this number of headers.

Maybe a future flopy could consider static routines to read model outputs e.g.
flopy.read_binary_file("path/to/file.cbc", data="FLOW RIGHT FACE")

With CellBudget files, a somewhat expensive component is to build an index of headers, which is ideally done once, then re-used to get data from more than one component. This process wouldn't work well with static routines.

In the meantime I like len(obj) as sugar for get_nrecords() though I wonder if there is some way to emphasize this is not a row (i.e. time) count

I was considering keeping/adding obj.nrecords to be the same as len(obj), but the docstrings for __len__ should be sufficient without adding another property.

wpbonelli commented 2 months ago

@mwtoews can you resolve the datafile.py conflict? then we can bring this in

modflowpy / flopy

refactor(datafile): use len(obj) rather than obj.get_nrecords() #2215