Closed dpole closed 3 years ago
About the benefits, 1. is a very popular idea in the c++ community, but I don't think it has practical advantages so I wouldn't use it as an argument. About 2, true, but I wouldn't consider it a driver. About 3, it does not sound so different from providing a list of names that people should conform with. About 5, it is a hack to try to overcome the problems introduced by 4
Thanks for having opened this, @dpole! Before discussing your list, it's better to state a few facts:
The Detector
class is meant to closely match the IMO quantity (detector) info
: there should be a one-to-one relationship to the metadata in the IMO quantity and the fields in the Detector
class. (This is actually the case with the current implementation in master
.)
To remark the association, Detector
should probably be renamed to DetectorInfo
.
The fields in the IMO quantity have been chosen according to the toast-litebird
repository, because they have been used in the TOAST simulations done so far for LiteBIRD.
Detector
should really be a dataclass, but unfortunately Python 3.6 does not support them. However, I discovered that dataclasses were inspired by the attrs library, which is available in Python 3.6. Another possibility is to use the dataclasses backport, which is compatible with Python 3.6. (It does not work in Python 3.7+, but Poetry permits to restrict dependencies to specific Python versions.) It's my desire to use either one or another for Detector
, as I shall detail at the end of this post.
(The advantage of the dataclasses
library is that, once we drop support for Python 3.6, we can remove the dependency altogether. OTOH, attrs
has a few more nice features like validators, but its syntax might look more alien in the future, when everybody will be accustomed to PEP 557.)
Here are my comments about the points you listed above:
The importance of using classes in specific places can bring several advantages, which are relevant even outside the C++ community (in fact, my reference to C++ was just a joke). As an instance, the Django framework forces you to build your data model using objects. This is relevant, because it's the way I implemented entities, quantities, data files, etc. in instrumentdb
, e.g.:
# Abridged
class DataFile(models.Model):
uuid = models.UUIDField(...)
name = models.CharField(...)
upload_date = models.DateTimeField(...)
metadata = models.TextField(...)
file_data = models.FileField(...)
quantity = models.ForeignKey(...)
spec_version = models.CharField(...)
dependencies = models.ManyToManyField(...)
plot_file = models.FileField(...)
...
Django is one of the most famous Python frameworks in the world, so I believe that we can consider the usage of classes in Python to be fully «pythonic». This being said, classes have several advantages:
The type of a parameter is self-evident in declarations; compare
def get_cosmic_ray_hit_ratio(det: Detector):
...
with
# Which fields am I expected to pass here?!?
def get_cosmic_ray_hit_ratio(det: Dict[str, Any]):
...
The type of the class' fields is explicit:
class Detector:
fknee_mhz: float
pixel: int
...
This is great for self-documentation (modern editors are even able to provide type hints while you're writing code), and it allows code checkers like mypy
, pyre-check
, pyright
, or pytype
to spot type errors. Example:
# True story: I really made this mistake a few weeks ago
Detector = det.from_imo(imo, ...)
det.pixel = "lft-002" # Force a different pixel
Using mypy
, you get:
$ mypy myscript.py
myscript.py:7: error: Incompatible types in assignment (expression has type "str", variable has type "int")
We could even enforce correct types if we employ the attrs
library; see the page about validators, which shows that you can raise a runtime error if you call Detector(pixel="a")
and the type of pixel
is an integer.
The __init__
constructor can initialize each value to a sensible default and check for inconsistencies and wrong numbers (e.g., an instrument name that does not match lft
, mft
, or hft
). The current implementation of Detector.__init__
already does something of this sort: if you do not provide a quaternion, it will use the tuple 0, 0, 0, 1
, which corresponds to a detector aligned with the boresight. Another nice addition would be to automatically normalize the quaternion to enforce the requirement that it always represents a rotation, etc.
In some cases it makes sense that functions that operate on members of the class be turned into methods of that class. Consider this:
def get_cosmic_ray_hit_ratio(det: Detector) -> float:
...
versus this:
class Detector:
....
def get_cosmic_ray_hit_ratio(self) -> float:
...
The biggest advantage is discoverability: if I don't remember how to compute the ratio of cosmic ray hits while I'm coding in a Jupyter notebook, I can just write
det = Detector()
det.█ # The cursor is here
and pressing TAB I'll see what methods and fields are available, instead of digging through all the pages of documentation.
Thus, I believe that classes bring several usability advantages over dictionaries.
It depends on the workflow, but I agree that it's not a killer feature
For me the feature is important. First, it enables auto-completion in modern editors:
det = Detector(█ # I want to specify the knee frequency: what should I write?
If I don't remember the name of the parameter (is fknee_mhz
, frequency_knee_mhz
or fknee_mHz
?), pressing TAB will present the alternatives and complete the name for me. OTOH, if I rewrite the statement above using dictionaries, I either need to read the documentation or run the code and wait for runtime errors later (or wrong values in the output!):
det = { "fknee_mHz": 10.0 } # Oh no, I mispelled it!
Moreover, code checkers have a chance to spot bugs here as well:
# Another true story from not much time ago...
# I keep doing the same errors over and over again
det = Detector.from_imo(imo, "...")
det.fknee_mHz *= 2 # Typo here!
Using a type checker like mypy
causes the following error:
$ mypy myscript.py
myscript.py:7: error: "Detector" has no attribute "fknee_mHz"; maybe "fknee_mhz"?
None of thes features is achievable through a list of names in the documentation.
Nothing to add
It's not true that everything is defaulted to None
, because for instance the quat
field is set to the unit quaternion (0, 0, 0, 1
), which represents the boresight (see above). I believe that the ability to specify different defaults and have complex initialization code (e.g., quaternion normalization) is not a «hack» but a very handy feature. For instance, we might decide to automatically initialize the instrument name and pixel number whenever the detector's name is passed explicitly and matches some regexp:
det = litebird_sim.Detector("L00_003_AQ_040T")
print(det.wafer) # Prints "L00"
print(det.pixel) # Prints 3
There is no clean way to do this using plain dictionaries, because dictionaries do not have constructors. To me it's important to provide a handy way to properly initialize Detector
instances, as this will be an activity that 99% of the users will do in all their scripts.
There is another point that is missing in your list: I believe that using dictionaries is not pythonic in this case. IMO, dictionaries should be used whenever the feature to add/remove arbitrary keys at runtime is important, because you cannot reasonably expect what it is going to be put inside your variable (e.g., you're reading a JSON file provided by the user). But the fields in a Detector
class are already settled, both in their name and type: they are all and only the metadata in the quantity (detector) info
, which are listed in the IMO Format Specification Document (i.e., outside our framework) and are going to be put in the IMO Specification Document as well. As I said at the top, the purpose of the Detector
class is to provide a one-to-one match with the IMO quantity, and thus we will never add/remove fields to it at runtime. (This is one of the reasons why Django's ORM uses classes instead of dictionaries to model database entities: they are not expected to change at runtime, and if you want to add/remove fields you must stop the program, run a migration, and start the server again.)
Here is my take regarding the list of complications (sorry, they should have been listed with letters instead of numbers):
The fields in Detector
closely match the metadata in the info
quantity in the IMO database. When things will be streamlined, to add a new field to an info
quantity you will first need to update the Format Specification Document, have it agreed by the IMO team, and then update the instrument database. I don't expect we'll have to do this often. Moreover, you cannot store more than a few kb of data in the metadata
field of data files in the IMO database, because it's a string field in a database table. There aren't going to be «a lot» of parameters to be included in the Detector
class: an upper bound might be 20–30. (Other stuff will go in dedicated IMO quantities, like grasp_beam
or bandshape
, which will probably deserve their own Python classes, with a classmethod constructor named from_imo
like for Detector
.)
Since Detector
will always have a low memory footprint, I see no significant objection to carry around fields that the user does not plan to use at the moment.
Finally, if we do not adopt this idea, how are users supposed to read detector's data from the info
quantities in the IMO? Currently, the IMO database is a all-or-nothing thing: once you query the metadata
field of a info
quantity you're reading all the metadata in memory. There is no way to avoid this, because it's related to the way tables are initialized and accessed by SQL commands. The only ways I can put together do not look pleasant at all:
# First take: ugly!
det_name = "L00_003_AQ_040T"
imo_path = f"/releases/v1.0/satellite/LFT/L1-040/{det_name}/info"
det = {
"pixel": imo.query_metadatum(imo, imo_path, "pixel"),
"wafer": imo.query_metadatum(imo, imo_path, "wafer"),
"net_ukrts": imo.query_metadatum(imo, imo_path, "net_ukrts"),
}
# Second take: better, but still complicated (you need to
# remember how to spell "net_ukrts") and not efficient:
# the function "build_from_imo_metadata" queries *all* the
# metadata and then filters out those that are not listed here.
imo_path = f"/releases/v1.0/satellite/LFT/L1-040/{det_name}/info"
det = build_from_imo_metadata(imo, imo_path, fields=[
"pixel",
"wafer",
"net_ukrts",
]) # This should return a dictionary
I find any of the following much more pleasant:
# This works with the code in `master` (returning a Detector object)…
det = Detector.from_imo(
"/releases/v1.0/satellite/LFT/L1-040/L00_003_QA_040T/info",
# …and this works too (returning a dictionary)
det = imo.query("/releases/v1.0/satellite/LFT/L1-040/L00_003_QA_040T/info")["metadata"]
This is what 100% of the users will do whenever they want to read a (detector) info
quantity from the IMO. Yet, in both cases they will be reading all the fields, even if they are not going to use all of them.
I think the opposite, but everybody is entitled to have their own opinions :-)
True, but… «I wouldn't consider this a driver». How many times does a user need to do this? It seems to me that the only case where this is relevant is the code that (re-)distributes detectors among MPI processes in the Observation
class, which might benefit from an easy way to iterate over the keys. I do not expect end users to rely on this very often, but I could possibly miss some cases. Do you have something specific in mind?
In any case, if we use dataclasses, it will be easy to implement this thanks to the asdict
method:
# Assuming that Detector is a dataclass...
det = Detector(...)
for key, val in det.asdict().items():
...
Again, if we use dataclasses, this feature will be available to the Detector
class as well. Until them, you can use dir(det)
to get the same. As a plus, you get the name of the methods you can call on it as well. (See my point above about the easier discoverability of methods wrt functions.)
Same as for point 3 above.
Thanks Maurizio. I think we agree on most of the aspects (included the limitations of dictionaries). Let me try to bemore precise.
From the point of view of the IMO and interactive usage I would have basically the same ideas and priorities. As you know, I've been working mostly on the Observation
class and, because of its very different requirements, I tend to priorities other aspects--the Observation
objects are probably those that will interface with performance-critical sections of the code (inside or outside litebird_sim
).
I rephrase in this perspective (which is the one I care of and triggered this issue) at least my main concern: carrying around unused data. Either we add some filtering step (and we have to decide where) or the full set of metadata percolates all the way to Observation
. I find the latter problematic. Forcing Observation
to carry all the metadata seems a strong and unnecessary constraint. Moreover, in principle, this class has to be ready for heavy detector data (not just metadata). Either we threat them all alike--in this case tens of metadata are already a non-negligible number--or we have to distinguish between heavy data and light metadata. This adds a layer of complexity that I'd rather avoid.
What do you think? If the discussion gets too specific to Observation
its probably better to discuss in #42
In fact I suspected that your point was mainly concerned with the Observation
class. I am still trying to understand the various ways to use the matrix data layout; at the moment I was able to think of the following situations:
np.random.randn()
and then scale each of them using the white noise level; this makes the presence of a vector of white noise levels handy, because it means that you can scale it with one line of code.i
and j
have correlated samples. Using the matrix data layout is extremely efficient in this case. However, the covariance matrix is not going to be stored in a Detector
class or in a dictionary and passed around, because it's not specific to just one detector. It probably needs to be computed using some algorithm or be read from a dedicated IMO object, so it must be handled differently from other quantities in this list.fknee_hz
and the alpha
parameters could give any advantage here.So, of all the parameters to be stored either in Detector
or a dictionary, I see that only a handful of them might be useful when saved in vectors (highlighted in bold in the list above). I see the following alternatives:
We continue studying a general way to propagate arbitrary quantities across all the observations, making sure that when observation matrices are reshaped these attributes are reshaped as well.
Pros:
It lets the user to apply broadcasting operations whenever they think it's suitable, because everything is already in place.
Cons:
Not trivial at all to account for reshapes of the TOD matrix, according to what you say (and I fully believe you!)
We apply point 1. only to those quantities that we feel are worth the trouble (the items in bold in my list above).
Pros:
Probably simpler than solution 1.
Cons:
Leads to ugly code and repetitions
Harder to maintain: whenever a new quantity is deemed important to be vectorized, we need to alter the Observation
class.
We do not propagate metadata across observations, forcing the user to use the syntax
# If we use a `Detector` class
wn_vector = np.array([x.parameter for x in detector_list])
# If we use dictionaries
wn_vector = np.array([x["parameter"] for x in detector_list])
whenever a NumPy array of these parameters is needed. We might even provide some helper methods:
class Observation:
...
def get_det_param_vector(self, param):
return np.array([getattr(x, param) for x in detector_list])
# Nice to have, but not required
def get_wn_vector(self):
return self.get_det_param_vector("net_ukrhz")
# Define similar methods for other parameters, e.g., "adc_num_of_bits"?
Pros:
Very easy to implement
No reshaping is needed, as these methods are expected to be called immediately before you use their return value
Cons:
List comprehension can be slow when simulating many detectors.
If one needs the same quantity in a vector many times during the execution of the code, this can lead to inefficiencies, because the vector is computed over and over again.
As I already said a while ago, my personal preference goes to the third option. List comprehension is going to be applied to few detectors (the maximum number within one frequency band is a few hundreds at most), and running the implementation of get_wn_vector
I proposed above on an array of 500 detectors only takes ~30 µs.
Moreover, vectorized counterparts of these parameters are often going to be used just once in typical scripts: for instance, the number of bits in the ADC are to be used only in the few lines of code that simulate digital quantization:
maxval = 2 ** observation.get_det_param_vector("adc_num_of_bits")
quantized_values = np.array(
np.round(observation.tod_matrix * maxval[:, None]),
dtype='int',
)
This issue is very old, and the code has evolved since the last comment here: now we use dataclasses, which make the DictionaryInfo
(new name for Dictionary
) class convertible into a dictionary using the asdict
method. Shall we close this?
Definitely, closing now
Hi all, I believe that the
Detector
class adds more complications than benefits. Yes, I know that this topic was already discussed in #42 , but let me try again or at least summariz that discussion in a dedicated issue (even I had a hard time looking up the #42 PR)Benefits are (@ziotom78 please feel free to add/amend in case a missed or misunderstood anything)
__repr__
and__str__
Detector
stores all the possible piece of information about the detectors, the user always have them at his/her disposal (for example when adding a new stage to an existing pipeline)None
Compared to a simple dictionary, I found the following complications