litebird / litebird_sim

Simulation tools for LiteBIRD
GNU General Public License v3.0

Do we really want `Detector`? #53

Closed dpole closed 3 years ago

dpole commented 4 years ago

Hi all, I believe that the Detector class adds more complications than benefits. Yes, I know that this topic was already discussed in #42, but let me try again, or at least summarize that discussion in a dedicated issue (even I had a hard time looking up the #42 PR).

Benefits are (@ziotom78 please feel free to add/amend in case I missed or misunderstood anything)

  1. detectors are an important concept, they deserve a class
  2. the class can provide custom __repr__ and __str__
  3. the class can enforce the attribute names, which helps the coherence across the framework
  4. since the Detector class stores every possible piece of information about the detectors, the user always has them at their disposal (for example when adding a new stage to an existing pipeline)
  5. In spite of 4., at construction time we can specify only the parameters we need: all the rest is defaulted to None

Compared to a simple dictionary, I found the following complications

  1. I think that the benefits of 4 are largely outweighed by the fact that every user will be interested in a small subset of quantities, probably a different one for every user. Still, if any user at any time needs a detector quantity, we will have to add it to the `Detector` attributes. The result is that *we force all the users to carry a lot of quantities even if they only care about a few*.
  2. I think that *the dictionary is a more natural and transparent data structure*, which users are more comfortable with. The following points are examples of this general idea
  3. Sometimes you want to iterate over the quantities: it is more natural if the detector is a dictionary
  4. Getting all the available detector quantities is much more natural ( det.keys() ) if det is a dictionary
  5. It happens that the name of the quantity you want to access is stored in a variable; I prefer det[variable] over getattr(det, variable)
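
To make points 3–5 concrete, here is a minimal sketch of the access patterns I have in mind (the field names are just illustrative):

    det = {"fknee_mhz": 10.0, "pixel": 3}

    # Point 3: iterating over the quantities is immediate
    for name, value in det.items():
        print(name, value)

    # Point 4: listing all the available quantities
    print(list(det.keys()))

    # Point 5: accessing a quantity whose name is stored in a variable
    variable = "fknee_mhz"
    print(det[variable])  # with a class it would be getattr(det, variable)
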
dpole commented 4 years ago

About the benefits, 1. is a very popular idea in the C++ community, but I don't think it has practical advantages, so I wouldn't use it as an argument. About 2, true, but I wouldn't consider it a driver. About 3, it does not sound so different from providing a list of names that people should conform to. About 5, it is a hack that tries to overcome the problems introduced by 4.

ziotom78 commented 4 years ago

Thanks for having opened this, @dpole! Before discussing your list, it's better to state a fact I'll rely on below: the purpose of the Detector class is to provide a one-to-one match with the (detector) info quantity in the IMO.

Here are my comments about the points you listed above:

  1. Using classes in specific places can bring several advantages, which are relevant even outside the C++ community (in fact, my reference to C++ was just a joke). For instance, the Django framework forces you to build your data model using objects. This is relevant, because it's the way I implemented entities, quantities, data files, etc. in instrumentdb, e.g.:

     # Abridged
     class DataFile(models.Model):
         uuid = models.UUIDField(...)
         name = models.CharField(...)
         upload_date = models.DateTimeField(...)
         metadata = models.TextField(...)
         file_data = models.FileField(...)
         quantity = models.ForeignKey(...)
         spec_version = models.CharField(...)
         dependencies = models.ManyToManyField(...)
         plot_file = models.FileField(...)
         ...

    Django is one of the most famous Python frameworks in the world, so I believe that we can consider the usage of classes in Python to be fully «pythonic». This being said, classes have several advantages:

    1. The type of a parameter is self-evident in declarations; compare

        def get_cosmic_ray_hit_ratio(det: Detector):
           ...

      with

        # Which fields am I expected to pass here?!?
        def get_cosmic_ray_hit_ratio(det: Dict[str, Any]):
           ...
    2. The type of the class' fields is explicit:

        class Detector:
            fknee_mhz: float
            pixel: int
            ...

      This is great for self-documentation (modern editors are even able to provide type hints while you're writing code), and it allows code checkers like mypy, pyre-check, pyright, or pytype to spot type errors. Example:

        # True story: I really made this mistake a few weeks ago
        det = Detector.from_imo(imo, ...)
        det.pixel = "lft-002"  # Force a different pixel

      Using mypy, you get:

        $ mypy myscript.py
        myscript.py:7: error: Incompatible types in assignment (expression has type "str", variable has type "int")

      We could even enforce correct types if we employ the attrs library; see the page about validators, which shows that you can raise a runtime error if you call Detector(pixel="a") and the type of pixel is an integer.
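
      A minimal sketch of how this could look with attrs (the field set here is illustrative, not the actual Detector definition):

        import attr

        @attr.s
        class Detector:
            # The validator raises a TypeError at runtime on a wrong type
            pixel = attr.ib(default=0, validator=attr.validators.instance_of(int))

        Detector(pixel="a")  # TypeError: 'pixel' must be <class 'int'>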

    3. The __init__ constructor can initialize each value to a sensible default and check for inconsistencies and wrong numbers (e.g., an instrument name that does not match lft, mft, or hft). The current implementation of Detector.__init__ already does something of this sort: if you do not provide a quaternion, it will use the tuple 0, 0, 0, 1, which corresponds to a detector aligned with the boresight. Another nice addition would be to automatically normalize the quaternion to enforce the requirement that it always represents a rotation, etc.
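
      A minimal sketch of this kind of constructor, assuming the quaternion is kept in a NumPy array (the real class has many more fields):

        import numpy as np

        class Detector:
            def __init__(self, name="", quat=None):
                self.name = name
                if quat is None:
                    # Default: a detector aligned with the boresight
                    self.quat = np.array([0.0, 0.0, 0.0, 1.0])
                else:
                    # Normalize, so that quat always encodes a rotation
                    q = np.asarray(quat, dtype=float)
                    self.quat = q / np.linalg.norm(q)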

    4. In some cases it makes sense that functions that operate on members of the class be turned into methods of that class. Consider this:

        def get_cosmic_ray_hit_ratio(det: Detector) -> float:
            ...

      versus this:

        class Detector:
            ....
            def get_cosmic_ray_hit_ratio(self) -> float:
                ...

      The biggest advantage is discoverability: if I don't remember how to compute the ratio of cosmic ray hits while I'm coding in a Jupyter notebook, I can just write

        det = Detector()
        det.█   # The cursor is here

      and pressing TAB I'll see what methods and fields are available, instead of digging through all the pages of documentation.

      Thus, I believe that classes bring several usability advantages over dictionaries.

  2. It depends on the workflow, but I agree that it's not a killer feature

  3. For me the feature is important. First, it enables auto-completion in modern editors:

     det = Detector(█ # I want to specify the knee frequency: what should I write?

    If I don't remember the name of the parameter (is it fknee_mhz, frequency_knee_mhz, or fknee_mHz?), pressing TAB will present the alternatives and complete the name for me. OTOH, if I rewrite the statement above using dictionaries, I either need to read the documentation or run the code and wait for runtime errors later (or wrong values in the output!):

     det = { "fknee_mHz": 10.0 }  # Oh no, I mispelled it!

    Moreover, code checkers have a chance to spot bugs here as well:

     # Another true story from not much time ago...
     # I keep doing the same errors over and over again
     det = Detector.from_imo(imo, "...")
     det.fknee_mHz *= 2  # Typo here!

    Using a type checker like mypy causes the following error:

     $ mypy myscript.py
     myscript.py:7: error: "Detector" has no attribute "fknee_mHz"; maybe "fknee_mhz"?

    None of these features is achievable through a list of names in the documentation.

  4. Nothing to add

  5. It's not true that everything is defaulted to None: for instance, the quat field is set to the unit quaternion (0, 0, 0, 1), which represents the boresight (see above). I believe that the ability to specify different defaults and have complex initialization code (e.g., quaternion normalization) is not a «hack» but a very handy feature. For instance, we might decide to automatically initialize the wafer name and pixel number whenever the detector's name is passed explicitly and matches some regexp:

    det = litebird_sim.Detector("L00_003_AQ_040T")
    print(det.wafer)  # Prints "L00"
    print(det.pixel)  # Prints 3

    There is no clean way to do this using plain dictionaries, because dictionaries do not have constructors. To me it's important to provide a handy way to properly initialize Detector instances, as this is something that 99% of the users will do in all their scripts.
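
    A minimal sketch of the name parsing, assuming a hypothetical naming scheme in which the wafer code and a three-digit pixel number lead the name:

     import re

     # Hypothetical pattern, modeled on names like "L00_003_AQ_040T"
     DETECTOR_NAME_RE = re.compile(r"^(?P<wafer>[A-Z]\d{2})_(?P<pixel>\d{3})_")

     def parse_detector_name(name):
         match = DETECTOR_NAME_RE.match(name)
         if match is None:
             return None, None
         return match.group("wafer"), int(match.group("pixel"))

     print(parse_detector_name("L00_003_AQ_040T"))  # ('L00', 3)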

  6. There is another point that is missing in your list: I believe that using dictionaries is not pythonic in this case. IMO, dictionaries should be used whenever the ability to add/remove arbitrary keys at runtime is important, because you cannot reasonably predict what is going to be put inside your variable (e.g., you're reading a JSON file provided by the user). But the fields in a Detector class are already settled, both in their names and types: they are all and only the metadata in the (detector) info quantity, which are listed in the IMO Format Specification Document (i.e., outside our framework) and are going to be put in the IMO Specification Document as well. As I said at the top, the purpose of the Detector class is to provide a one-to-one match with the IMO quantity, and thus we will never add/remove fields at runtime. (This is one of the reasons why Django's ORM uses classes instead of dictionaries to model database entities: they are not expected to change at runtime, and if you want to add/remove fields you must stop the program, run a migration, and start the server again.)

Here is my take regarding the list of complications (sorry, they should have been listed with letters instead of numbers):

  1. The fields in Detector closely match the metadata in the info quantity in the IMO database. Once things are streamlined, to add a new field to an info quantity you will first need to update the Format Specification Document, have it agreed upon by the IMO team, and then update the instrument database. I don't expect we'll have to do this often. Moreover, you cannot store more than a few kB of data in the metadata field of data files in the IMO database, because it's a string field in a database table. There aren't going to be «a lot» of parameters to be included in the Detector class: an upper bound might be 20–30. (Other stuff will go in dedicated IMO quantities, like grasp_beam or bandshape, which will probably deserve their own Python classes, with a classmethod constructor named from_imo like for Detector; a sketch of such a constructor is at the end of this point.)

    Since Detector will always have a low memory footprint, I see no significant objection to carrying around fields that the user does not plan to use at the moment.

    Finally, if we do not adopt this idea, how are users supposed to read detectors' data from the info quantities in the IMO? Currently, the IMO database is an all-or-nothing thing: once you query the metadata field of an info quantity, you're reading all the metadata into memory. There is no way to avoid this, because it's related to the way tables are initialized and accessed by SQL commands. The only alternatives I can put together do not look pleasant at all:

     # First take: ugly!
     det_name = "L00_003_AQ_040T"
     imo_path = f"/releases/v1.0/satellite/LFT/L1-040/{det_name}/info"
     det = {
         "pixel": imo.query_metadatum(imo, imo_path, "pixel"),
         "wafer": imo.query_metadatum(imo, imo_path, "wafer"),
         "net_ukrts": imo.query_metadatum(imo, imo_path, "net_ukrts"),
     }
    
     # Second take: better, but still complicated (you need to
     # remember how to spell "net_ukrts") and not efficient:
     # the function "build_from_imo_metadata" queries *all* the
     # metadata and then filters out those that are not listed here.
     imo_path = f"/releases/v1.0/satellite/LFT/L1-040/{det_name}/info"
     det = build_from_imo_metadata(imo, imo_path, fields=[
         "pixel",
         "wafer",
         "net_ukrts",
     ]) # This should return a dictionary

    I find any of the following much more pleasant:

      # This works with the code in `master` (returning a Detector object)…
      det = Detector.from_imo(
          imo,
          "/releases/v1.0/satellite/LFT/L1-040/L00_003_AQ_040T/info",
      )
     
      # …and this works too (returning a dictionary)
      det = imo.query("/releases/v1.0/satellite/LFT/L1-040/L00_003_AQ_040T/info")["metadata"]

    This is what 100% of the users will do whenever they want to read a (detector) info quantity from the IMO. Yet, in both cases they will be reading all the fields, even if they are not going to use all of them.
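
    A sketch of what such a from_imo constructor might look like, assuming (as in the second snippet above) that imo.query() returns a record whose "metadata" field is a dictionary:

      class Detector:
          ...

          @classmethod
          def from_imo(cls, imo, path):
              # Copy every metadatum into the corresponding field
              det = cls()
              for key, value in imo.query(path)["metadata"].items():
                  setattr(det, key, value)
              return det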

  2. I think the opposite, but everybody is entitled to have their own opinions :-)

  3. True, but… «I wouldn't consider this a driver». How many times does a user need to do this? It seems to me that the only case where this is relevant is the code that (re-)distributes detectors among MPI processes in the Observation class, which might benefit from an easy way to iterate over the keys. I do not expect end users to rely on this very often, but I might be missing some cases. Do you have something specific in mind?

    In any case, if we use dataclasses, it will be easy to implement this thanks to the asdict function:

      # Assuming that Detector is a dataclass...
      from dataclasses import asdict

      det = Detector(...)
      for key, val in asdict(det).items():
          ...
  4. Again, if we use dataclasses, this feature will be available to the Detector class as well. Until then, you can use dir(det) to get the same result. As a plus, you get the names of the methods you can call on it as well. (See my point above about the easier discoverability of methods wrt functions.)

  5. Same as for point 3 above.

dpole commented 4 years ago

Thanks Maurizio. I think we agree on most of the aspects (including the limitations of dictionaries). Let me try to be more precise.

From the point of view of the IMO and interactive usage I would have basically the same ideas and priorities. As you know, I've been working mostly on the Observation class and, because of its very different requirements, I tend to prioritize other aspects--the Observation objects are probably those that will interface with performance-critical sections of the code (inside or outside litebird_sim).

Let me rephrase, in this perspective (which is the one I care about and the one that triggered this issue), at least my main concern: carrying around unused data. Either we add some filtering step (and we have to decide where) or the full set of metadata percolates all the way to Observation. I find the latter problematic. Forcing Observation to carry all the metadata seems a strong and unnecessary constraint. Moreover, in principle, this class has to be ready for heavy detector data (not just metadata). Either we treat them all alike--in which case tens of metadata entries are already a non-negligible number--or we have to distinguish between heavy data and light metadata. This adds a layer of complexity that I'd rather avoid.

What do you think? If the discussion gets too specific to Observation, it's probably better to continue it in #42.

ziotom78 commented 4 years ago

In fact, I suspected that your point was mainly concerned with the Observation class. I am still trying to understand the various ways to use the matrix data layout; so far, I have been able to think of the following situations:

  1. Simulating ADC quantization can be easily implemented using broadcasting operations on the matrix of samples in an Observation. Thus, **the size in bits of an ADC sample** might be handy when stored in a vector.
  2. You generate white noise for all the detectors using np.random.randn() and then scale each of them using the white noise level; this makes **a vector of white noise levels** handy, because it means that you can do the scaling with one line of code (see the sketch after this list).
  3. Although I cannot think of any practical example, probably **bandwidths and band centers** could have some use if stored in a vector.
  4. You introduce noise correlations among the set of detectors by means of a covariance matrix, which specifies how much two detectors i and j have correlated samples. Using the matrix data layout is extremely efficient in this case. However, the covariance matrix is not going to be stored in a Detector class or in a dictionary and passed around, because it's not specific to just one detector. It probably needs to be computed using some algorithm or be read from a dedicated IMO object, so it must be handled differently from the other quantities in this list.
  5. It's not obvious to me how to use matrix operations for 1/f noise generation: either you generate it using a filter in the Fourier domain (like TOAST does), or you use a Markov chain (like we did in Planck). Either way, I fail to see how having vectors containing the fknee_hz and alpha parameters could give any advantage here.
  6. Applying a gain (i.e., a multiplicative factor) to the TOD can be easily implemented using matrix operations that work on the whole TOD. However, the gains are going to be taken neither from the IMO nor from a dictionary: they are likely to be simulated on the fly, like 1/f noise, and thus what I wrote in the previous point applies here as well.
  7. Gain reconstruction (i.e., estimating the gain, as opposed to simulating it as in the previous point) is usually carried out using conjugate gradient methods, which perform complex operations on the TOD but do not require complex inputs: just the timescale of gain variations and the baseline for the destriper, and there is little advantage in keeping these parameters in NumPy vectors. (And I believe that they will always be the same for any detector regardless of the frequency, so they should probably be passed as parameters in a TOML file.)
  8. I see no point in having the parameters relevant for cosmic ray calculations (e.g., the number of the pixel, the name of the wafer) stored in vectors, as the kind of operations done on them do not permit trivial broadcasting operations.
  9. Beam convolutions must of course be carried out at the level of the single detector; when bandshapes are used in conjunction with them (e.g., to study the impact of sidelobes at different frequencies within the band), they too do not gain any advantage from being stored in vectors.
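
Here is the sketch referenced in situation 2: a minimal example of the one-line scaling I have in mind (shapes and values are illustrative):

    import numpy as np

    n_det, n_samples = 4, 1000
    # One white-noise level per detector (illustrative values)
    wn_levels = np.array([1.0, 1.2, 0.9, 1.1])

    # Broadcasting scales each detector's row by its own level in one line
    noise = np.random.randn(n_det, n_samples) * wn_levels[:, None]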

So, of all the parameters to be stored either in Detector or a dictionary, I see that only a handful of them might be useful when saved in vectors (highlighted in bold in the list above). I see the following alternatives:

  1. We continue studying a general way to propagate arbitrary quantities across all the observations, making sure that when observation matrices are reshaped these attributes are reshaped as well.

    Pros:

    • It lets the user apply broadcasting operations whenever they think it's suitable, because everything is already in place.

    Cons:

    • Not trivial at all to account for reshapes of the TOD matrix, according to what you say (and I fully believe you!)

  2. We apply point 1. only to those quantities that we feel are worth the trouble (the items in bold in my list above).

    Pros:

    • Probably simpler than solution 1.

    Cons:

    • Leads to ugly code and repetitions

    • Harder to maintain: whenever a new quantity is deemed important to be vectorized, we need to alter the Observation class.

  3. We do not propagate metadata across observations, forcing the user to use the syntax

     # If we use a `Detector` class
     wn_vector = np.array([x.parameter for x in detector_list])
     # If we use dictionaries
     wn_vector = np.array([x["parameter"] for x in detector_list])

    whenever a NumPy array of these parameters is needed. We might even provide some helper methods:

     class Observation:
         ...
    
         def get_det_param_vector(self, param):
              return np.array([getattr(det, param) for det in self.detector_list])
     
          # Nice to have, but not required
          def get_wn_vector(self):
              return self.get_det_param_vector("net_ukrts")
    
         # Define similar methods for other parameters, e.g., "adc_num_of_bits"?

    Pros:

    • Very easy to implement

    • No reshaping is needed, as these methods are expected to be called immediately before you use their return value

    Cons:

    • List comprehension can be slow when simulating many detectors.

    • If one needs the same quantity in a vector many times during the execution of the code, this can lead to inefficiencies, because the vector is computed over and over again.

As I already said a while ago, my personal preference goes to the third option. The list comprehension is going to be applied to few detectors (the maximum number within one frequency band is a few hundred at most), and running the implementation of get_wn_vector I proposed above on an array of 500 detectors only takes ~30 µs.

Moreover, vectorized counterparts of these parameters are often going to be used just once in typical scripts: for instance, the number of bits in the ADC is used only in the few lines of code that simulate digital quantization:

# Scale each detector's samples by 2**nbits and round to integers
maxval = 2 ** observation.get_det_param_vector("adc_num_of_bits")
quantized_values = np.array(
    np.round(observation.tod_matrix * maxval[:, None]),
    dtype='int',
)
ziotom78 commented 3 years ago

This issue is very old, and the code has evolved since the last comment here: now we use dataclasses, which make the DetectorInfo class (the new name for Detector) convertible into a dictionary using asdict. Shall we close this?
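
For reference, a minimal sketch of the dataclass-based approach (the field names here are illustrative, not the full set):

    from dataclasses import dataclass, asdict

    @dataclass
    class DetectorInfo:
        name: str = ""
        fknee_mhz: float = 0.0

    det = DetectorInfo(name="L00_003_AQ_040T")
    print(asdict(det))  # {'name': 'L00_003_AQ_040T', 'fknee_mhz': 0.0}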

dpole commented 3 years ago

Definitely, closing now