Re-think sim_type - GitHub Issues

RasmusOrsoe commented 1 year ago

Is your feature request related to a problem? Please describe.

In I3TruthExtractor we rely on the variable sim_type to modify the extractor's behavior. The extractor infers the variable from the input file name in a very crude way:

def _find_data_type(self, mc: bool, input_file: str) -> str:
    """Determine the data type.

    Args:
        mc: Whether `input_file` is Monte Carlo simulation.
        input_file: Path to I3-file.

    Returns:
        The simulation/data type.
    """
    # @TODO: Rewrite to automatically infer `mc` from `input_file`?
    if not mc:
        sim_type = "data"
    else:
        sim_type = "NuGen"
    if "muon" in input_file:
        sim_type = "muongun"
    if "corsika" in input_file:
        sim_type = "corsika"
    if "genie" in input_file or "nu" in input_file.lower():
        sim_type = "genie"
    if "noise" in input_file:
        sim_type = "noise"
    if "L2" in input_file:  # not robust
        sim_type = "dbang"
    if sim_type == "lol":
        self.info("SIM TYPE NOT FOUND!")
    return sim_type

A more elegant solution is needed.
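
To make the fragility concrete, here is a minimal sketch: a standalone copy of the heuristic quoted above, without `self` and without the `"lol"` check, which can never trigger because no branch assigns that value. The file names are hypothetical, invented only to exercise the branches, and are not taken from any real dataset.

```python
def find_data_type(mc: bool, input_file: str) -> str:
    """Standalone copy of the quoted heuristic, for illustration only."""
    sim_type = "NuGen" if mc else "data"
    if "muon" in input_file:
        sim_type = "muongun"
    if "corsika" in input_file:
        sim_type = "corsika"
    if "genie" in input_file or "nu" in input_file.lower():
        sim_type = "genie"
    if "noise" in input_file:
        sim_type = "noise"
    if "L2" in input_file:
        sim_type = "dbang"
    return sim_type


# Hypothetical file names (assumptions, not real IceCube naming conventions).
# "nu" matches "nugen", so the NuGen default is overwritten with "genie".
print(find_data_type(True, "nugen_numu_000123.i3"))   # genie
# The "muon" check is case-sensitive, so "MuonGun" falls through to the default.
print(find_data_type(True, "MuonGun_000001.i3"))      # NuGen
# mc=False should mean "data", but the "L2" check overrides it.
print(find_data_type(False, "Run00123456_L2.i3"))     # dbang
```

Because every `if` can overwrite the previous result, the outcome depends on the order of the checks and on letter case rather than on what the file actually contains.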

Describe the solution you'd like

We should try to come up with a way to either remove the need for the sim_type variable, or find a more robust way to infer it.

Additional context

To my knowledge, i3 files often don't contain identifying markers that we can use for this.

MortenHolmRep commented 1 year ago

If I understand correctly, you want pattern recognition (for example, regex) to analyse the name of an i3 file (or its keys) to infer the sim type, but without saving it as a column in the database? As I recall, the naming convention for i3 files can vary a lot, so I am unsure whether this is a well-posed problem: I don't know that this information is always included in the name, or that we can account for every naming variant.
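
For reference, the regex idea raised here could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `infer_sim_type`, the rule table, and the file names are hypothetical and are not part of graphnet. The patterns are case-insensitive, the first match wins, and the function returns `None` instead of guessing when nothing matches.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical ordered rules: (compiled pattern, label). First match wins.
_RULES = [
    (re.compile(r"noise", re.IGNORECASE), "noise"),
    (re.compile(r"muongun", re.IGNORECASE), "muongun"),
    (re.compile(r"corsika", re.IGNORECASE), "corsika"),
    (re.compile(r"genie", re.IGNORECASE), "genie"),
    (re.compile(r"nugen", re.IGNORECASE), "NuGen"),
]


def infer_sim_type(input_file: str) -> Optional[str]:
    """Infer a sim type from the file name, or return None if unknown."""
    # Match against the file name only, so directory names cannot interfere.
    name = Path(input_file).name
    for pattern, label in _RULES:
        if pattern.search(name):
            return label
    return None


# Hypothetical file names, for illustration only.
print(infer_sim_type("/data/corsika_sets/NuGen_numu_000123.i3"))  # NuGen
print(infer_sim_type("MuonGun_000001.i3"))                        # muongun
print(infer_sim_type("Run00123456_L2.i3"))                        # None
```

Returning `None` makes the "not found" case explicit, but it does not address the underlying concern: if the file name does not carry the information, no pattern can recover it.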

RasmusOrsoe commented 1 year ago

Hey @MortenHolmRep! I completely agree that using file names for this is sub-optimal. We could parse the file names more elegantly, but that would not change the fact that we apparently cannot rely on file names for this.

I think the path forward would be to triple-check that there are indeed no consistent frame keys we can rely on instead. If there are none, we could attempt to rewrite the extractor with try/except to catch the cases that the sim_type variable toggles between. From memory, I think only the data/mc/noise labels are important; the rest can be removed.
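
A rough sketch of what the try/except idea could look like, reduced to the three labels mentioned above. Everything here is an assumption made for illustration: a plain mapping stands in for an I3 frame, the key name in `MC_TREE_KEY` is a placeholder for whichever key turns out to be consistently present, and the rule that an empty MC tree means noise is hypothetical. Whether any frame key is reliable enough for this is exactly what still needs to be checked.

```python
from typing import Any, Mapping

# Placeholder key name (assumption); replace with a key verified to be
# consistently present in simulation frames and absent in real data.
MC_TREE_KEY = "I3MCTree"


def classify_frame(frame: Mapping[str, Any]) -> str:
    """Classify a frame-like mapping as "data", "mc" or "noise"."""
    try:
        mc_tree = frame[MC_TREE_KEY]
    except KeyError:
        # No truth information at all: treat the frame as real data.
        return "data"
    # Hypothetical rule: a truth tree without particles means noise-only.
    if len(mc_tree) == 0:
        return "noise"
    return "mc"


# Plain dicts standing in for frames, for illustration only.
print(classify_frame({}))                            # data
print(classify_frame({MC_TREE_KEY: []}))             # noise
print(classify_frame({MC_TREE_KEY: ["primary"]}))    # mc
```

With this shape the label is derived from the frame contents at extraction time, so no file-name parsing and no separate `mc` flag would be needed.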

graphnet-team / graphnet

Re-think sim_type #519