legend-exp / legend-dataflow

LEGEND data flow management

Make executables easier to use for non-experts #31

Open iguinn opened 10 months ago

iguinn commented 10 months ago

Right now, it's very difficult for non-experts to figure out how to use the executables to process L200 data directly. This can be important for several reasons, including testing changes to the analysis and processing test_data that would otherwise not get processed. These executables were very clearly written for use with snakemake, and it's very difficult to figure out what options to pass to get them to work on the command line; with a few changes, I think we could have them working both with snakemake and for individual users. Some changes that might help:

1) Better help strings for the argparse options. For example, in build_raw, two required arguments are:

argparser.add_argument("--datatype", help="Datatype", type=str, required=True)
argparser.add_argument("--timestamp", help="Timestamp", type=str, required=True)

It's unclear what datatype is supposed to be (we were able to figure it out eventually by digging through the code); timestamp is a bit more intuitive, but when things are breaking, it's nice to have a clear help string so we know we passed the right thing. This is a problem with almost all of the arguments!
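For illustration, a hedged sketch of what more descriptive help strings could look like; the set of allowed datatype values and the timestamp format shown here are assumptions, not taken from the code:

import argparse

argparser = argparse.ArgumentParser()
argparser.add_argument(
    "--datatype",
    help="data-taking mode of the run, e.g. 'cal' or 'phy' (assumed values)",
    type=str,
    required=True,
)
argparser.add_argument(
    "--timestamp",
    help="file-key timestamp, e.g. 20230101T000000Z (assumed format)",
    type=str,
    required=True,
)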

2) Make more of the argparse options optional/automatic. For example:

argparser.add_argument("--configs", help="config file", type=str)
argparser.add_argument("--chan_maps", help="chan map", type=str)

(Note that these are not marked as required, even though the code breaks without them; the help strings are also not helpful.) These could very easily be automated by calling Metadata without a directory, so that it downloads the metadata from GitHub, and then calling meta.hardware.config to get them. If they had been automated, it would have saved a lot of time spent figuring out what exactly to pass.
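A rough sketch of that fallback, assuming pylegendmeta's LegendMetadata class (which clones legend-metadata from GitHub when constructed without a local path); the attribute path used to reach the hardware configuration below is an assumption for illustration:

import argparse

from legendmeta import LegendMetadata

argparser = argparse.ArgumentParser()
argparser.add_argument("--configs", help="path to the processing config files", type=str)
argparser.add_argument("--chan_maps", help="path to the channel map files", type=str)
args = argparser.parse_args()

if args.chan_maps is None:
    # no path given: clone legend-metadata from GitHub and navigate to the
    # hardware configuration; this attribute path is an assumption
    meta = LegendMetadata()
    args.chan_maps = meta.hardware.configuration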

These are just some examples; we really need to figure out which python executables are likely to be needed by the larger collaboration, and make sure that they are easy to understand and get running outside of snakemake. In the past we've argued that we shouldn't expect more than a few people to learn snakemake, but if we are going to take that approach, we need to be able to run this code without it!

ggmarshall commented 10 months ago

Sure, I can definitely make the help strings more descriptive.

Regarding test data: if it is needed it can easily be processed, we just need to be informed, as it is not in the automatic flow due to it possibly having different formats or just being for DAQ tests etc. (if it is important LEGEND data it also should not be tst but should have its own datatype).

I'm also a little unclear on the exact use cases here: the tier processing can just be handled through the pygama CLI, so I'm not sure the full dataflow script is even needed. And the par scripts have so many inputs that I'm not sure they are ever going to be intuitive to use (some take every file in the calibration, for example).

Regarding having automatic options, I am not a fan of this. The dataflow is designed to be run inside a production cycle, and having scripts that clone/access data outside of it is not part of the design. This has the potential to introduce very nasty, hard-to-find errors in our processing, so I would really avoid it. Maybe this can be discussed more in a dev call so we can understand what users need and try to come up with a solution, but my feeling is that some kind of bare-bones, stripped-down "sandbox" dataflow would be a better solution (although this would take a lot of work to design/implement and so is not achievable short term).

iguinn commented 10 months ago

My understanding is that running build_raw for L200 doesn't work without being tied into legend-metadata, which means that it has to be run from legend-dataflow. If that's not the case, then this may be a bit of a moot issue, but we should still make sure that it's documented how to use build_raw from daq2raw... And even if it is the case, I suppose we don't necessarily need to have people run these python executables (but again, people need to know what to do).

I think this extends beyond build_raw, too; build_dsp requires the database for building the db parameters, and while I'm less familiar with build_hit, it also requires pretty deep integration with the metadata.

iguinn commented 10 months ago

This may be a moot issue soon based on our discussion today, but I have thought of another way to get around the issue you raised. We could add a --auto option that must be explicitly invoked and that automatically fills in any of the options that aren't provided. This way the user must be intentional about wanting this behavior, but it still provides a way to avoid some of the unexpected consequences you are worried about. I can probably implement an example pretty quickly for one of the executables if you are interested.
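A minimal sketch of that idea, using only the standard library; the lookup helpers are hypothetical placeholders for whatever metadata resolution we settle on:

import argparse


def lookup_configs():
    # hypothetical helper: resolve the default config location
    raise NotImplementedError


def lookup_chan_maps():
    # hypothetical helper: resolve the default channel maps
    raise NotImplementedError


argparser = argparse.ArgumentParser()
argparser.add_argument("--configs", help="path to the processing config files", type=str)
argparser.add_argument("--chan_maps", help="path to the channel map files", type=str)
argparser.add_argument(
    "--auto",
    action="store_true",
    help="automatically fill in any options that are not provided",
)
args = argparser.parse_args()

if args.auto:
    # fill in only the values the user did not supply explicitly
    if args.configs is None:
        args.configs = lookup_configs()
    if args.chan_maps is None:
        args.chan_maps = lookup_chan_maps()
elif args.configs is None or args.chan_maps is None:
    argparser.error("--configs and --chan_maps are required unless --auto is given")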

iguinn commented 1 month ago

So, as mentioned on our most recent SW dev call this is something I'd like to revisit. I can think of a couple ways we could approach this:

  1. Have a python function that requires very specific inputs and works the way the executables work now; import that into the snakemake file and call it directly from snakemake. Then develop an if __main__ interface that uses argparse to fill out those arguments.
  2. Use argparse argument groups, so that you would call, e.g., l200_build_dsp snakemake [long list of args] or l200_build_dsp prodenv [short list of args that get supplemented by the $PRODENV variable, and maybe some other options]. These would be exclusive argument groups, to avoid any mixing and matching that could cause unexpected behavior in snakemake (a sketch follows this list).
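A sketch of option 2 using argparse sub-commands, which give the exclusive-mode behavior described above; the specific arguments listed are illustrative, not the real interface of l200_build_dsp:

import argparse
import os

argparser = argparse.ArgumentParser(prog="l200_build_dsp")
subparsers = argparser.add_subparsers(dest="mode", required=True)

# explicit interface, as invoked by snakemake: everything must be passed
snake = subparsers.add_parser("snakemake", help="explicit interface for snakemake")
snake.add_argument("--configs", required=True, help="path to the processing config files")
snake.add_argument("--chan_maps", required=True, help="path to the channel map files")
# ...plus the rest of the long argument list

# short interface for interactive use: missing values come from $PRODENV
prod = subparsers.add_parser("prodenv", help="short interface using $PRODENV")
prod.add_argument("--timestamp", required=True, help="file-key timestamp")

args = argparser.parse_args()
if args.mode == "prodenv":
    # derive the remaining arguments from the production environment
    prodenv = os.environ["PRODENV"]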

Between these two options, I think I favor option 2. Option 1 is a bit simpler, but there seem to be advantages to having snakemake work through the shell (e.g. you can copy-paste a command that fails for debugging).

ggmarshall commented 1 month ago

I agree option 2 would be better for debugging