QLeB opened this issue 2 years ago (status: Open)
I confirm that using `dataset_refs = set(self.registry.queryDatasets(pattern, collections=collection))`
in https://github.com/lsst-dm/ProdStat/blob/main/python/lsst/ProdStat/GetButlerStat.py#L245 leads to a consistent number of tasks. For example:
campaign │ 151737
QuantumGraph contains 151737 quanta for 42 tasks
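For reference, here is that change in isolation (a minimal sketch; the repo path, dataset-type pattern, and collection name are placeholders, not the actual GetButlerStat values):

```python
# Sketch only: stand-alone version of the line quoted above.
from lsst.daf.butler import Butler

butler = Butler("/repo/main")          # placeholder repo path
pattern = "*_metadata"                 # placeholder dataset-type pattern
collection = "u/someone/step1"         # placeholder collection name

# queryDatasets can return the same DatasetRef more than once, so wrap it
# in set() to de-duplicate before counting tasks.
dataset_refs = set(butler.registry.queryDatasets(pattern, collections=collection))
print(f"{len(dataset_refs)} unique dataset refs")
```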
Sorry, with this modification the number of tasks is correct, but all the other results are wrong because the tasks are sorted by type of task. If someone has a suggestion for this, it would be appreciated.
@QLeB note that the development of this package is happening in lsst-dm/prodstatus#1
I have put the set() in `dataset_refs = set(self.registry.queryDatasets(pattern, collections=collection))` and have run the program on step1 in the test-med-1 data with and without set(). The results are consistent, so I will leave the set() in the program. Please check out the latest version of the program, as it was rebuilt recently.
Thanks @timj for pointing me to the right repository! I was missing a lot of recent developments.
Thanks @kuropat for having a look. I reinstalled everything from the "tickets/PREOPS-1041" branch and ran some tests on a step3 run.
The results seem more consistent with the set(), but I get a lot of "Task X has no metadata" messages, and thousands of tasks are missing. This is due to the data_id values, which are wrong; for example, for an AM3 task it returns
{instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 4033, patch: 12, visit: 656523}
which makes no sense since this is at the tract level (and it gives a KeyError because the band is missing).
So it seems that the deduplication / reordering of the datasets is causing the issue, but I don't see why...
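If the stats code pairs query results with tasks by position, the undefined iteration order of a set() could explain the mismatched data IDs. A hedged sketch of an order-independent way to walk the deduplicated refs (the function and variable names here are mine, not GetButlerStat's):

```python
# Sketch only: group refs by dataset-type name so nothing depends on the
# (undefined) iteration order of a set(); each ref keeps its own data ID.
from collections import defaultdict

def group_refs_by_type(dataset_refs):
    """Map dataset-type name -> list of DatasetRefs."""
    refs_by_type = defaultdict(list)
    for ref in dataset_refs:
        refs_by_type[ref.datasetType.name].append(ref)
    return refs_by_type

# Hypothetical usage with the set() built from queryDatasets:
# for name, refs in sorted(group_refs_by_type(dataset_refs).items()):
#     # Only use the dimensions each dataset type actually has; a tract-level
#     # type has no 'band' or 'visit', which would otherwise give the KeyError above.
#     print(name, len(refs), refs[0].dataId)
```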
In the results produced by 'get-butler-stat' I'm seeing more tasks than have actually run. I suspect this is related to the use of queryDatasets, which does not de-duplicate results (see https://pipelines.lsst.io/v/weekly/middleware/faq.html#why-do-queries-return-duplicate-results). set() could be used to deduplicate.
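A minimal sketch of how the duplication could be checked before adding the set() (placeholder repo path, pattern, and collection, as above):

```python
# Sketch only: count how many of the returned refs are duplicates.
from collections import Counter
from lsst.daf.butler import Butler

butler = Butler("/repo/main")          # placeholder repo path
refs = list(butler.registry.queryDatasets("*_metadata",            # placeholder pattern
                                          collections="u/someone/step1"))
duplicates = sum(count - 1 for count in Counter(refs).values())
print(f"{len(refs)} refs returned, {len(set(refs))} unique, {duplicates} duplicates")
```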