
lsst-dm / ProdStat

Software to collect DP02 production statistics.

GNU General Public License v3.0


Duplicate tasks? #2

Open QLeB opened 2 years ago

QLeB commented 2 years ago

In the results produced by 'get-butler-stat' I'm seeing more tasks than have actually run. I suspect this is related to the use of queryDatasets, which does not de-duplicate results (see https://pipelines.lsst.io/v/weekly/middleware/faq.html#why-do-queries-return-duplicate-results). set() could be used to deduplicate.

QLeB commented 2 years ago

I confirm that using dataset_refs = set(self.registry.queryDatasets(pattern, collections=collection)) in https://github.com/lsst-dm/ProdStat/blob/main/python/lsst/ProdStat/GetButlerStat.py#L245 leads to a consistent number of tasks. For example:

campaign │ 151737

QuantumGraph contains 151737 quanta for 42 tasks

QLeB commented 2 years ago

Sorry, with this modification the number of tasks is correct but all other results are wrong, because the tasks are sorted by type of task. If someone has a suggestion for this, that would be appreciated.
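One possible way around this, assuming the downstream code relies on the order in which queryDatasets returns its results, would be to deduplicate with dict.fromkeys() instead of set(), since a dict keeps first-seen insertion order. The sketch below uses a hypothetical FakeRef class as a stand-in for the real dataset references and has not been tested against the butler; it only assumes that the references are hashable, which the set() call already requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for a hashable dataset reference."""

    dataset_type: str
    data_id: tuple


def dedupe_keep_order(refs):
    """Drop duplicates while keeping the first-seen order of the input."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(refs))


# Made-up query results containing one exact duplicate.
query_results = [
    FakeRef("isr_metadata", (("visit", 1), ("detector", 0))),
    FakeRef("isr_metadata", (("visit", 1), ("detector", 0))),  # duplicate
    FakeRef("calibrate_metadata", (("visit", 1), ("detector", 0))),
    FakeRef("isr_metadata", (("visit", 2), ("detector", 0))),
]

unique_refs = dedupe_keep_order(query_results)
print(len(unique_refs))
```

In the real code this would amount to replacing set(...) with list(dict.fromkeys(...)) around the query, if the ordering turns out to matter.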

timj commented 2 years ago

@QLeB note that the development of this package is happening in lsst-dm/prodstatus#1

kuropat commented 2 years ago

I have put set() in dataset_refs = set(self.registry.queryDatasets(pattern, collections=collection)) and have run the program on step1 of the test-med-1 data with and without set(). The results are consistent, so I will leave the set() in the program. Please check out the latest version of the program, as it was rebuilt recently.

QLeB commented 2 years ago

Thanks @timj for pointing me to the right repository! I was missing a lot of recent developments.

Thanks @kuropat for having a look. I reinstalled everything from the "tickets/PREOPS-1041" branch and ran some tests with a step3 run. Results seem more consistent with the set(), but I get a lot of "Task X has no metadata" messages, and thousands of tasks are missing. This is due to the data_id values, which are wrong: for example, for an AM3 task it returns {instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 4033, patch: 12, visit: 656523}, which makes no sense since this task runs at the tract level (and it gives a KeyError because the band is missing).

So it seems that the deduplication / reordering of the datasets is causing an issue, but I don't see why...
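To test whether ordering is really the culprit, one option would be to stop relying on the position of each reference and instead key every data ID by its own dataset type while deduplicating. This is only a sketch under assumptions: the (dataset type name, data ID dict) pairs and the two type names below are made up for illustration and are not the actual objects or names used by the program.

```python
from collections import defaultdict


def group_by_dataset_type(refs):
    """Deduplicate (type name, data ID) pairs and group data IDs by type name."""
    grouped = defaultdict(list)
    seen = set()
    for dataset_type, data_id in refs:
        # Dicts are not hashable, so build a hashable key from the sorted items.
        key = (dataset_type, tuple(sorted(data_id.items())))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        grouped[dataset_type].append(data_id)
    return dict(grouped)


# Hypothetical query results: one tract-level and one visit-level type.
refs = [
    ("tract_task_metadata", {"skymap": "DC2", "tract": 4033, "band": "r"}),
    ("visit_task_metadata", {"visit": 656523, "detector": 7}),
    ("tract_task_metadata", {"skymap": "DC2", "tract": 4033, "band": "r"}),
    ("visit_task_metadata", {"visit": 656523, "detector": 8}),
]

grouped = group_by_dataset_type(refs)
for name, data_ids in grouped.items():
    print(name, len(data_ids))
```

With this layout a tract-level data ID can never be paired with a visit-level task by accident, whatever order the query returns.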