metno / pyaerocom

Python tools for the AeroCom project
https://pyaerocom.readthedocs.io/
GNU General Public License v3.0

Update caching strategy for ungridded data #1242

Closed. heikoklein closed this issue 4 days ago

heikoklein commented 5 days ago

Is your feature request related to a problem? Please describe.

The cached ungridded data objects are re-evaluated very often (more or less daily), even though creating them takes several hours. The rules for cache rejection are too strict:

{
    'pyaerocom_version': '0.20.0',
    'newest_file_in_read_dir': 'data',
    'newest_file_date_in_read_dir': 1719824185.0,
    'data_revision': '20240627',
    'reader_version': '0.52_0.09',
    'ungridded_data_version': '0.22',
    'cacher_version': '1.12'}

The fields causing problems are:

Describe the solution you would like to see

lewisblake commented 5 days ago

I think you nailed it. This seems like a better approach to the cache invalidation.

jgriesfeller commented 5 days ago

Although some file systems provide an mtime resolution finer than one second, Python seems to translate that to one-second resolution only. But I agree that searching through thousands of files is not a good idea for cache invalidation. I also agree that we should not use pyaerocom_version for cache validation. We need to make sure that all obs networks really provide a revision string; not all may do so correctly, since it is not mandatory. If all do, basing the cache invalidation on the data revision number and the ungridded revision number should do the job.
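To make the proposal concrete, here is a minimal sketch of a version-based validity check. It is not pyaerocom's actual implementation: the function name `cache_is_valid` and the tuple `VALIDATION_FIELDS` are hypothetical, and which fields to keep besides `data_revision` and `ungridded_data_version` is an assumption. The field names themselves come from the cache header quoted in this issue.

```python
# Hypothetical helper, not pyaerocom's actual code.
# Field names are taken from the cache header shown in this issue;
# 'pyaerocom_version' is deliberately left out of the comparison.
VALIDATION_FIELDS = (
    "data_revision",
    "reader_version",
    "ungridded_data_version",
    "cacher_version",
)


def cache_is_valid(cached_header: dict, current_header: dict) -> bool:
    """Return True if every validation field is present and unchanged."""
    for key in VALIDATION_FIELDS:
        cached = cached_header.get(key)
        # A missing or empty value (e.g. a reader without a revision string)
        # cannot be validated, so the cache is rejected in that case.
        if not cached or cached != current_header.get(key):
            return False
    return True
```

With this check, a new pyaerocom release alone would not invalidate the cache, while a new data revision would. Rejecting empty values is one possible way to handle readers that do not yet provide a revision string.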

heikoklein commented 4 days ago

Thanks for the feedback. I will remove the dependency on the pyaerocom_version and switch from ctime to mtime.
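For illustration, the switch from ctime to mtime could look roughly like the sketch below. The helper name `newest_mtime` is hypothetical, and scanning only the top level of the read directory is an assumption; the real reader layout may differ.

```python
import os


def newest_mtime(directory: str) -> float:
    """Return the newest modification time (seconds since the epoch) among
    the regular files directly inside `directory`, or 0.0 if there are none."""
    newest = 0.0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                # st_mtime reflects the last content modification.
                # st_ctime on Unix also changes on metadata updates such as
                # chmod or chown, and on Windows it is the creation time.
                newest = max(newest, entry.stat().st_mtime)
    return newest
```

Because ctime is also updated by metadata changes, copying or re-permissioning an unchanged data directory can make every file look new; mtime avoids that. `os.stat` additionally exposes `st_mtime_ns` as an integer nanosecond value where the file system supports it.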

I will open a separate ticket to make a revision string from the file-readers or observation networks mandatory. This will take some more time, because we have to check all obs-readers.