spacepy / dbprocessing

Automated processing controller for heliophysics data

Handle timestamp HHMMSS using processqueue.py -p #96

Open dnadeau-lanl opened 2 years ago

dnadeau-lanl commented 2 years ago

Use more than the daily date when processing hourly files. Some files are received every hour, but the process queue script discards all but the newest one, so only the newest file is processed.

Relation to an issue

#38

Proposed enhancement

Process multiple files within a single day.

For example, the following files are not processed correctly by dbprocessing. The inspector extracts the day of year (288) and converts it into a date. Even if we pass the HHMMSS to utc_start_time and utc_stop_time, the files are discarded when running the process queue script to start processing.

 STP6_2020288230002_SENH_VC34.0
 STP6_2020288220002_SENH_VC34.0
 STP6_2020288210002_SENH_VC34.0
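
As a rough illustration of what keeping the HHMMSS part would mean, here is a minimal sketch that pulls the full timestamp out of these filenames. The regular expression and the function name `parse_file_timestamp` are assumptions made for this example; they are not part of dbprocessing or of any existing inspector.

```python
import re
from datetime import datetime, timedelta

# Assumed layout of the 13-digit field: YYYY + day-of-year (DDD) + HHMMSS.
STAMP_RE = re.compile(r"_(\d{4})(\d{3})(\d{2})(\d{2})(\d{2})_")


def parse_file_timestamp(filename):
    """Return the full datetime encoded in the filename, not only the date.

    Hypothetical helper for illustration; not a dbprocessing function.
    """
    match = STAMP_RE.search(filename)
    if match is None:
        raise ValueError("no YYYYDDDHHMMSS field in %r" % filename)
    year, doy, hour, minute, second = (int(g) for g in match.groups())
    # Day-of-year 1 is January 1, so add (doy - 1) days to the start of the year.
    return datetime(year, 1, 1) + timedelta(
        days=doy - 1, hours=hour, minutes=minute, seconds=second)


names = [
    "STP6_2020288230002_SENH_VC34.0",
    "STP6_2020288220002_SENH_VC34.0",
    "STP6_2020288210002_SENH_VC34.0",
]
stamps = [parse_file_timestamp(name) for name in names]
for name, stamp in zip(names, stamps):
    print(name, stamp.isoformat())
```

All three files fall on 2020-10-14 and differ only in the hour.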

When running processqueue.py -p, we get the following message and only the last file is retained.

https://github.com/spacepy/dbprocessing/blob/master/dbprocessing/dbprocessing.py#L368

Proposed code

https://github.com/spacepy/dbprocessing/blob/master/dbprocessing/DButils.py#L683

dates = [file.utc_start_time, file.utc_stop_time]
latest = self.getFilesByProductTime(product_id, dates, newest_version=True)

https://github.com/spacepy/dbprocessing/blob/master/dbprocessing/dbprocessing.py#L296-L301

start_time = sq.utc_start_time
stop_time = sq.utc_stop_time

https://github.com/spacepy/dbprocessing/blob/master/dbprocessing/runMe.py#L294-L295 Comment out these lines:

#if isinstance(utc_file_date, datetime.datetime):
#    utc_file_date = utc_file_date.date()

Alternatives

There are currently no alternatives.

I tried using "RUN" and "FILE" as output_timebase instead of "DAILY". With the changes above, I am currently processing using output_timebase="FILE". I would like to use output_timebase="RUN" with no output_product.

OS, Python version, and dependency version information:

Version of dbprocessing

spacepy/master branch

Closure condition

This issue should be closed when:

  1. code can handle two files with the same date but different timestamps (HHMMSS)
  2. tests are written and updated to handle these cases. Existing tests will fail, since they check only the date in the format YYYY-MM-DD without checking the timestamp (HHMMSS)

jtniehof commented 2 years ago

This is going to need substantial work, since it requires a change to a key dbprocessing design concept: product plus utc_file_date plus version is unique, and (similarly) there is only one newest version for a given utc_file_date and product.
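
To make the uniqueness point concrete, here is a small sketch of how three hourly files with the same product and version collapse onto one key when the key holds only a date. The product id, version tuple, and key layout are placeholders chosen for this example and do not reflect the actual dbprocessing schema.

```python
from datetime import datetime

PRODUCT_ID = 1       # hypothetical product id
VERSION = (1, 0, 0)  # hypothetical version, identical for all three files

# Timestamps of three hourly files from the same day (2020, day-of-year 288).
timestamps = [
    datetime(2020, 10, 14, 21, 0, 2),
    datetime(2020, 10, 14, 22, 0, 2),
    datetime(2020, 10, 14, 23, 0, 2),
]

# Key built on the date only: all three files map to the same key,
# so only one of them can count as "the newest version" for that day.
date_keys = {(PRODUCT_ID, ts.date(), VERSION) for ts in timestamps}

# Key built on the full timestamp: each file keeps its own key.
full_keys = {(PRODUCT_ID, ts, VERSION) for ts in timestamps}

print(len(date_keys), len(full_keys))
```

With a date-only key there is a single slot for the day, which matches the observed behavior that only one hourly file survives.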

The utc_file_date is intentionally distinct from the utc_start_time and utc_stop_time. utc_file_date is the "characteristic" date of the file (generally represented in the filename), while utc_start_time and utc_stop_time are the actual first and last timestamps in the file. In some cases they don't line up perfectly (ECT L0 and L0.5 files in particular). This is where #97 falls apart.

One of the things we had in mind was to extend out the timebase support to MONTHLY and YEARLY. HOURLY doesn't fit in quite the same mold, but is possible. It will certainly require database changes.

So I think there's a lot of preparing-the-ground work for this:

jtniehof commented 2 years ago

I'm thinking that if utc_file_date changes to something like utc_file_start or utc_file_period_start or something like that, this is probably pretty doable. I'll keep calling it utc_file_date for now, but it would represent the "characteristic" start of the file, i.e. what "should" be in it. So for a DAILY file, it would be the same, YYYY-MM-DD 00:00:00, with the idea anything before YYYY-MM-DD+1 00:00:00 "belongs" in there (for the sake of doing searches for input files.) HOURLY would have, say YYYY-MM-DD 00:00:00 but then that file is anything before YYYY-MM-DD 01:00:00. MONTHLY is an obvious extension.
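
A minimal sketch of that idea, assuming a characteristic start plus a timebase is enough to derive the half-open interval a file "should" cover. The names `file_period` and `belongs` are made up for this illustration, and nothing here reflects an existing dbprocessing API or schema.

```python
from datetime import datetime, timedelta


def file_period(start, timebase):
    """Return (begin, end) of the period containing `start`; end is exclusive.

    Hypothetical helper for illustration only.
    """
    if timebase == "HOURLY":
        begin = start.replace(minute=0, second=0, microsecond=0)
        end = begin + timedelta(hours=1)
    elif timebase == "DAILY":
        begin = start.replace(hour=0, minute=0, second=0, microsecond=0)
        end = begin + timedelta(days=1)
    elif timebase == "MONTHLY":
        begin = start.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # Roll over to the first day of the next month.
        if begin.month == 12:
            end = begin.replace(year=begin.year + 1, month=1)
        else:
            end = begin.replace(month=begin.month + 1)
    else:
        raise ValueError("unsupported timebase: %r" % timebase)
    return begin, end


def belongs(timestamp, period_start, timebase):
    """True if `timestamp` falls in the period that starts at `period_start`."""
    begin, end = file_period(period_start, timebase)
    return begin <= timestamp < end


stamp = datetime(2020, 10, 14, 23, 0, 2)
print(file_period(stamp, "HOURLY"))
print(file_period(stamp, "DAILY"))
print(file_period(stamp, "MONTHLY"))
```

Using a half-open interval means a record at exactly 01:00:00 belongs to the next hourly file, so adjacent periods never overlap.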

WEEKLY gets tricky, but can be punted as we don't need it right now.

You don't have anything weird that requires, say, two-hour files, do you?