spacepy / dbprocessing

Automated processing controller for heliophysics data
5 stars 4 forks source link

getFiles doesn't return newest version if start/stop time different across versions #45

Open jtniehof opened 3 years ago

jtniehof commented 3 years ago

I'm going to call this a bug, although it's maybe more "unintended consequence of design decision." if getFiles is called with a startTime and/or endTime specified, it will retrieve all files that match those start/end, and then filter those by version if newest_version is specified. The filtering is done based on getting the latest version of every utc_file_date.

The problem here is if the different versions have different start/stop times that span days. If 20190101 v1.0.0 has stop time 2019-01-02T00:01:00 (because maybe there was a bug in selecting the input data), and 20190101 v1.1.0 has stop time 2019-01-01T23:59:30, then a latest_version request for 20190102 will include 20190101 v1.0.0, as it's the latest version of 20190101 that has data on 20190102.

This is biting me because there was a bug in selecting the input data that got fixed, but the old files are of course still around in the db.

Minimal example to reproduce issue:

I'll work this into a unit test in the future.

Version of dbprocessing

git master (8e5d3ae432fb35227c74bc5615422cf456b39578)

Closure condition

This issue should be closed when a PR is merged that includes a unit test for earlier versions with wrong days which passes (i.e. does not return that file if the newer version of that file doesn't have that day.)

jtniehof commented 3 years ago

I knew there was a reason we had the newestVersion flag in the db.

jtniehof commented 3 years ago

I think this might be solvable with a group by utc_file_date order by version on the query.

balarsen commented 3 years ago

I think this might be solvable with a group by utc_file_date order by version on the query.

I believe this is be exactly the way to do this. I thought that is how it was done...

fileIsNewest(self, filename, debug=False) calls out to latest = self.getFilesByProductDate(product_id, [date]*2, newest_version=True) which goes to

        return self.getFiles(startDate=min(daterange),
                             endDate=max(daterange),
                             product=product_id,
                             newest_version=newest_version)

And that is wrong.

In my vision for how to do this was done with utc_file_date, i suspect it did at one point (maybe).

We do need to think hard and fix this one.

jtniehof commented 3 years ago

This is only an issue when using utc_start_time and utc_stop_time, which of course is now the norm. getFiles doesn't call fileIsNewest (don't think it ever has, since that would be a bit circular, although it would be an easy cheap fix.)

There's really three things that lead to this: you did a significant rearrange that switched to using utc_file_date instead of start_time and stop_time because it was getting intractable; then @MylesJohnson did a huge revamp (I think the generic getFiles was a part of that revamp) that did a lot for speed, flexibility, and just plain right answer; and then I added back in the start_time/stop_time search. We can get this right...we've got a lot of improvement to architecture, much better testing so we can catch corner cases, and overall a better knowledge of how to squeeze the database than we did ten years ago.

In the meantime I'm going to punt by munging the unix start/stop time for the messed up files in my db, but will then look at fixing properly.

balarsen commented 3 years ago

OK yes I agree. All of the getFilesXXX routines call into getFiles which is good.

when newest_version is set it hits this

        if newest_version:
            files = files.order_by(self.File.interface_version, self.File.quality_version, self.File.revision_version)
            x = files.limit(limit).all()

Which was populated by a heck of a lot of filters.

I think if the startDate is set you probably get the correct file per

        if startDate is not None:
            if endDate is not None:
                files = files.filter(self.File.utc_file_date.between(
                    startDate, endDate))
            else: # Start date only
                files = files.filter(self.File.utc_file_date >= startDate)
        elif endDate is not None: # End date only
            files = files.filter(self.File.utc_file_date <= endDate)

since that is a slice on the utc_file_date

Am I thinking on this correctly?

jtniehof commented 3 years ago

If you're filtering on date rather than time, yes, it should be correct.

The real magic in newest_version is the next line, which populates a dict...the order_by just makes sure that the last one in is the highest version number. Which is the latest version if all files of that utc_fie_date are in the list, which will be true as long as startTime/endTime aren't set. It's subtle stuff.

jtniehof commented 3 years ago

Incidentally although this should definitely be fixed, I'm wondering if moving buildChildren away from doing this lookup one-by-one for everything in the process queue is a good idea. @MylesJohnson also did some really cool work with a graph-based representation of the database that takes a little while to build but is blazing fast once built. If the process queue is more than a few files long it might be worth exploiting that.

jtniehof commented 3 years ago

Diverging from the baseline issue here but I also wonder if there isn't some savings to be had where if files a, b, and c are all on the process queue, and in calculating children for a it's determined that b and c are required, do b and c need a full check for children or if some of the work done already.

jtniehof commented 3 years ago

47 is sort of tangled up with this, and in writing that up I realized that newest_version from getFiles returns the newest version of all the results, not the absolute newest version of the utc day + product combination. It's going to be tricky to get exactly the right query, because in some cases the newest version might not be included in the result set at all, in which case we want to have nothing come back....based on rows that aren't returned from the db at all! It's almost like the order-by and group-by have to happen before any filtering.

jtniehof commented 3 years ago

Just making a note here that I think we might be able to address the problem in my comment of 2020-10-22 by doing another lookup on utc_file_date to make sure the version we have is the latest (and, if not, just toss that file from the list altogether.) I do expect to be able to have some time to work on this soon; it's getting annoying.

jtniehof commented 2 years ago

Alternative to my previous comment of 2021-07-15: first lookup for all files with a start/stop time within the range to get all the utc file dates relevant, then lookup on utc file date and get the newest, then toss any resulting UTC file dates with start/stop outside the range. This might or might not be faster than the idea in that comment.