google-code-export / stoqs

Automatically exported from code.google.com/p/stoqs
GNU General Public License v3.0

Allow loaders to append data to an existing activity #28

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Execute a load of data
2. Rerun the load for new data that have been appended to the source netCDF file
3. Repeat this 100 times or more

What is the expected output? What do you see instead?

We'd like to see a fast response from the STOQS UI, but this is not the case. For
the September 2013 CANON Campaign it takes 2 minutes to transfer 27 MB of JSON
data to the client.

Original issue reported on code.google.com by MBARIm...@gmail.com on 5 Nov 2013 at 6:32

GoogleCodeExporter commented 9 years ago
The 27 MB of JSON data was the result of cron jobs that simply wanted to
append data to existing Activities. Instead of appending, the DAPloaders.py
code was creating a new Activity each time it was run.

Suggest investigating a command line option for the scripts that are run
periodically, so that they append data to existing Activities.
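
A minimal sketch of what such an option could look like, assuming argparse
(the names here are illustrative, not the actual LoadScript interface):

    # Illustrative sketch only - not the actual LoadScript code
    import argparse

    parser = argparse.ArgumentParser(description='Load data into a STOQS database')
    parser.add_argument('--append', action='store_true',
                        help='Append data to existing Activities instead of '
                             'creating new ones on each run')
    args = parser.parse_args()

    if args.append:
        # Look up the last loaded time in the database and load only newer data
        pass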

-Mike

Original comment by MBARIm...@gmail.com on 28 Aug 2014 at 6:27

GoogleCodeExporter commented 9 years ago
Core to this issue is the overloaded use of startDatetime and endDatetime in
the loaders. These values are used both to uniquely identify Activities AND to
restrict the time domain of data to be loaded into the Activity. This works
fine if the load is a one-time load (as happens following a campaign).

If the startDatetime and endDatetime values are incremented with the
periodicity of the load execution for realtime data (as was done last Fall),
then a new Activity is created for each execution. In this case we would like
to use a different parameter to specify the start (and maybe the end) time for
the DATA to be loaded. The Base_Loader class
(https://code.google.com/p/stoqs/source/browse/loaders/DAPloaders.py) has a
dataStartDatetime parameter, but it does not appear to be implemented.

I will begin testing an implementation.
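
To make the distinction concrete, here is a minimal sketch (hypothetical
names, not the actual Base_Loader code) of how dataStartDatetime could narrow
the data window while startDatetime/endDatetime continue to identify the
Activity:

    # Hypothetical sketch: startDatetime/endDatetime identify the Activity;
    # dataStartDatetime, if given, only narrows which records are loaded.
    from datetime import datetime

    def select_records(records, startDatetime, endDatetime, dataStartDatetime=None):
        '''Yield (time, value) tuples that fall within the load window.'''
        loadFrom = dataStartDatetime or startDatetime
        for t, value in records:
            if loadFrom <= t <= endDatetime:
                yield t, value

    # Hourly append: the Activity still spans the whole deployment, but only
    # records newer than the last one already in the database are loaded.
    records = [(datetime(2014, 9, 8, h), 15.0 + 0.1 * h) for h in range(12)]
    new = list(select_records(records,
                              startDatetime=datetime(2014, 9, 8, 0),
                              endDatetime=datetime(2014, 9, 9, 0),
                              dataStartDatetime=datetime(2014, 9, 8, 9)))
    print(new)  # only the records at hours 9, 10, and 11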

Original comment by MBARIm...@gmail.com on 8 Sep 2014 at 6:43

GoogleCodeExporter commented 9 years ago
This change set:
https://code.google.com/p/stoqs/source/detail?r=ce8f00cd56bfe27b9f402bdca619edcb300d82b8
implements a --append option for all loaders that extend LoadScript().

It's being tested with an hourly cron job that loads new data from
http://dods.mbari.org/opendap/data/ssdsdata/deployments/m1/201407/OS_M1_20140716hourly_CMSTV.nc.html.

There is a problem where records containing only missing datavalues are
(correctly) not loaded, but their InstantPoints are still inserted, making the
loader think that those data have been loaded. I suspect that this is caused
by this netCDF file containing multiple grids for the met, ts, and adcp data
and that this code:

        # Deliver the data harmonized as rows as an iterator so that they are fed as needed to the database
        for pname in data.keys():
            logger.info('Delivering rows of data for %s', pname)
            l = 0                       # time index into times[pname]
            for depthArray in data[pname]:
                k = 0                   # depth index into depths[pname]
                logger.debug('depthArray = %s', depthArray)
                logger.debug('nomDepths = %s', nomDepths)
                values = {}
                for dv in depthArray:
                    values[pname] = float(dv)
                    values['time'] = times[pname][l]
                    values['depth'] = depths[pname][k]
                    values['latitude'] = latitudes[pname]
                    values['longitude'] = longitudes[pname]
                    values['timeUnits'] = timeUnits[pname]
                    try:
                        values['nomDepth'] = nomDepths[pname][k]
                    except IndexError:
                        values['nomDepth'] = nomDepths[pname]
                    values['nomLat'] = nomLats[pname]
                    values['nomLon'] = nomLons[pname]
                    yield values
                    k = k + 1
                l = l + 1               # advance to the next time step

in DAPloaders' _getTimeSeriesGridType() makes some assumptions that didn't
anticipate different grids in the same file.

Original comment by MBARIm...@gmail.com on 11 Sep 2014 at 4:47

GoogleCodeExporter commented 9 years ago
I examined that code. The logic appears to be correct for dealing with multiple 
grids in the same file. Need to do more testing...

Original comment by MBARIm...@gmail.com on 11 Sep 2014 at 10:00

GoogleCodeExporter commented 9 years ago
It appears that the missing value is in the source data, e.g. for the last 2 
values of air_temperature now:

http://dods.mbari.org/opendap/hyrax/data/ssdsdata/deployments/m1/201407/OS_M1_20140716hourly_CMSTV.nc.ascii?hr_time_met[1373:1:1374],AIR_TEMPERATURE_HR[1373:1:1374][0:1:0][0:1:0][0:1:0]

Dataset: OS_M1_20140716hourly_CMSTV.nc
hr_time_met, 1410481800, 1410485400
AIR_TEMPERATURE_HR.Longitude, -122.030275
AIR_TEMPERATURE_HR.AIR_TEMPERATURE_HR[AIR_TEMPERATURE_HR.hr_time_met=1410481800][AIR_TEMPERATURE_HR.HR_DEPTH_met=-2.5][AIR_TEMPERATURE_HR.Latitude=36.756775], 15.6817
AIR_TEMPERATURE_HR.AIR_TEMPERATURE_HR[AIR_TEMPERATURE_HR.hr_time_met=1410485400][AIR_TEMPERATURE_HR.HR_DEPTH_met=-2.5][AIR_TEMPERATURE_HR.Latitude=36.756775], -1e+34

When the load script runs hourly it loads just this last value, which is the
missing_value. The InstantPoint gets loaded, preventing the good value from
being loaded the next hour. Hmmmm... what to do about this...
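
For reference, a generic fill-value test along these lines (a sketch, not the
DAPloaders code; the -1e+34 comes from the missing_value shown above) would
recognize such a record before an InstantPoint is created for it:

    # Sketch: detect values matching the netCDF missing_value/_FillValue
    MISSING_VALUE = -1e+34  # from the AIR_TEMPERATURE_HR missing_value attribute

    def is_missing(dv, missing_value=MISSING_VALUE, rtol=1e-5):
        '''True if dv is within relative tolerance of the fill value.'''
        return abs(dv - missing_value) <= rtol * abs(missing_value)

    print(is_missing(-1e+34))   # True - skip, do not create an InstantPoint
    print(is_missing(15.6817))  # False - a real air temperature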

Original comment by MBARIm...@gmail.com on 12 Sep 2014 at 3:21

GoogleCodeExporter commented 9 years ago
There are missing_values (or _FillValues) because a time cell may contain only
ADCP or TS data but not met data, or vice versa. Let's try changing the
dataStartDatetime value to one hour less than the last InstantPoint timevalue
in the database. This should fill in the good data values when they come in.
There will be some database warnings for attempts to load MeasuredParameters
that already exist.

Original comment by MBARIm...@gmail.com on 12 Sep 2014 at 3:27

GoogleCodeExporter commented 9 years ago
That was it! Changing the loadM1() method in loaders/CANON/__init__.py to
subtract an hour from the last time in the database now loads new data:

if self.args.append:
    # Return datetime of last timevalue - if data are loaded from multiple
    # activities return the earliest last datetime value
    dataStartDatetime = InstantPoint.objects.using(self.dbAlias).filter(
            activity__name=aName).aggregate(Max('timevalue'))['timevalue__max']
    if dataStartDatetime:
        # Subtract an hour to fill in missing_values at end from previous load
        dataStartDatetime = dataStartDatetime - timedelta(seconds=3600)

You can observe the results (updated hourly) at 
http://kraken.shore.mbari.org/canon/stoqs_september2014/ (internal to MBARI).

Now to confirm this works for lrauv data...

Original comment by MBARIm...@gmail.com on 12 Sep 2014 at 8:47

GoogleCodeExporter commented 9 years ago
Still working on reworking the monitorLrauv.py code to work with current data. 

Observed a problem with the mooring data load that has been running on kraken
for about a week. It seems that extra, unneeded records are being added to the
simpledepthtime table. This increases the size of the JSON in the
query/summary response and produces artifacts in the Temporal/Depth Flot
plot.

Need to investigate updating the last time for each depth in the 
simpledepthtime table rather than blindly inserting new records...
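
A rough sketch of that idea, using assumed model and field names
(SimpleDepthTime with activity, epochmilliseconds, and depth fields; the
import path is also an assumption):

    # Hypothetical sketch - model import path and field names are assumptions
    from stoqs.models import SimpleDepthTime

    def extend_simple_depth_time(dbAlias, activity, epoch_ms, depth):
        '''Move the last record for this activity/depth forward in time,
        inserting a new record only if none exists yet.'''
        last = (SimpleDepthTime.objects.using(dbAlias)
                .filter(activity=activity, depth=depth)
                .order_by('-epochmilliseconds')
                .first())
        if last:
            last.epochmilliseconds = epoch_ms  # update rather than insert
            last.save(using=dbAlias)
        else:
            SimpleDepthTime.objects.using(dbAlias).create(
                activity=activity, epochmilliseconds=epoch_ms, depth=depth)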

Original comment by MBARIm...@gmail.com on 19 Sep 2014 at 4:25

GoogleCodeExporter commented 9 years ago
Regarding the last comment, changeset
https://code.google.com/p/stoqs/source/detail?r=35b181ca3fa29960c01b60d7e8e9974400e28984
implements a better update of the SimpleDepthTime table for when timeSeries
and timeSeriesProfile data are being appended. The UI response time is much
better now: it went from about 8 seconds to less than 0.5 seconds.

Original comment by MBARIm...@gmail.com on 23 Sep 2014 at 4:41

GoogleCodeExporter commented 9 years ago
Committed changes to
https://code.google.com/p/stoqs/source/browse/loaders/CANON/realtime/monitorLrauv.py
and it appears to be running fine with the dataStartDatetime being set to the
last time for the Activity in the database.

Marking this issue as fixed.

Original comment by MBARIm...@gmail.com on 24 Sep 2014 at 4:53