terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Incomplete data in geostreams? #419

Open craig-willis opened 6 years ago

craig-willis commented 6 years ago

For the upcoming DataDrivenAg hackathon, I was trying to do the following:

The hope was to use the new sensorquery API, but this is still not fully working. Instead, I tried going to Geostreams directly:

Find the Geostreams sensor associated with a given Season 4 plot -- I picked something in the middle of the field:

https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name=MAC%20Field%20Scanner%20Season%204%20Range%2029%20Column%207

I couldn't find a way to list the streams (i.e., TERRA sensors) available for the plot. Instead, I had to retrieve the full list of streams and parse out the names, which seems undesirable:

curl --compressed "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/streams"  |  jq -r ".[].name" | sed 's/(.*//g' | sort | uniq -c | sort -n
   1 Energy Farm Observations CEN
   1 Energy Farm Observations NE
   1 Energy Farm Observations SE
   1 Irrigation Observations
   1 Weather Observations
   3
 156 Laser Scanner 3D LAS Datasets
 509 scanner3DTop Datasets
3806 IR Surface Temperature
5372 RGB GeoTIFFs Datasets
5546 stereoTop Datasets
5559 Thermal IR GeoTIFFs Datasets
5678 flirIrCamera Datasets
6398 Canopy Cover
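
For anyone who prefers to do this grouping in a script, here is a minimal Python sketch of the same idea as the jq/sed/uniq pipeline above. It assumes the `/geostreams/streams` response is a JSON array of objects with a `name` field, as the jq filter implies. The function name and the sample records are hypothetical, and unlike the sed expression it also trims the trailing space left after the suffix is removed.

```python
import re
from collections import Counter


def count_stream_types(streams):
    """Count streams per type, ignoring the trailing "(sensor_id)" suffix."""
    counts = Counter()
    for stream in streams:
        # Drop everything from the first "(" onward, like sed 's/(.*//g',
        # then trim the leftover trailing space.
        base = re.sub(r"\(.*", "", stream.get("name", "")).strip()
        counts[base] += 1
    return counts


# Hypothetical sample shaped like the streams response; only "name" is used.
sample = [
    {"name": "RGB GeoTIFFs Datasets (1008)"},
    {"name": "RGB GeoTIFFs Datasets (1009)"},
    {"name": "Canopy Cover (1008)"},
    {"name": "Weather Observations"},
]

# Print in ascending order of count, like `sort -n` at the end of the pipeline.
for name, n in sorted(count_stream_types(sample).items(), key=lambda kv: kv[1]):
    print(n, name)
```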

Now that I know the names of the available streams, I can compose the full stream name by combining the base name with the Geostreams sensor_id (e.g., "RGB GeoTIFFs Datasets (1008)"):

https://terraref.ncsa.illinois.edu/clowder/api/geostreams/streams?stream_name=RGB%20GeoTIFFs%20Datasets%20(1008)
...
   "id": 9804
...
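
The name composition above can be sketched in Python as follows. This assumes the "<stream type> (<sensor_id>)" naming convention observed in this issue holds for other streams, which is not confirmed. The helper name is hypothetical, and parentheses are left unescaped only so the result matches the URL shown above.

```python
from urllib.parse import quote

BASE = "https://terraref.ncsa.illinois.edu/clowder/api/geostreams"


def stream_lookup_url(stream_type, sensor_id):
    """Build the stream lookup URL from a stream type and a sensor id."""
    # Assumed convention: "<stream type> (<sensor_id>)".
    name = f"{stream_type} ({sensor_id})"
    # Encode spaces as %20 and keep "(" and ")" literal.
    return f"{BASE}/streams?stream_name={quote(name, safe='()')}"


print(stream_lookup_url("RGB GeoTIFFs Datasets", 1008))
```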

And finally I can make the leap to get the associated datapoints:

https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id=9804

In this case, there are only 82 datapoints, covering just a handful of Season 4 dates:

curl --compressed "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id=9804" | jq -r ".[].start_time"  | sed 's/T.*//g' | sort | uniq -c
   6 2017-04-26
   8 2017-04-27
   7 2017-05-05
  11 2017-05-09
   7 2017-05-10
   9 2017-05-12
   9 2017-05-16
   6 2017-05-17
   7 2017-05-18
  11 2017-05-19
   6 2017-05-20
   1 2017-07-24
   1 2017-08-11
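
The same per-day tally can be sketched in Python, assuming each datapoint carries an ISO 8601 `start_time` string, as the jq filter above implies. The function name and the sample timestamps are hypothetical.

```python
from collections import Counter


def datapoints_per_day(datapoints):
    """Count datapoints by the date part of their start_time."""
    # Keep only the part before "T", like sed 's/T.*//g'.
    return Counter(dp["start_time"].split("T", 1)[0] for dp in datapoints)


# Hypothetical sample shaped like the datapoints response.
sample = [
    {"start_time": "2017-04-26T12:01:00Z"},
    {"start_time": "2017-04-26T12:05:00Z"},
    {"start_time": "2017-04-27T11:30:00Z"},
]

for day, n in sorted(datapoints_per_day(sample).items()):
    print(n, day)
```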

It seems that a large amount of data is missing from Geostreams. I was expecting at least one datapoint per day.
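
To make the "one datapoint per day" expectation checkable, here is a rough Python sketch that lists the days in a date range with no datapoints. It assumes `start_time` begins with a `YYYY-MM-DD` date. The function name, the chosen range, and the sample timestamps are hypothetical.

```python
from datetime import date, timedelta


def missing_days(datapoints, first, last):
    """Return the dates in [first, last] that have no datapoint."""
    # The first 10 characters of an ISO 8601 timestamp are the date.
    seen = {date.fromisoformat(dp["start_time"][:10]) for dp in datapoints}
    span = (last - first).days
    all_days = (first + timedelta(days=n) for n in range(span + 1))
    return [day for day in all_days if day not in seen]


# Hypothetical sample: datapoints on May 9, 10, and 12 only.
sample = [
    {"start_time": "2017-05-09T10:00:00Z"},
    {"start_time": "2017-05-10T10:00:00Z"},
    {"start_time": "2017-05-12T10:00:00Z"},
]

for day in missing_days(sample, date(2017, 5, 9), date(2017, 5, 12)):
    print(day.isoformat())
```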

max-zilla commented 6 years ago

We can discuss when you get in. The "Datasets" streams are all driven by the sensorposition extractor, and we haven't been running more than 3 instances of it at once during the full field processing. There are currently ~6 million messages waiting to be processed in that queue: http://rabbitmq.ncsa.illinois.edu:15672/#/queues/clowder/terra.metadata.sensorposition

I would expect that a number of the RGB GeoTIFFs, for example, are in that queue waiting to be submitted to Geostreams.