maris-development / beacon-blue-cloud


CORA TS - failed to process query with the filter: "use_specific_dataset" #5

Open geofrizz opened 1 month ago

geofrizz commented 1 month ago

With the API it's possible to download the metadata for a specific dataset id

e.g.: https://beacon-cora-ts.maris.nl/api/datasets/dataset-metadata/64282

{ "id": 64282, "name": "datasets/2018/CO_DMQCGL01_20180520_TS_MO.nc", "tensors": { "CONFIG_MISSION_NUMBER": { "name": "CONFIG_MISSION_NUMBER", "tensor_range": { ...
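For reference, the metadata endpoint can be built from the dataset id alone; a minimal sketch (the base URL is taken from the request shown above, the helper name is just for illustration):

```python
# Build the dataset-metadata URL for a given dataset id
# (base URL as shown in the request above).
BASE_URL = "https://beacon-cora-ts.maris.nl/api/datasets/dataset-metadata"

def metadata_url(dataset_id: int) -> str:
    """Return the metadata endpoint for one dataset id."""
    return f"{BASE_URL}/{dataset_id}"

print(metadata_url(64282))
# → https://beacon-cora-ts.maris.nl/api/datasets/dataset-metadata/64282
```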

If you try to download the data with the following query:

{
  "query_parameters": [
    {
      "column_name": "TEMP",
      "alias": "Temperature [celsius]"
    },
    {
      "column_name": "JULD",
      "alias": "Julian Date"
    },
    {
      "column_name": "DEPH",
      "alias": "Depth [meter]"
    },
    {
      "column_name": "LATITUDE",
      "alias": "Latitude [degrees_north]"
    },
    {
      "column_name": "LONGITUDE",
      "alias": "Longitude [degrees_east]"
    },
    {
      "column_name": "PLATFORM_NUMBER",
      "alias": "PLATFORM_NUMBER"
    }
  ],
  "filters": [
    {
        "use_specific_datasets": [ 64282 ]
    }
  ],
  "output": {
    "format": "netcdf"
  }
}
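For what it's worth, a payload like the one above can be generated programmatically, which avoids the stray trailing commas that make hand-written JSON invalid; a minimal Python sketch (column names and the `use_specific_datasets` filter are taken from the query above, the helper name is hypothetical):

```python
import json

# Columns requested in the query above: (column_name, alias) pairs.
COLUMNS = [
    ("TEMP", "Temperature [celsius]"),
    ("JULD", "Julian Date"),
    ("DEPH", "Depth [meter]"),
    ("LATITUDE", "Latitude [degrees_north]"),
    ("LONGITUDE", "Longitude [degrees_east]"),
    ("PLATFORM_NUMBER", "PLATFORM_NUMBER"),
]

def build_query(dataset_ids, output_format="netcdf"):
    """Assemble a query body with a use_specific_datasets filter."""
    return {
        "query_parameters": [
            {"column_name": name, "alias": alias} for name, alias in COLUMNS
        ],
        "filters": [{"use_specific_datasets": list(dataset_ids)}],
        "output": {"format": output_format},
    }

body = build_query([64282])
print(json.dumps(body, indent=2))
```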

The result message is:

"Failed to process query: Failed to process query: Memory limit exceeded. Limit: 17179869184."

Probably the filter doesn't work, because the original netcdf file from Copernicus (CO_DMQCGL01_20180520_TS_MO.nc) is only 270.2 KB, which I think shouldn't exceed the limit! :-)

Am I using the filter in the wrong way?

robinskil commented 1 month ago

Hi,

Thanks for the detailed explanation. I can confirm that it indeed shouldn't fail. I will have to check whether it is something with the filter. (It could be that the specific-datasets filter is somehow not being applied correctly by Beacon.)

Cheers, Robin

geofrizz commented 1 month ago

Hi Robin, today I can add some more details. I applied the following query:

{
  "query_parameters": [
    {
      "column_name": "TEMP",
      "alias": "Temperature [celsius]"
    },
    {
      "column_name": "JULD",
      "alias": "Julian Date"
    },
    {
      "column_name": "DEPH",
      "alias": "Depth [meter]"
    },
    {
      "column_name": "LATITUDE",
      "alias": "Latitude [degrees_north]"
    },
    {
      "column_name": "LONGITUDE",
      "alias": "Longitude [degrees_east]"
    },
    {
      "column_name": "PLATFORM_NUMBER",
      "alias": "PLATFORM_NUMBER"
    }
  ],
  "filters": [
    {
        "use_specific_datasets": [ 64282 ]
    },
    {
        "for_query_parameter": "Longitude [degrees_east]",
        "min": 2.50,
        "max": 10.0
    },
    {
        "for_query_parameter": "Latitude [degrees_north]",
        "min": 41.5,
        "max": 44.5
    },
    {
        "for_query_parameter": "Depth [meter]",
        "min": 0,
        "max": 5000
    },
    {
        "for_query_parameter": "Julian Date",
        "min": 24976,
        "max": 24976.999305555557
    }
  ],
  "output": {
    "format": "netcdf"
  }
}

I tried to recreate the same area with a bbox and a datetime filter (latitude & longitude, JULD).
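As a sanity check on the JULD bounds used above: assuming the usual Copernicus convention of fractional days since 1950-01-01 (an assumption on my part, not stated in this thread), min = 24976 corresponds to 2018-05-20, which matches the date in the file name:

```python
from datetime import datetime, timedelta

# Assumed reference epoch for JULD: days since 1950-01-01
# (the usual Copernicus convention -- an assumption here).
JULD_EPOCH = datetime(1950, 1, 1)

def juld_to_datetime(juld: float) -> datetime:
    """Convert a JULD value (fractional days) to a datetime."""
    return JULD_EPOCH + timedelta(days=juld)

print(juld_to_datetime(24976).date())   # → 2018-05-20
```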

Two datasets appear in the result (dataset_id: 64284, 64282).

There is some problem with the depth: my request was 0 - 5000, the answer covers 0 - 2316.0, while the metadata reports 0 - 2565.468017578125.

I also checked the Platform_Number, and here too there are some inconsistencies: in the beacon extraction some new Platform_Number values are present (which is correct, since there are 2 dataset_ids), but some Platform_Number values present in Copernicus are missing:

| BEACON | COPERNICUS |
| --- | --- |
| FGTO | none |
| FNCM | none |
| none | EXIN0002 |
| EXIN0004 | EXIN0004 |
| none | IF000584 |
| 6100001 | 6100001 |
| 6100002 | 6100002 |
| 6100021 | 6100021 |
| 6100022 | 6100022 |
| 6100188 | 6100188 |
| 6100189 | none |
| 6100190 | 6100190 |
| 6100191 | 6100191 |
| 6100284 | 6100284 |
| none | 6100289 |
| 6100294 | 6100294 |
| 6100295 | 6100295 |
| 6100431 | 6100431 |
| 6800418 | 6800418 |
| 6801015 | 6801015 |
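The mismatch above can be checked mechanically by diffing the two Platform_Number sets; a small sketch using the values from the comparison above (the variable names are just for illustration):

```python
# Platform_Number values from the comparison above.
beacon = {"FGTO", "FNCM", "EXIN0004", "6100001", "6100002", "6100021",
          "6100022", "6100188", "6100189", "6100190", "6100191",
          "6100284", "6100294", "6100295", "6100431", "6800418", "6801015"}
copernicus = {"EXIN0002", "EXIN0004", "IF000584", "6100001", "6100002",
              "6100021", "6100022", "6100188", "6100190", "6100191",
              "6100284", "6100289", "6100294", "6100295", "6100431",
              "6800418", "6801015"}

only_in_beacon = sorted(beacon - copernicus)
missing_from_beacon = sorted(copernicus - beacon)

print("only in beacon:", only_in_beacon)
# → only in beacon: ['6100189', 'FGTO', 'FNCM']
print("missing from beacon:", missing_from_beacon)
# → missing from beacon: ['6100289', 'EXIN0002', 'IF000584']
```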

In the end the amount of data is really small: 3425 rows (beacon) vs 407680 rows (copernicus).

Copernicus data: [screenshot from 2024-10-28 10-29-12]

beacon data: [screenshot from 2024-10-28 10-30-38]

I hope that these new details can help you!

Ciao, Paolo

robinskil commented 4 weeks ago

Hi Paolo,

I will take a look at the Platform_Number differences. As soon as I know more, I will let you know.

Regarding the number of rows returned for the query: when filtering on a column containing fill values, beacon will automatically remove those records as well. So when filtering on the depth column, it would remove records where the depth value is a fill value.

Cheers, Robin

robinskil commented 4 weeks ago

Hi Paolo,

I've quickly checked: if you remove the depth filter, you get back the correct Platform_Number records. It could be that the time series for these platforms don't have any valid depth values (only fill values) and are thus removed when applying a depth filter.

Cheers, Robin