overture-stack / arranger

Data portal API and component generation
https://www.overture.bio/documentation/arranger/
GNU Affero General Public License v3.0

BUG - Aggs show 0 buckets incorrectly when some filters are applied #868

Open justincorrigible opened 5 months ago

justincorrigible commented 5 months ago

As identified in https://github.com/icgc-argo/roadmap/issues/1057, some filters inaccurately produce 0 buckets. This ticket will serve as a documentation log for the research into this issue, and to link its eventual fix.

Thus far, the working theory is that there's something wrong with the aggs filtering for array nested fields, and specifically for "in" operations.

Example

Using the Argo ticket, we can run a test GraphQL query like this one, with no filters:

query ($SQON: JSON) {
  file {
    hits (filters: $SQON) {
      total
    } 
    aggregations(
      filters: $SQON
      include_missing: true
      aggregations_filter_themselves: true
    ) {
      donors__donor_id {
        bucket_count
        buckets {
          doc_count
          key
        }
      }
    }
  }
}
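
To reproduce this outside a GraphQL playground, the query can be wrapped in a standard GraphQL-over-HTTP JSON body. The following is a minimal sketch in Python, not the project's own tooling: `build_payload` is a hypothetical helper name, and passing a null `$SQON` is only assumed to correspond to the "no filters" run. The endpoint URL is not given in this ticket, so actually sending the request is left out.

```python
import json

# The query from this ticket, unchanged apart from whitespace.
QUERY = """
query ($SQON: JSON) {
  file {
    hits(filters: $SQON) {
      total
    }
    aggregations(
      filters: $SQON
      include_missing: true
      aggregations_filter_themselves: true
    ) {
      donors__donor_id {
        bucket_count
        buckets {
          doc_count
          key
        }
      }
    }
  }
}
"""


def build_payload(sqon=None):
    """Return the JSON body for a GraphQL POST (hypothetical helper).

    sqon=None serializes as a null $SQON, which is assumed here to be
    equivalent to running the query with no filters.
    """
    return json.dumps({"query": QUERY, "variables": {"SQON": sqon}})


if __name__ == "__main__":
    print(build_payload())
```

The same payload can then be reused for the filtered runs by passing a SQON dictionary instead of `None`.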

Any anonymous user can see 1660 docs in the dev environment, as seen in Arranger's GraphQL response:

{
  "data": {
    "file": {
      "hits": {
        "total": 1660
      },
      "aggregations": {
        "donors__donor_id": {
          "bucket_count": 6,
          "buckets": [
            {
              "doc_count": 877,
              "key": "DO250472"
            },
            {
              "doc_count": 478,
              "key": "DO253000"
            },
            {
              "doc_count": 163,
              "key": "DO35085"
            },
            {
              "doc_count": 138,
              "key": "DO252999"
            },
            {
              "doc_count": 3,
              "key": "DO250326"
            },
            {
              "doc_count": 1,
              "key": "DO250391"
            }
          ]
        }
      }
    }
  }
}

Now let's assume the following SQON:

{
  "content": {
    "fieldName": "donors.specimens.specimen_tissue_source",
    "value": "Solid tissue"
  },
  "op": "in"
}
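
Since the investigation compares the same clause under "in" and "not_in", it can help to generate both variants from one place so that only the operation differs. A small sketch, assuming nothing beyond the SQON shape shown in this ticket; `make_sqon` is a hypothetical helper, not part of Arranger.

```python
def make_sqon(field_name, value, op="in"):
    """Build a single-clause SQON in the same shape as the one in this ticket."""
    # Only the two operations discussed in this ticket are handled here.
    if op not in ("in", "not_in"):
        raise ValueError(f"unsupported op for this sketch: {op}")
    return {"content": {"fieldName": field_name, "value": value}, "op": op}


FIELD = "donors.specimens.specimen_tissue_source"

# Same field and value, differing only in the operation.
in_sqon = make_sqon(FIELD, "Solid tissue")
not_in_sqon = make_sqon(FIELD, "Solid tissue", op="not_in")
```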

Note: donors here is technically an array, and so is specimens (within each donor).

...which results in this response (aka the problem):

{
  "data": {
    "file": {
      "hits": {
        "total": 18
      },
      "aggregations": {
        "donors__donor_id": {
          "bucket_count": 0,
          "buckets": []
        }
      }
    }
  }
}
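
The symptom in this response can be detected mechanically: the hits total is non-zero while an aggregation reports no buckets at all. The sketch below flags that combination. It assumes that, with include_missing enabled, every matching document should land in at least one bucket (a value bucket or a missing one); `find_suspicious_aggs` is a hypothetical helper name.

```python
def find_suspicious_aggs(response):
    """Return names of aggregations with 0 buckets despite non-zero hits."""
    file_data = response["data"]["file"]
    total = file_data["hits"]["total"]
    return [
        name
        for name, agg in file_data["aggregations"].items()
        if total > 0 and agg["bucket_count"] == 0
    ]


# The problematic response from this ticket.
problem_response = {
    "data": {
        "file": {
            "hits": {"total": 18},
            "aggregations": {
                "donors__donor_id": {"bucket_count": 0, "buckets": []}
            },
        }
    }
}

print(find_suspicious_aggs(problem_response))  # ['donors__donor_id']
```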

But then, if you switch the SQON to use a "not_in" operation, we get this correct response:

{
  "data": {
    "file": {
      "hits": {
        "total": 1642
      },
      "aggregations": {
        "donors__donor_id": {
          "bucket_count": 5,
          "buckets": [
            {
              "doc_count": 877,
              "key": "DO250472"
            },
            {
              "doc_count": 478,
              "key": "DO253000"
            },
            {
              "doc_count": 148,
              "key": "DO35085"
            },
            {
              "doc_count": 138,
              "key": "DO252999"
            },
            {
              "doc_count": 1,
              "key": "DO250391"
            }
          ]
        }
      }
    }
  }
}

Notice that the totals add up (1660 = 18 + 1642), which suggests the SQONs themselves are not entirely broken 🤣
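
The same arithmetic can be pushed down to the bucket level to estimate what the "in" response should have contained. This is a sketch under two assumptions: the "in" and "not_in" filters split the documents into two disjoint sets (which the totals support), and each file is counted in exactly one donor bucket (which the bucket sums of 1660 and 1642 support). The variable names are illustrative only.

```python
# Bucket counts copied from the unfiltered and "not_in" responses above.
unfiltered = {
    "DO250472": 877,
    "DO253000": 478,
    "DO35085": 163,
    "DO252999": 138,
    "DO250326": 3,
    "DO250391": 1,
}
not_in = {
    "DO250472": 877,
    "DO253000": 478,
    "DO35085": 148,
    "DO252999": 138,
    "DO250391": 1,
}

# Bucket sums match the reported hit totals.
assert sum(unfiltered.values()) == 1660
assert sum(not_in.values()) == 1642

# Whatever "not_in" removed from each bucket is what "in" should return.
expected_in = {}
for key, count in unfiltered.items():
    remainder = count - not_in.get(key, 0)
    if remainder > 0:
        expected_in[key] = remainder

print(expected_in)                # {'DO35085': 15, 'DO250326': 3}
print(sum(expected_in.values()))  # 18, matching the "in" hits total
```

If those assumptions hold, the "in" query should have returned 2 buckets rather than 0.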

justincorrigible commented 5 months ago

Potentially related past tickets: