osmlab / to-fix-backend

The to-fix server
BSD 3-Clause "New" or "Revised" License
15 stars 13 forks source link

Add `item_tags` filter to quadkeys end-point #244

Closed batpad closed 6 years ago

batpad commented 6 years ago

This one might be slightly tricky.

batpad commented 6 years ago

Ok, my hunch was correct - this turns out to be a rabbit hole.

We run into a couple of issues with how we are using Sequelize. First, let me back-up with what we want to do here:

We want to query items, filtered by certain attributes, and return:

The key point here is that we are grouping items on a slightly non-standard thing - i.e. not the primary key, but the substring value of one of its properties.

This works perfectly fine when filtering by attributes on the Item model itself, with the generated SQL looking something like:

SELECT COUNT(substring("quadkey", 1, 8)) AS "item_count", substring("quadkey", 1, 8) AS "quadkey" FROM "items" AS "item" WHERE "item"."project_id" = '00000000-0000-0000-0000-000000000000' AND "item"."quadkey" LIKE '0000%' AND "item"."lockedTill" < '2017-11-23 08:26:55.766 +00:00' GROUP BY substring("quadkey", 1, 8);

The Sequelize code to generate the above query would look something like:

Item.findAll({
{
    attributes: [
      [
        Sequelize.fn(
          'COUNT',
          Sequelize.fn('substring', Sequelize.col('quadkey'), 1, zoomLevel)
        ),
        'item_count'
      ],
      [
        Sequelize.fn('substring', Sequelize.col('quadkey'), 1, zoomLevel),
        'quadkey'
      ]
    ],
    where: {
      'lockedTill': request.query.lockedTill
    },
    group: [Sequelize.fn('substring', Sequelize.col('quadkey'), 1, zoomLevel)]
  }
});

^^ Here we get a count of all items within the quadkey 0000, filtered by items with a lockedTill attribute less than '2017-11-23 08:26:55.766`. This works great.

When we want to filter by an attribute on a related model, i.e. Tags, Sequelize tells us that we need to include that model, to get it to generate the required JOIN SQL. We'd add something like this to our Sequelize code above:

search.include = [
      {
        model: db.Tag,
        attributes: [],
        through: {
          attributes: []
        },
        as: 'tags',
        where: {
          id: {
            [Op.in]: req.query.item_tags.split(',')
          }
        }
      }
    ];

This, should, in theory, filter our item counts to count only items with tags specified in the item_tags query parameter. The first problem is that Sequelize adds all the columns of the join tables, which makes it impossible for us to then GROUP BY only quadkey substring (SQL requires any fields that are not aggregations in the SELECT clause to be present in the GROUP BY clause.). This ticket explains this problem: https://github.com/sequelize/sequelize/issues/5481

There is a weird workaround where one can add includeIgnoreAttributes = false; to the search object, and then this no longer includes all attributes from the join table. However, since we are calling the findAll method on the Item model, it ALWAYS attempts to include the primary key of the Item model in the SELECT clause. And this seems to be by design and wontfix on the Sequelize end: https://github.com/sequelize/sequelize/issues/1468

So, as a result, using the Sequelize ORM, the SQL generated for this we get is:

SELECT "item"."auto_id", COUNT(substring("quadkey", 1, 8)) AS "item_count", substring("quadkey", 1, 8) AS "quadkey" FROM "items" AS "item" INNER JOIN ( "item_tag" AS "tags->item_tag" INNER JOIN "tags" AS "tags" ON "tags"."id" = "tags->item_tag"."tagId") ON "item"."auto_id" = "tags->item_tag"."itemAutoId" AND "tags"."id" IN ('11111111-1111-1111-1111-111111111111') WHERE "item"."project_id" = '00000000-0000-0000-0000-000000000000' AND "item"."quadkey" LIKE '0000%' GROUP BY substring("quadkey", 1, 8);

^ Which, of course, is invalid and results in an error like: ERROR: column "item.auto_id" must appear in the GROUP BY clause or be used in an aggregate function

If we modify the above SQL to be just:

SELECT COUNT(substring("quadkey", 1, 8)) AS "item_count", substring("quadkey", 1, 8) AS "quadkey" FROM "items" AS "item" INNER JOIN ( "item_tag" AS "tags->item_tag" INNER JOIN "tags" AS "tags" ON "tags"."id" = "tags->item_tag"."tagId") ON "item"."auto_id" = "tags->item_tag"."itemAutoId" WHERE "item"."project_id" = '00000000-0000-0000-0000-000000000000' AND "item"."quadkey" LIKE '0000%' GROUP BY substring("quadkey", 1, 8);

^ This gives us what we want. Notice the missing "item.auto_id" from the SELECT clause.

I cannot see a way to do this using the Sequelize ORM, and the only way to construct the query we need seems to be generating the raw SQL string and passing it to the Sequelize .raw() method. This seems terrible and like something we want to avoid. Going to let this sit for a day or two to see if any better solutions emerge.

cc @kepta @emilymdubois @coxchapman @maning

batpad commented 6 years ago

An alternative solution here would be to re-model how we store tags in the database.

This seems like a roundabout way of working around the current issue, but I think is the only one which lands us with maintainable code in the long run. The proposal, thus, is:

This will involve changing / almost entirely re-writing the tags end-points. However, it's the only solution that does not involve some ghastly code to deal with the above issue.

batpad commented 6 years ago

After a bit of back and forth on this, and consulting with a few people, decided to just go the raw SQL route here.

In the end, it seemed a bad idea to re-model the database due to a limitation in an application-level library, and raw SQL seemed a reasonable solution. Also, going down the rabbit-hole of changing the db structure made me realize that it was a significant endeavour without any real benefit (and perhaps performance penalties down the line)

This is implemented here: https://github.com/osmlab/to-fix-backend/pull/251

coxchapman commented 6 years ago

Sounds like a good decision @batpad!