pubpub / platform

Open-source technology for creating full-stack knowledge applications for communities of all types.
https://knowledgefutures.org/pubpub
GNU General Public License v2.0
21 stars 2 forks source link

feat: add new recursive pub fetch query #746

Closed tefkah closed 1 week ago

tefkah commented 2 weeks ago

Issue(s) Resolved

Resolves #734 I want to do another pass which includes using this function in various parts of the app/api, ideally in a bunch of separate PRs.

High-level Explanation of PR

Adds a new "god" function for fetching pubs, which is able to fetch related pubs and children, fetch all the same information as the old getPub(s) functions, plus a little more.

I do not intend us to use this function most of the time in components, but instead have some more specific functions like getPubsForCommunity, or getPub or something which use this function with some default settings.

You can still fetch pubs either by their id, pubTypeId, stageId, or communityId, and limit, offset, order by the results. There are a number of intersting differences however, which I will go through below.

Differences with old getPubs function

SQL

This query should be somewhat more efficient than the old getPubs function, specifically the withChildren part.

As we can see here, the old function fetches a bunch of columns, the values, and the pubType in the recursive part of the query from the pubs table

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L171-L205

However, in getPubBase, instead of retrieving this information from the children CTE we just created, we do another look through the pubs table, only getting the children from the children CTE

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L236-L274

This is kind of inefficient, as this requires an extra excursion through the pubs table.

Instead in the new query, we already retrieve all values and most other data in the recursive part of the query, and then select into the CTE we just fetched instead

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L982-L992

While solely gut-feeling based, I feel like this is probably more efficient. We fetch (almost) all the columns we need in the recursive part, and then just kind of clean that up at the end, rather than doing two big fetching steps.

Input

You can no longer specify specifc columns of the pubs table to fetch. I did not think this functionality was particularly useful, as most of those colmuns you either almost always want, or almost never want.

Output

Support for the assignee is dropped, as we want to transition away from this. Let me know if this is still required.

The output is mostly the same. The main difference is that I opted to fetch the pubValues as an array of values, rather than as an object of slugs with values. This is because we now fetch relatedPubs, which would force some ugly modifications to the current object in a number of ways:

  1. We would need a { value: 'relationshipValue', relatedPub: { ... } } value for slugs of fields which are relationships, as we cannot simply either attach the value of the relationship field or just the relatedPub, as both constitute the relationship.
  2. Relationship values would need to be arrays, as a single relationship field can be attached to multiple pubs.

So the values object would have to be typed in any of the following ways

// keeping the normal values the same
type Values = {
    [slug in string]: unknown | { value: unknown, relatedPub: Pubs }[]
}

// also making the normal values an object, only relationship values an array
type Values = {
    [slug in string]: { value: unknown } | { value: unknown, relatedPub: Pubs }[]
}

// make everything an array, even though normal pub values are always singular
type Values = {
    [slug in string]: { value: unknown, relatedPub?: Pubs }[]
}

I thought neither of these options very good, and since we almost always convert the values object to an array anyway, and we already talked about making the values an array previously, I decided to go ahead with it.

Currently the values have some information about the field they belong to on the field property, like so (ignore the MaybeWithRelatedPub thingy)

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L749-L766

Maybe this does not need to be a nested object, e.g. schemaName and slug could just be properties on this value object like so

    values: ({
        id: PubValuesId;
        fieldId: PubFieldsId;
        value: unknown;
        relatedPubId: PubsId | null;
        createdAt: Date;
        updatedAt: Date;
        /**
         * Information about the field that the value belongs to.
         */
        schemaName: CoreSchemaType;
        slug: string;
    } & MaybeWithRelatedPub<Options>)[];

Let me know what you prefer!

Types

I made it so that the output type of the new function depends on the input parameters in a number of ways.

  1. If pubId is used the "where" clause, return a single Pub. Otherwise return an array. This is accomplished through function overloads here.

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L789-L811

Theoretically this could also be accomplished in the generic return type, as the others are. I implemented the function overloads before the generic return type, hence why it's like this. If you'd rather me make all type info be defined through the generic return type, let me know. That is a bit cleaner.

  1. The withX options. I think a table would be most useful. Here the type in the true and false column represent what the output type matches with when the option is true or false. The default column the default setting for the option.
Option true false default
withChildren { children: ProcessedPub<Options>[] } { children?: undefined } true
withRelatedPubs { values: { value: unknown, relatedPub: ProcessedPub<Options> }[] } { values: { value: unknown }[] } true
withPubType { pubType: ProcessedPubType } { pubType?: undefined } false
withStage { stage: ProcessedPubStage } { stage?: undefined } false

I think this is nice, as it makes it harder to make mistakes when fetching stuff. However you could argue it makes things more complicated and that this specific type information should be left to more specific functions which use this "god" function. Let me know what you think!

Other functionality

Max depth

This function is able to fetch a pub and all its children/related pubs up to a given depth. By default it is set to 2, so it will fetch the pub and all its children and related pubs, but not their children or related pubs.

This is done by incrementing a depth column in the CTE here

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L962-L963

and then limiting the results with a where clause here

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L973

Cycle detection

In addition to the depth setting, this query also has built in cycle detection, and will not get stuck in an infinite loop even if there are cycles in the pub tree, even if depth is set to Infinity (which we probably should not allow in the first place).

This is done by keeping track of the path we've taken so far in the path column and setting a flag if we see a pub that is already in the path.

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L964-L971

We then filter those out here:

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L973-L974

The query also allows you to specify whether the cycle detection should be inclusive or exclusive, meaning whether the first pub that is detected to be circular should be included or excluded.

This allows you to e.g. show the user that the child of this pubs child is the same as this pub more easily, as you can just look at the path or isCycle flag on the pub. This is the default behavior. If it's set to exclude, you will not see the first pub that is part of a cycle.

Field slugs

This query allows you to specify a list of field slugs to fetch, and will only fetch the values of the fields that match the slugs.

This is done here:

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L944-L946

Note that this currently does not make any distinction between relationship fields and other fields. So if you set fieldSlugs to ['title', 'description'], you will get all the values of the title and description fields, but will not get the values or related pubs of the relationship fields.

An improvement here would be to allow the user to specify which fields to exclude, rather than only which to include.

Note that this also applies to the children and related pubs, so if you specify fieldSlugs to fetch, you will only get those fields for the pub itself and its children/related pubs. Maybe this could be improved, but I don't really know if we intend to do ever do this, or what a good API would be for specifying this. E.g. how would you specify "give me all fields for the top level pubs, but only the title and description fields for the children/related pubs"?

Excluding children and related pubs

This query allows you to specify whether you want to exclude children or related pubs. This is done by only running the recursive part of the CTE if either withChildren or withRelatedPubs is true (which are the default settings).

And by not joining the pubs table on parentId or relatedPubId if withChildren or withRelatedPubs is false respectively.

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L906-L933

Searching

This has been scrapped for now, I'll add this back in when #750 lands.

~I have also included (incredibly primitive) searching functionality on the Pubs. This is only done on the top level (so not on the children or related pubs).~

https://github.com/pubpub/platform/blob/d7d5ff8eb0b4d7bd36ac226b238efe0396777019/core/lib/server/pub.ts#L902-L904

~This forcebly converts the value of the fields to text and does a case-insensitive search.~

~In the future I would want to limit this to only isTitle fields of the type @allisonking is implementing in #750, or at least make it so that the user can specify which fields to search on.~

~Likely there's a much better way to do this, e.g. using Postgres' full text search. I have zero experience with this however.~

Future work

Test Plan

  1. Run the tests! I wrote quite a few extensive tests for this function.
    • Some are kind of large, which sometimes makes it a bit hard to identify why they failed. I could add some more granular tests, let me know if you'd like me to do that.

Screenshots (if applicable)

Vids

General structure

https://github.com/user-attachments/assets/b52fcf98-8565-4993-85dc-22db7f108618

Deep dive into the recursive part of the query

https://github.com/user-attachments/assets/854e1d85-fbc7-4d28-aa51-e42dfd173414

https://github.com/user-attachments/assets/a6e5c15e-8170-4825-8864-3dabfe754eca

Notes

tefkah commented 1 week ago

Really nice work!

I am thinking about how to make it easy to query whether the user has read/write capabilities for the pub entirely inside the db and I think we can do it. But I think the best way to achieve that is to store the mapping between role/{pub,stage,community,form}_membership and a set of capabilities in a database table, rather than in the codebase somewhere. I'll keep it in mind as I work on the capabilities queries!

Haven't looked at it at all, but maybe this is something? https://github.com/ben-pr-p/kysely-access-control/tree/main