turbot / steampipe-plugin-googledirectory

Use SQL to instantly query users, groups, domains and more from Google Directory. Open source CLI. No DB required.
https://hub.steampipe.io/plugins/turbot/googledirectory
Apache License 2.0
11 stars 1 forks source link

Speed up fetching group members by listing members concurrently #88

Open laurensknoll opened 2 months ago

laurensknoll commented 2 months ago

Is your feature request related to a problem? Please describe. When querying all googledirectory_group_member using the next query, the members are not queried in parallel, making it slow:

        select
            g.customer_id as customer_id,
            g.id as group_id,
            g.name as group_name,
            m.id as member_id,
            m.type as member_type,
            m.email as member_email,
            m.status as member_status
        from googledirectory.googledirectory_group as g
        join googledirectory.googledirectory_group_member as m on m.group_id = g.id
        where g.customer_id = 'my_customer_id'

Describe the solution you'd like The preferred solution executes the googledirectory_group_member.list-methods in parallel, using a configurable concurrency to reach the Admin SDK rate limit of 2400 req/min.

Describe alternatives you've considered

Additional context In the plugin log, the group_member-calls are listed sequentially:

2024-04-15 14:57:50.951 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1713193070432: Plugin execute table name: googledirectory_group quals: customer_id &{=} string_value:"...",  (1713193070432)
...
2024-04-15 14:57:53.681 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1713193073235: Plugin execute table name: googledirectory_group_member quals: group_id &{=} string_value:"...",  (1713193073235)
2024-04-15 14:57:53.963 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1713193073115: Plugin execute table name: googledirectory_group_member quals: group_id &{=} string_value:"...",  (1713193073115)
2024-04-15 14:57:54.294 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1713193074677: Plugin execute table name: googledirectory_group_member quals: group_id &{=} string_value:"...",  (1713193074677)
2024-04-15 14:57:54.575 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1713193074546: Plugin execute table name: googledirectory_group_member quals: group_id &{=} string_value:"...",  (1713193074546)
...

And, consequently, the Admin SDK API monitoring lists 1 to 2 req/second.

ParthaI commented 2 months ago

Hello @laurensknoll,

Thank you for bringing this issue to our attention.

I've reviewed the current code and noted that when querying the googledirectory_group_member table, the group_id parameter is essential for calling the List Members API.

I've implemented some updates in the issue-88 branch to enhance how members are listed concurrently:

Updated Table Behavior in issue-88:

I have modified the query plan to make it faster. The behavior of the query(mentioned by you in the issue description) will remain the same as we are making the API call by passing the Group ID.

WITH GroupDetails AS (
    SELECT
        id AS group_id,
        customer_id,
        name AS group_name
    FROM
        googledirectory.googledirectory_group
    WHERE
        customer_id = 'my_customer_id'
),
MemberDetails AS (
    SELECT
        id AS member_id,
        group_id,
        type AS member_type,
        email AS member_email,
        status AS member_status
    FROM
        googledirectory.googledirectory_group_member
)
SELECT
    g.customer_id,
    g.group_id,
    g.group_name,
    m.member_id,
    m.member_type,
    m.member_email,
    m.member_status
FROM
    GroupDetails g
    JOIN MemberDetails m ON m.group_id = g.group_id;

Please try pulling the changes from issue-88, build the plugin locally, and run the above query. Your feedback would be greatly appreciated.

Steps to build the plugin locally with the changes from issue-88:

Looking forward to your feedback. Thanks again!

ParthaI commented 2 months ago

Hello, @laurensknoll, Did you get an opportunity to give it a try?

laurensknoll commented 2 months ago

Hi @ParthaI , Thanks for preparing the improvement. I've run both version to check performance. However, more interesting, the improvement branch returns different results. This is not expected, because the WHERE direct_member_count > 0 should filter out all the groups without members.

Query

query "google_group_members" {
    sql = <<-EOQ
        select
            googledirectory_group.customer_id as customer_id,
            googledirectory_group.id as group_id,
            googledirectory_group.name as group_name,
            googledirectory_group_member.id as member_id,
            googledirectory_group_member.type as member_type,
            googledirectory_group_member.email as member_email,
            googledirectory_group_member.status as member_status
        from googledirectory.googledirectory_group
        left join googledirectory.googledirectory_group_member
            on googledirectory_group_member.group_id = googledirectory_group.id
        where googledirectory_group.customer_id = $1
          and googledirectory_group.direct_members_count > 0
        limit 200
    EOQ

    param "customer_id" {
        description = "$1, Google Workspace customer ID."
    }
}

Performance

Old:

time steampipe query 'monitoring.query.google_group_members(customer_id => “…”)’
steampipe query   7.70s user 1.98s system 12% cpu 1:16.63 total

New:

time steampipe query 'monitoring.query.google_group_members(customer_id => “…”)’
steampipe query   18.63s user 3.55s system 14% cpu 2:29.76 total

Results

Old:

+-------------+-----------------+-------------------------+-----------------------+-------------+
| customer_id | group_id        | group_name              | member_id             | member_type |
+-------------+-----------------+-------------------------+-----------------------+-------------+
| ...   | 00lnxb... | group1 | 10365.. | USER        |
| ...   | 00lnxb... | group1 | 11481.. | USER        |
| ...   | 00lnxb... | group1 | 11315.. | USER        |
| ...   | 00ihv6... | group2 | 041mg.. | GROUP       |
| ...   | 02szc7... | group3 | 01gf8.. | GROUP       |
| ...   | 02p2cs... | group4 | 040ew.. | GROUP       |

New:

+-------------+-----------------+-------------------------+-----------+-------------+--------------+---------------+
| customer_id | group_id        | group_name              | member_id | member_type | member_email | member_status |
+-------------+-----------------+-------------------------+-----------+-------------+--------------+---------------+
| ...   | 00ln.. | group1  | <null>    | <null>      | <null>       | <null>        |
| ...   | 02p2.. | group2  | <null>    | <null>      | <null>       | <null>        |
| ...   | 04an.. | group3  | <null>    | <null>      | <null>       | <null>        |
| ...   | 01x0.. | group4  | <null>    | <null>      | <null>       | <null>        |
| ...   | 02sz.. | group5  | <null>    | <null>      | <null>       | <null>        |
| ...   | 049x.. | group6  | <null>    | <null>      | <null>       | <null>        |
| ...   | 01mr.. | group7  | <null>    | <null>      | <null>       | <null>        |
| ...   | 00ih.. | group8  | <null>    | <null>      | <null>       | <null>        |
| ...   | 025b.. | group9  | <null>    | <null>      | <null>       | <null>        |
| ...   | 049x.. | group10 | <null>    | <null>      | <null>       | <null>        |
| ...   | 02lw.. | group11 | <null>    | <null>      | <null>       | <null>        |
ParthaI commented 2 months ago

@laurensknoll Thank you for testing the PR branch. I've committed another update to resolve the output result discrepancies for the specified query. In the enhancement branch, we have structured the table to enable parallel API calls by configuring the ParentHydrate within the table definition.

The table will retrieve all groups in your workspace, then it will execute parallel API calls for each group to fetch the group members.

Have you tried the following query:

WITH GroupDetails AS (
    SELECT
        id AS group_id,
        customer_id,
        name AS group_name
    FROM
        googledirectory.googledirectory_group
    WHERE
        customer_id = 'my_customer_id'
),
MemberDetails AS (
    SELECT
        id AS member_id,
        group_id,
        type AS member_type,
        email AS member_email,
        status AS member_status
    FROM
        googledirectory.googledirectory_group_member
)
SELECT
    g.customer_id,
    g.group_id,
    g.group_name,
    m.member_id,
    m.member_type,
    m.member_email,
    m.member_status
FROM
    GroupDetails g
    JOIN MemberDetails m ON m.group_id = g.group_id;

which is equivalent to

   select
            g.customer_id as customer_id,
            g.id as group_id,
            g.name as group_name,
            m.id as member_id,
            m.type as member_type,
            m.email as member_email,
            m.status as member_status
        from googledirectory.googledirectory_group as g
        join googledirectory.googledirectory_group_member as m on m.group_id = g.id
        where g.customer_id = 'my_customer_id'

Could you please pull the latest changes from the issue-88 branch, build the plugin locally, and then execute the query? Let us know if you encounter any discrepancies in data between the old and new changes.

Thanks!

laurensknoll commented 2 months ago

Hi @ParthaI , thanks for the update.

When running the query, many 404's are returned by the API. These seem to be caused by the ParentHydrate query for the group (this API requires either customer_id or domain to be configured). Can you take a look at that using the additional evidence below.

BTW In terms of concurrency, the number of requests seems to stick at ~4 req/sec. We should strive for 30 req/sec or so.

2024-04-30 12:24:46.206 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: calling fetchItems, table: googledirectory_group_member, matrixItem: [], limit: -1  (googledirectory-1714479886454)
2024-04-30 12:24:46.206 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: QueryData streamRows (googledirectory-1714479886454)
2024-04-30 12:24:46.206 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: setRequestSubscriber readRowsAsync publisher googledirectory-1714479886454 (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: initialiseRateLimiters for query data 0x14000216780 (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: getHydrateCallRateLimiter (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: resolveRateLimiterScopeValues (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: HydrateCall tags map[function_name:listDirectoryGroups]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: Table tags map[table:googledirectory_group_member]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: QueryData rateLimiterScopeValues map[connection:googledirectory]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: merged scope values map[connection:googledirectory function_name:listDirectoryGroups table:googledirectory_group_member]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: resolvedRateLimiterConfig: no rate limiters (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: getHydrateCallRateLimiter (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: resolveRateLimiterScopeValues (googledirectory-1714479886454)
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: HydrateCall tags map[function_name:listDirectoryGroupMembers]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: Table tags map[table:googledirectory_group_member]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: QueryData rateLimiterScopeValues map[connection:googledirectory]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: merged scope values map[connection:googledirectory function_name:listDirectoryGroupMembers table:googledirectory_group_member]
2024-04-30 12:24:46.209 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: resolvedRateLimiterConfig: no rate limiters (googledirectory-1714479886454)
2024-04-30 12:24:46.383 UTC [WARN]  steampipe-plugin-googledirectory.plugin: [WARN]  1714479886454: doList callHydrateWithRetries (googledirectory-1714479886454) returned err googleapi: Error 404: Domain not found., notFound
2024-04-30 12:24:46.383 UTC [WARN]  steampipe-plugin-googledirectory.plugin: [WARN]  1714479886454: QueryData StreamError googleapi: Error 404: Domain not found., notFound (googledirectory-1714479886454)
2024-04-30 12:24:46.383 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: rowData chan returned nil - wait for rows to complete (googledirectory-1714479886454)
2024-04-30 12:24:46.383 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: buildRowsAsync goroutine returning (googledirectory-1714479886454)
2024-04-30 12:24:46.383 UTC [INFO]  steampipe-plugin-googledirectory.plugin: [INFO]  1714479886454: QueryData streamRows DONE (googledirectory-1714479886454)
image
ParthaI commented 1 month ago

Sorry @laurensknoll for being radio silent on this matter for a while, I will give it a shot and let you know. Thanks!