koltyakov / gosip

⚡️ SharePoint SDK for Go
https://go.spflow.com
MIT License
140 stars · 32 forks

[Question/Advice] Search for all user profiles in a very large SPO environment? #79

Closed f0oster closed 2 days ago

f0oster commented 1 week ago

Cool project! I'm learning go currently and I have a few SharePoint issues at work to get through, so it seemed like a good time to check this out. :)

What is the right way to enumerate all user profiles within a very large SPO environment (400-500k users)?

My end goal is to build a tool that syncs SPO user profiles / site UILs to a database optimized for on-demand queries. That would help streamline the cleanup of orphaned users in our site UILs during user offboarding, as well as assist with identifying and cleaning up mismatched user IDs throughout our environment.

I was unable to page my searches using StartRow, as it's seemingly capped at 50,000 rows as per the MS documentation. The documentation suggests the approach below instead, which I've given a go (tried both DocId > {id} and IndexDocId > {id}):

package main

import (
    "fmt"

    "github.com/koltyakov/gosip/api"
)

func fetchUserProfiles(sp *api.SP, docID string, profiles []map[string]string) ([]map[string]string, error) {
    fmt.Printf("Fetching profiles starting from DocId: %s\n", docID)
    res, err := sp.Search().PostQuery(&api.SearchQuery{
        QueryText:        fmt.Sprintf("DocId > %s AND *", docID),
        RowLimit:         500,
        RowsPerPage:      500,
        Timeout:          10000,
        SelectProperties: []string{"AccountName", "UserProfileProperties", "DocId"},
        TrimDuplicates:   false,
        SourceID:         "b09a7990-05ea-4af9-81ef-edfab16c4e31",
        EnableSorting:    true,
        SortList: []*api.SearchSort{
            {
                Property:  "DocId",
                Direction: 1,
            },
        },
    })
    if err != nil {
        return nil, err
    }

    results := res.Results()
    if len(results) == 0 {
        return profiles, nil
    }

    profiles = append(profiles, results...)
    fmt.Printf("Total profiles fetched so far: %d\n", len(profiles))

    // Recurse with the last DocId seen to fetch the next page.
    lastElement := results[len(results)-1]
    return fetchUserProfiles(sp, lastElement["DocId"], profiles)
}

I'm authed to the root site for the tenant currently.

The above returned 125,250 profiles, which is a good start, but I have strong doubts that this is the correct number of profiles. I was expecting closer to ~500,000, based on how many licensed user accounts we have in Entra ID.

koltyakov commented 1 week ago

Hi @f0oster! Thanks for the interest in the library!

I'd suggest trying to disable TrimDuplicates and comparing the results with and without it.

UPD: I missed that you had already set TrimDuplicates to false.

When returning results in the UI, for instance, what number of estimated items does the Search API return?

It could come down to limitations in SPO, which might need extra workarounds, like searching profiles with names starting with A, B, and so on, if it can't chew through half a million.

I guess you're syncing profiles from Search to another DB to run aggregations, etc. An incremental sync could then use a criterion based on update date to trim down results even more.
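The alphabet-partitioning idea above could be sketched as generating one KQL query per initial letter and running each through the paged fetch separately. This is only a sketch: "PreferredName" is an assumed managed property name, so adjust it to whatever property your search schema actually exposes for profile names.

```go
package main

import "fmt"

// buildPrefixQueries returns one KQL query text per initial letter, so a
// very large profile crawl can be split into smaller, independent result
// sets. "PreferredName" is an assumed managed property name.
func buildPrefixQueries() []string {
	queries := make([]string, 0, 26)
	for c := 'a'; c <= 'z'; c++ {
		queries = append(queries, fmt.Sprintf("PreferredName:%c*", c))
	}
	return queries
}

func main() {
	// Each query text would be fed into sp.Search().PostQuery(...) in turn.
	for _, q := range buildPrefixQueries() {
		fmt.Println(q)
	}
}
```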

f0oster commented 1 week ago

Thanks for your response.

I'm fairly new to managing SharePoint, but the User Profile and Site User List areas of SharePoint really seem to be showing their age. I guess it's expected considering how long SharePoint has been supported and how much legacy tooling is built on top of it, but it's frustrating nonetheless.

Currently, the User Profile Service Application reports a total of 282,371 profiles. This number is lower than I anticipated, so I'll need to investigate further. It's still quite odd that I'm getting less than half of that back using the library. I'll test with more specific filters (e.g., profiles with names starting with A, B, C, and so on through to Z) to see if that yields better results. Thanks for the suggestion.

The end goal is to export all the user profile data, as well as the site user list data for every site in the tenant to an on-prem database. The database would act as a read-only cache for on-demand queries. It would also provide a mechanism to query for site UIL memberships for a given user, and report on any potential orphaned memberships or site ID mismatches as of the last cache update so we can proactively address issues and clean up sites when users are offboarded.

Currently, whenever these requirements arise, we need to traverse every site collection in the entire SPO environment, which is painfully slow and seems quite wasteful. Having the data cached in a database optimized for these sorts of queries means it will take seconds rather than ~20+ hours to get useful data. It seems a bit crazy that I'm going down this rabbit hole, so if there's a better way to go about doing this, please let me know... :D

I think keeping the user profile cache updated once the initial data is exported should be fairly simple. I'm not sure about site user lists though - we have a few hundred thousand sites at least, and querying them all will be very slow and pretty expensive. I'm open to any suggestions if you have any there.

koltyakov commented 1 week ago

Yeah, the amounts definitely complicate the task.

I believe you can query UILs as ordinary lists, which is what they are at the end of the day. There are approaches to efficiently fetch only changed data from a list; this readme highlights how to do that: https://github.com/koltyakov/spsync

There can be a difference between Entra ID users and SharePoint: if a user has never opened SharePoint Online, they won't necessarily be in UPS. Same with UILs: if a user never opened a site collection and was never selected in a lookup, they won't be in the UIL.

Btw, did you try taking users you weren't able to find in the "full crawl" but know exist in Entra ID, and querying them individually in search or anywhere in the SharePoint UI?

This is out of scope for this library, but for your overall task you could also experiment with getting users from the Graph API (https://learn.microsoft.com/en-us/graph/api/user-list?view=graph-rest-1.0&tabs=http).

f0oster commented 1 week ago

I managed to resolve this issue after changing the sort direction and filtering by IndexDocId. Everything is working smoothly now, and I can fetch all the profiles successfully! 😊
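For anyone else hitting the StartRow cap, the working deep-paging approach can be sketched as an iterative loop that carries the last IndexDocId forward. This is a sketch: the stub fetcher below stands in for a real sp.Search().PostQuery call using a QueryText like "IndexDocId > {last} AND *" sorted ascending on IndexDocId.

```go
package main

import "fmt"

// fetchPage returns one page of results keyed by property name; an empty
// slice signals the end. In a real sync this would wrap the Search API call.
type fetchPage func(lastID string) ([]map[string]string, error)

// fetchAllProfiles pages through search results iteratively (instead of the
// recursion in the earlier snippet) by carrying the last IndexDocId forward.
func fetchAllProfiles(fetch fetchPage) ([]map[string]string, error) {
	var all []map[string]string
	lastID := "0"
	for {
		page, err := fetch(lastID)
		if err != nil {
			return nil, err
		}
		if len(page) == 0 {
			return all, nil
		}
		all = append(all, page...)
		lastID = page[len(page)-1]["IndexDocId"]
	}
}

func main() {
	// Stub fetcher: two pages then done, standing in for the Search API.
	pages := [][]map[string]string{
		{{"AccountName": "a", "IndexDocId": "10"}, {"AccountName": "b", "IndexDocId": "20"}},
		{{"AccountName": "c", "IndexDocId": "30"}},
	}
	i := 0
	stub := func(lastID string) ([]map[string]string, error) {
		if i >= len(pages) {
			return nil, nil
		}
		p := pages[i]
		i++
		return p, nil
	}
	profiles, _ := fetchAllProfiles(stub)
	fmt.Println(len(profiles)) // 3
}
```

An iterative loop also avoids growing the call stack across a thousand or so pages, and makes it easy to add retries or checkpointing per page.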

I think you're probably right - we don't pre-provision OneDrive for users, and if a user has never opened OneDrive, it's likely they won't have one and consequently won't show up anywhere in SharePoint either. I'll likely validate this a bit later by enumerating our users via Graph and checking for the mySite property to find all the users with OneDrive attached. For now it's not super important.

I'll be sure to check out spsync, thanks! It sounds like it will come in handy.

I have another question regarding the user profiles returned by the search API.

The site user list REST API includes an internal ID that’s quite useful:

<d:UserId m:type="SP.UserIdInfo">
    <d:NameId>100320016feb059e</d:NameId>
    <d:NameIdIssuer>urn:federation:microsoftonline</d:NameIdIssuer>
</d:UserId>

This ID comes in handy when resolving user ID mismatches that occur when a user is deleted and later re-created with the same UPN.

On the user profile, the ID is represented like this; from memory, the field is referred to as the SID:

SID: i:0#.f|membership|100320016feb059e@live.com

I haven’t found a way to include or access this ID when performing the user profile searches, so I’m querying each user individually to retrieve it. Is there a way to include this field when fetching the user profiles via the search API?

Just to be clear, I'm not talking about the LoginName - that field contains the UPN for the user. The SID field contains some internal SharePoint ID which can be used to map the UIL user back to the user profile itself - it's mentioned here.
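Given the two formats quoted above, mapping a profile SID back to a UIL NameId looks like plain string handling. A minimal sketch, assuming the SID always follows the "prefix|prefix|{nameid}@domain" shape shown:

```go
package main

import (
	"fmt"
	"strings"
)

// nameIDFromSID extracts the internal name identifier from a profile SID
// claim such as "i:0#.f|membership|100320016feb059e@live.com", so it can be
// matched against the UIL's <d:NameId> value. Returns "" if the claim does
// not contain a "|" separator.
func nameIDFromSID(sid string) string {
	i := strings.LastIndex(sid, "|")
	if i < 0 {
		return ""
	}
	tail := sid[i+1:]
	if j := strings.Index(tail, "@"); j >= 0 {
		return tail[:j]
	}
	return tail
}

func main() {
	fmt.Println(nameIDFromSID("i:0#.f|membership|100320016feb059e@live.com"))
	// → 100320016feb059e
}
```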

koltyakov commented 1 week ago

I managed to resolve this issue after changing the sort direction and filtering by IndexDocId. Everything is working smoothly now, and I can fetch all the profiles successfully! 😊

Congrats! That's great news! Such a thing deserves a technical article, similar to MS's, but with a correction on what to use as the sort field.

koltyakov commented 1 week ago

Regarding the SID, so far I have not managed to find it in the search properties.

f0oster commented 1 week ago

Hi @koltyakov,

Cool - I had a feeling that might be the case. I'll stick with querying the users directly for now.

Thank you again for your time, btw.

I'm not a SharePoint expert, and probably never will be! ...but I do have a feeling this library will be really helpful and a major timesaver for me.

Given the size of our Entra / SPO environments, scripting scheduled cleanup and maintenance activities seems like it's going to be a lot easier with this library than with PowerShell.

I've used various SharePoint PowerShell modules extensively, including both the Microsoft SPO module and the community-managed PnP module. The PnP module is honestly fantastic, but it's hard to get full value out of it at our tenant size. PowerShell really seems to struggle with the larger datasets - it slows to a crawl and needs a lot of manual coaxing to keep the memory footprint reasonable. Without implementing logic to try to force the GC to clean things up, the footprint seems to grow indefinitely. With pipeline usage it's a massive challenge.

If I end up writing up anything on the project I'm working on, I'll let you know!

f0oster commented 2 days ago

A small update: Using PowerShell to index all sites and their user lists into memory took about 40 hours, and that's without the Profile SID lookup (+1 additional query per user) and without exporting/storing the data. Querying the data in memory with PowerShell, as you'd probably expect, was not really sustainable.

Using Go and gosip, I was able to index all profiles from the User Profile Service and fetch their profile SIDs (280k users), query all sites (200k sites), fetch all site user list entries (5.5m), and store everything in a SQLite database for querying, all in about 16 hours. On top of that, I can now easily query all site memberships (orphaned and active), identify user profile mismatches, and find sites based on their activity / storage consumption as of the last export, in a few seconds. Great success!

I haven't yet decided if I will bother with tracking deltas. I'm expecting that I'll need to query each individual site to check whether it has changed anyway, and given how fast that is and how simple the changes are, I probably won't bother. I think I'll just run a full export into a clean database each time, or upsert during the export and then drop anything that wasn't updated in the most recent export (i.e. deleted users/sites/site UIL entries).
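The upsert-then-prune idea can be sketched as tagging each row with the ID of the export run that last touched it, then dropping rows whose run ID is stale. A sketch under those assumptions; record and SyncRun are illustrative names, and in SQLite this would be a DELETE ... WHERE sync_run <> ? after the export finishes.

```go
package main

import "fmt"

// record is a cached row tagged with the export run that last upserted it.
type record struct {
	Key     string
	SyncRun int
}

// pruneStale keeps only rows touched by the current export run; anything
// older corresponds to deleted users/sites/UIL entries and is dropped.
func pruneStale(rows []record, currentRun int) []record {
	kept := rows[:0]
	for _, r := range rows {
		if r.SyncRun == currentRun {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	rows := []record{{"alice", 2}, {"bob", 1}, {"carol", 2}}
	fmt.Println(len(pruneStale(rows, 2))) // 2 (bob was not seen in run 2)
}
```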

Thanks again for your awesome work on this library. 👍

koltyakov commented 2 days ago

Thank you @f0oster for the update and the use case description!