Design question - Githubissues

yevgenypats commented 1 year ago

Hey @koltyakov, Im starting to kick the tires on this one so I can open an initial version. A few Sharepoint and gosip related question. What would make sense for CloudQuery to sync, as far as I can understand sharepoint model and APIs, you need to specify what you want to select and what fields? is this correct? In that case I can make it a required configuration in addition to auth so the user can configure what they want to sync/fetch?

Thanks! Yevgeny

koltyakov commented 1 year ago

Hey @yevgenypats, thanks for kicking the tires!

SharePoint API is relatively huge with different entities and hierarchical structures. However, the most common artifacts which deserves syncing to a reporting friendly destinations are lists. Lists are the heart of customizations with business data involved. A list is a sort of a table (actually it's a bit more complex under the hood, but the most oversimplified analogy is a table). The structure of a list is defined in a custom manner. I'd say that in a sync scenario if would be a definition which lists to sync, which fields in each list to touch. Maybe the tricky part would be a destination as it's not a static structure, but always different based on lists structure. Some core fields are common for all the lists, such as ID, version, created by/at, modified by/at, title, others are defined by a developer or a power user. Also, the data can be stored in different sites (workspaces).

So, assuming we start with a bare minimal yet already useful functionality:

Define a site to fetch data from
Defile lists to pull from
Each list can have an array if fields to provide explicitly and some core fields which always the case (it's very common that the set of fields is extended with time due to user needs)

*I'm omitting lot's of extra details such as content types, lookups, storing same entities types in a different sites/lists to do not overwhelm initial thinking.

Do you think it's feasible that the destination can be universal or automatically structured with CloudQuery approaches based on dynamic input configuration?

yevgenypats commented 1 year ago

Thanks for the detailed guide! I think that makes sense and yes, CloudQuery has notion of dynamic tables and schema that we can produce in runtime given a response from the server.

Can you point me to an example of how the API looks like to fetch from defined set of lists and how to parse the schema returns from those APIs?

koltyakov commented 1 year ago

Please check this sample https://github.com/koltyakov/sp-go-lists-sample I'll DM you the creds file.

yevgenypats commented 1 year ago

Thank you! This was super helpful and the creds all worked!

I've the draft that Im working on right now here - https://github.com/koltyakov/cq-source-sharepoint/pull/2.

Right now I've a question if I can get somewhere the list of all possible types in the Sharepoint model so I can convert them to CloudQuery types and take advantage of the CQ type system (we don't need to handle all types as we can have a default to string but still this would be very useful to have as many types as possible converted to the right type so it can be reflected appropriately in any database).

koltyakov commented 1 year ago

That's really cool @yevgenypats

I'm away from computer today. Will send SP List Field types and payload sample tomorrow.

koltyakov commented 1 year ago

@yevgenypats please check the list of most common field types in SharePoint lists. Many could be normalized to the same types in destination. SharePoint differentiates types due to formatting and historical aspects. Anyways, I'll catch up with such specifics when will be applying changes in the plugin according to the platform specifics.

SharePoint Field Types:

Field Data Type	Variation	Payload Sample
Text	Single	`"FieldTextSingle"`
Text	Multiline	`"FieldTextMulti\nFieldTextMulti\nFieldTextMulti"`
Number	Int	`123`
Number	Float	`123.45`
Currency		`123.99`
Yes/No (bool)		`true` / `false`
Person or group	Single	`10` (an ID from user information list)
Person or group	Multi	`[10, 12]` (IDs from user information list)
Date time		"2023-02-20T15:58:10Z" (ISO String)
Choice	Single	`"Choice 1"`
Choice	Multi	`["Choice 1", "Choice 2"]`
Hyper link		`{ "Description": "GitHub", "Url": "https://github.com" }`
Managed Metadata	Single	`{ "Label": "1", "TermGuid": "2e2b8c97-b1fc-4ad5-a17d-562b74cd9903", "WssId": 1 }`
Managed Metadata	Multi	Array of object with same structure as single variant
Lookup	Single	`1` (ID of an item from a lookup list)
Lookup	Multi	`[1, 2]` (IDa of items from a lookup list)

disq commented 1 year ago

@koltyakov Hi, I've been trying to implement the data sync part... I have a question on how get to the item.ToMap() keys from the POV of api.FieldInfo? I'm creating schema.Columns for each table, using FieldInfo.Title for the column names (lower cased and normalized) is that the correct approach? Some titles clash so I sometimes have to incrementally rename them (if there's already a name the second column becomes name_2, third name_3 and so on) but I suspect I might be on the wrong path... Should I use InternalName, or some other column? Ideally I would be OK with getting a value for each field.ID (so uniqueness is guaranteed) but ToMap() seems to return names...

koltyakov commented 1 year ago

Hi @disq, InternalName is something to use, internal names are unique within a list and immutable. Display names (title) can be changed anytime in UI and are not generally something to rely on.

disq commented 1 year ago

@koltyakov when I get the fields, it returns about ~80 columns. Some of them don't exist in the data but there are other columns instead. So I think I'm still missing something?

Here's a couple of log lines processing the list ListA:

12:08PM DBG item keys keys=["AttachmentFiles","Attachments","AuthorId","ComplianceAssetId","ContentType","ContentTypeId","Created","EditorId","Field1","Field2","Field3","Field4","FieldValuesAsHtml","FieldValuesAsText","FieldValuesForEdit","File","FileSystemObjectType","FirstUniqueAncestorSecurableObject","Folder","GUID","GetDlpPolicyTip","ID","Id","LikedByInformation","Modified","OData__UIVersionString","ParentList","Properties","RoleAssignments","ServerRedirectedEmbedUri","ServerRedirectedEmbedUrl","Title","Versions","__metadata"] module=sharepoint-src table=sharepoint_list_a

12:08PM WRN missing columns in result missing_columns=["AccessPolicy","AppAuthor","AppEditor","Author","BaseName","ContentVersion","Created_x0020_Date","DocIcon","Edit","Editor","EncodedAbsUrl","FSObjType","FileDirRef","FileLeafRef","FileRef","File_x0020_Type","FolderChildCount","HTML_x0020_File_x0020_Type","InstanceID","ItemChildCount","Last_x0020_Modified","LinkFilename","LinkFilename2","LinkFilenameNoMenu","LinkTitle","LinkTitle2","LinkTitleNoMenu","MetaInfo","NoExecute","Order","OriginatorId","ParentUniqueId","PermMask","PrincipalCount","ProgId","Restricted","SMLastModifiedDate","SMTotalFileCount","SMTotalFileStreamSize","SMTotalSize","ScopeId","SelectTitle","ServerUrl","SortBehavior","SyncClientId","UniqueId","WorkflowInstanceID","WorkflowVersion","_CommentCount","_CommentFlags","_ComplianceFlags","_ComplianceTag","_ComplianceTagUserId","_ComplianceTagWrittenTime","_CopySource","_EditMenuTableEnd","_EditMenuTableStart","_EditMenuTableStart2","_HasCopyDestinations","_IsCurrentVersion","_IsRecord","_Level","_ModerationComments","_ModerationStatus","_UIVersion","_UIVersionString","_VirusInfo","_VirusStatus","_VirusVendorID","owshiddenversion"] module=sharepoint-src table=sharepoint_list_a

12:08PM WRN extra columns found in result extra_columns=["AttachmentFiles","AuthorId","EditorId","FieldValuesAsHtml","FieldValuesAsText","FieldValuesForEdit","File","FileSystemObjectType","FirstUniqueAncestorSecurableObject","Folder","GetDlpPolicyTip","Id","LikedByInformation","OData__UIVersionString","ParentList","Properties","RoleAssignments","ServerRedirectedEmbedUri","ServerRedirectedEmbedUrl","Versions","__metadata"] module=sharepoint-src table=sharepoint_list_a

First log line shows which keys exist in the result. There's AttachmentFiles, AuthorId, EditorId, FieldValuesAsHtml etc. as extra data (according to the last log line - I'm guessing some of these need to be ignored) but here's also no data for the columns specified in line 2 (e.g. no AccessPolicy column in the data).

koltyakov commented 1 year ago

Hi @disq,

There are many service props in a list (and not only) object model. Not always they are needed for a data/app scenario.

In the API there are 2 OData modificators which allows limiting or extending props in the response: $select and $expand.

By default API doesn't return some props as getting them could be a heavy operations with extra joins on the backend, when they needed they explicitly requested within select/expand(projection).

On the other hand, by default it also can be too many column returned which are not needed for a view purposes. So it's very common and a good practice to explicitly provide only needed subset of fields in $select just like one do in SQL.

In an ETL scenario I would say that only selected fields should be synced to a destination and a few mandatory (even if omitted on input settings) such as Id, Created, Modified, maybe also AuthorId, EditorId, version and FSObjType.

I hope this shed some light.

koltyakov commented 1 year ago

If to assume that a default plugin behavior would be selecting all $select=* (however, I'd intend users to provide only what's needed, and defaulted empty choice to Title only and whose service fields we always be dragging) when we would definitely ignore some props. I can provide a criteria what to ignore (a hardcoded list of props + fields attribute based) a bit later today.

disq commented 1 year ago

@koltyakov thanks for the replies, I was able to come up with a 'SELECTive' approach. Unfortunately it had to come with a lot of options to tweak:

    // Lists to fetch, if empty all lists will be fetched
    Lists []string `json:"lists"`

    // ListFields is a map of list name to list of fields to fetch, if empty all DefaultFields will be fetched
    ListFields map[string][]string `json:"list_fields"`

    // DefaultFields is the fields to fetch if not specified in ListFields
    DefaultFields []string `json:"default_fields"`

    // IgnoreFields is the fields to always ignore
    IgnoreFields []string `json:"ignore_fields"`

    // FieldOverrides is a map of field name to type, used to override the detected type. If the field does not exist in the definitions, it will be forcibly added.
    FieldOverrides map[string]string `json:"field_overrides"`

    // pkColumn is the primary key column name, defaults to "Id"
    pkColumn string

I've set DefaultFields to Id, Created, Modified, Title, AuthorId, EditorId, FSObjType
Set the default IgnoreFields value to __metadata
Set 'FieldOverrides' to the Integer type for these 4 columns which don't come up in the field listing: AuthorId, EditorId, Id, FSObjType and forced Id to always be there and be the Primary Key.

koltyakov commented 1 year ago

@disq my suggestion for initial design:

Ignore the following fields until specifically provided (when it comes to checking /list/fields metadata):

Fields with internal name starting with "_"
Hidden fields (Hidden != true)
Computed fields (FieldTypeKind != 12)
Hardcoded list of fields:
- 'ComplianceAssetId', 'Attachments', 'AppAuthor', 'AppEditor', 'ItemChildCount', 'FolderChildCount' (will amend with time)

UPD: We commented almost simultaneously. :)

koltyakov commented 1 year ago

Forget to mentioned, the API will also return something extra which is not a field, e.g. __metadata or other nested entities. I'd suggest to ignore all of these so far as well.

disq commented 1 year ago

@koltyakov Thanks for the update... is there a way to determine if a field needs $expand or $select? After widening the field criteria seems like it's trying to select FieldPersonOrGroupSingle and fails. Or should I use expand?

koltyakov commented 1 year ago

An expand is the way to drill down, e.g. ?$select=Author/Title,Author/EMail&$expand=Author.

Such field types as SP.FieldUser, SP.FieldLookup are lookups for the values stored in other lists.

While it's possible to drill down with a joins, it's limited and encourage not a normalized sync. In an ETL I'd encourage folks to sync simpler and normalized data arrays, then join data in SQL/reporting by IDs.

Lookup fields have the following specific, while having a field name e.g. "Department", the field name in a response would be "DepartmentId" (Id suffix). It contains numeric ID of a lookup entity item(s).

Lookup fields also can't be selected without Id suffix or expand and prop drill down / syntax (e.g. ?$select=Department/Title&$expand=Department).

disq commented 1 year ago

I've now added Lookup/User/LookupMulti/UserMulti support, removed some options (no more DefaultFields or FieldOverrides) and it seems to work.

koltyakov commented 1 year ago

@yevgenypats @disq the initial version you've provided is great starting point. I was managed to run it locally and sync some lists to postgresql and sqlite destinations.

I will work on designing plugin specification to make it scalable and logical for SharePoint.

Also created some other issue items I will work on. Just for visibility.

yevgenypats commented 1 year ago

Amazing @koltyakov thanks for taking the lead on that and helping with the sharepoint expertise! Feel free to ping us if any questions and/or help needed around sdk or in genral.

yevgenypats commented 1 year ago

We will also update on our website early next week under our list of plugins a link to this sharepoint plugin as well for visibility and search.

One thing that you will need also is to tag a v1.0.0 so the goreleaser would run and create a release. this way cloudquery CLI will be able to download it via specifying name: koltyakov/cq-source-sharepoint for a source plugin

koltyakov commented 1 year ago

@yevgenypats @disq, I added base documentation and applied some mostly minor changes to make the plugin ready to the first version release.

I will research incremental scenarios later on for the post v1.0.0. As even now it's quite OK:

But we can do better, as there is the way triggering only changed data for the lists of any size without fetching all. When using change API it would be possible to sync what's changed in a seconds even for large Lists (M of items). Yet, I'd need to wrap my head around how CloudQuery deal with incremental strategies.

Also, I hope that the amendments of the spec will allow scaling up to other API object model entities which might be handy exporting to fast storages with ETL.

One design decision which probably is not very common for CloudQuery sources but I truly believe should be better for SharePoint lands (as platform could be throttling naughty and also having too many extra and not needed for the export objects) is only grabbing explicitly listed entities and same for the properties.

@disq your initial code was extremely useful, I really appreciate your efforts, sorry if I pivoted or removed something, all decisions were hard and only driven by the platform aspects of what might work better with SharePoint.

Technically, we should be ready for releasing the first version next week. Once again thank you guys! I have not seen such pace and passion for a long time in industry. You are fabulous! 🎉

yevgenypats commented 1 year ago

Thank you for the detailed update and the warm words!! 🙏

one thing that should actually be available is grabbing explicitly listed entities. I think if you specify the entities under tables: ["entity1", "entity2"] (under the main source spec) this should sync only those entities (This should even support simple glob like entity*). If it doesn't work that prob it's a bug on our end.

koltyakov commented 1 year ago

grabbing explicitly listed entities

I mean I amended plugin logic to only fetch what's listed, don't try all the things when nothing is provided.

Btw, I pushed v1.0.0 tag and checked that github plugin registry works.

update on our website early next week under our list of plugins a link to this sharepoint plugin

Please let me know what's the approach for the documentation the plugin page, should I craft a separate README.md with spec notes? As I documented not only how to use plugin but also a quick guide for CloudQuery + plugin in main README.md, which might be a bit redundant for experience CloudQuery users.

yevgenypats commented 1 year ago

I think one README in this repo with the spec would be a good approach for now. We will link directly to this GitHub from our plugins list.

koltyakov / cq-source-sharepoint

Design question #1