CloudQuery Source Plugin?

yevgenypats commented 1 year ago

Hi Team, hopefully this is the right place to ask; if not, I'd appreciate it if you could direct me.

I'm the founder of cloudquery.io, a high-performance open source ELT framework.

Our users are interested in a Plaid plugin, but as we cannot maintain all the plugins ourselves, I was curious whether this would be an interesting collaboration, where we would help implement an initial source plugin and you would help maintain it.

This will give your users the ability to sync Plaid data to any of their data lakes, data warehouses, or databases easily, using any of the growing list of CQ destination plugins.

Best, Yevgeny

phoenixy1 commented 1 year ago

@yevgenypats thanks for reaching out! I think this could be interesting, we'd definitely like to learn more. Some questions that I have include:

What would the maintenance burden look like, and what kind of changes would require making updates to the plugin?
Which customers have expressed interest in the plugin?

Since I imagine you might not want to answer these questions (especially the second one) on a public forum, feel free to reach out to me privately via Twitter DM (@plaiddev) or email (ahoffer@plaid.com).

yevgenypats commented 1 year ago

Hi @phoenixy1, that's great to hear!

Maintaining the plugin is relatively straightforward and consists of two things:

When you introduce a new API and want to be able to sync it with CloudQuery, you will need to add a table for it to the plugin. See the Stripe example. This is usually less than 20 LOC per API. It can be done either proactively when a new API is added, or you can wait for one of the plugin's users to ask for the missing API.
Updating the plugin when we introduce new features in our SDK, if you want to pick them up (this can be automated with Dependabot).

The interested customers are usually ones that have something like a CDP (Customer Data Platform), where they extract all customer and financial data from vendor APIs into their data lakes so they can do further analysis.

Following up also via email! Thank you 🙏

erezrokah commented 1 year ago

Hi everyone 👋 You can find the initial version of the Plaid plugin here: https://github.com/cloudquery/cq-source-plaid

Please let us know what you think

phoenixy1 commented 1 year ago

Thanks!! I'll take a more comprehensive look later, just a couple of thoughts off the top of my head:

It would be good to be more explicit in the documentation that the means of getting an access token in the README is only suitable for testing and that clients will need to build out a hosted frontend UI (including update mode, OAuth redirection, etc.) and a backend token exchange flow (one that doesn't involve copying and pasting the token from the client into the plugin) if they want to use real user data in a Production environment.
It would be good to describe exactly which Plaid APIs / endpoints are supported in the README. I see you don't have Identity, and I might suggest adding that one as it's pretty popular.
Based on a quick glance at the code, I'm not sure whether this will work in a way that satisfies your users' use cases for some of the more dynamic products like Transactions, where business logic is required to construct an accurate picture of what's going on in a user's account, so that might be something to think about. I don't know enough about how your plugin is used to make a definitive statement on this, however.
When you test, I would definitely try running this in Development and making sure the plugin is robust to things like new fields being added or API calls erroring out; this happens not infrequently in real-world environments.

erezrokah commented 1 year ago

Hi @phoenixy1, sorry for the late reply. Thank you for the thoughtful feedback. I'll go over the issues you mentioned and follow up

erezrokah commented 1 year ago

> It would be good to be more explicit in the documentation that the means of getting an access token in the README is only suitable for testing and that clients will need to build out a hosted frontend UI (including update mode, OAuth redirection, etc.) and a backend token exchange flow (one that doesn't involve copying and pasting the token from the client into the plugin) if they want to use real user data in a Production environment.

Can you explain this a bit more? I don't have enough context on how Plaid works in Production. I thought generating an access token was a one-time thing, as it's long-lived and can be used to query data later on. Why is a hosted frontend UI with OAuth required? Can't the provided example app be used to generate a long-lived token? If not, can you please point me to some docs on how to do it, so I can link to them from the README?

> It would be good to describe exactly which Plaid APIs / endpoints are supported in the README. I see you don't have Identity, and I might suggest adding that one as it's pretty popular.

Great point, I made that clearer in https://github.com/cloudquery/cq-source-plaid/pull/6 and added Identity in https://github.com/cloudquery/cq-source-plaid/pull/5 (for some reason I couldn't get it to work initially, but I tested it again and it does work).

> Based on a quick glance at the code, I'm not sure whether this will work in a way that satisfies your users' use cases for some of the more dynamic products like Transactions, where business logic is required to construct an accurate picture of what's going on in a user's account, so that might be something to think about. I don't know enough about how your plugin is used to make a definitive statement on this, however.

Usually the information will be saved in a database (e.g. Postgres), so as long as all the information is available, users can query it based on their needs. Is there data that is missing that we should add?

> When you test, I would definitely try running this in Development and making sure the plugin is robust to things like new fields being added or API calls erroring out; this happens not infrequently in real-world environments.

Regarding API calls erroring out, we have retry logic in place (https://github.com/cloudquery/cq-source-plaid/blob/b5bde008364d7a79853f5a053f206efa339dc53d/client/client.go#L52). Is there anything we should add to the default policy?
As for new fields, would those be reflected in an update to the Go client? If so, those should get added once we update the dependency. Each destination has logic to handle migration of existing tables, though some changes might be breaking (e.g. removal of a field, type change). We release those as a new major version of the plugin.

Thanks again for the feedback. We would love to hear additional thoughts.

phoenixy1 commented 1 year ago

> > It would be good to be more explicit in the documentation that the means of getting an access token in the README is only suitable for testing and that clients will need to build out a hosted frontend UI (including update mode, OAuth redirection, etc.) and a backend token exchange flow (one that doesn't involve copying and pasting the token from the client into the plugin) if they want to use real user data in a Production environment.

> Can you explain this a bit more? I don't have enough context on how Plaid works in Production. I thought generating an access token was a one-time thing, as it's long-lived and can be used to query data later on. Why is a hosted frontend UI with OAuth required? Can't the provided example app be used to generate a long-lived token? If not, can you please point me to some docs on how to do it, so I can link to them from the README?

Yeah, once you have an access token, you're for the most part good. However, at most banks, if the user ever changes their password, the access token breaks and the user needs to go through Link again using something called update mode.

But the big problem is the process of getting the access token. Unless the use case for Plaid is a hobbyist one, where a developer is just building something for their own personal use, the person putting the credentials into Link is the end user of the app, NOT the developer. Unless the developer is hosting Link, the customer doesn't have any way to access it (because I assume you're not asking your end customers to run a self-hosted app and then email you the access token or something), and it's not a good security practice to expose the access token client-side. So if you were doing this in real life, you as the developer would need to have a server your end user could go to, log into Link, and get a public token, and then have that public token be exchanged for an access token on your server.

The OAuth stuff comes into play mostly on mobile devices. Basically, for banks that use OAuth-based connections, the developer sometimes has to do some extra work to get Link working properly on mobile, since there's a redirect to the bank website during the Link flow.

> > It would be good to describe exactly which Plaid APIs / endpoints are supported in the README. I see you don't have Identity, and I might suggest adding that one as it's pretty popular.

> Great point, I made that clearer in cloudquery/cq-source-plaid#6 (fix(docs): Make it more clear what resources are supported) and added Identity in cloudquery/cq-source-plaid#5 (feat: Add Identity) (for some reason I couldn't get it to work initially, but I tested it again and it does work).

> > Based on a quick glance at the code, I'm not sure whether this will work in a way that satisfies your users' use cases for some of the more dynamic products like Transactions, where business logic is required to construct an accurate picture of what's going on in a user's account, so that might be something to think about. I don't know enough about how your plugin is used to make a definitive statement on this, however.

> Usually the information will be saved in a database (e.g. Postgres), so as long as all the information is available, users can query it based on their needs. Is there data that is missing that we should add?

So the way Plaid Transactions works is that it's a subscription product where you pay for monthly access to transactions. Most customers of the Transactions product will be calling the transactions endpoints daily or multiple times a day, and the logic around the endpoint will need to make sure to fetch only new transactions that haven't been seen before. However, I think it's fine if the business logic to reconcile transactions is on the customer, as long as they are expecting that.

> > When you test, I would definitely try running this in Development and making sure the plugin is robust to things like new fields being added or API calls erroring out; this happens not infrequently in real-world environments.

> Regarding API calls erroring out, we have retry logic in place (https://github.com/cloudquery/cq-source-plaid/blob/b5bde008364d7a79853f5a053f206efa339dc53d/client/client.go#L52). Is there anything we should add to the default policy?

> As for new fields, would those be reflected in an update to the Go client? If so, those should get added once we update the dependency. Each destination has logic to handle migration of existing tables, though some changes might be breaking (e.g. removal of a field, type change). We release those as a new major version of the plugin.

> Thanks again for the feedback. We would love to hear additional thoughts.

Yes, new fields will be reflected in an update to the Go client.
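
For readers wondering what "robust to new fields" can look like on the Go side, here is a minimal sketch using only the standard library. The `account` struct and `decodeAccount` helper are hypothetical names for illustration; whether plaid-go's generated models decode the same way is an assumption to verify, not something confirmed in this thread. It shows that `encoding/json` ignores unknown fields by default, so a field added upstream does not break decoding, but it also is not captured until the struct is updated.

```go
package main

import "encoding/json"

// account is a hypothetical, pared-down model, not plaid-go's real type.
type account struct {
	ID   string `json:"account_id"`
	Name string `json:"name"`
}

// decodeAccount parses a JSON payload into account. Fields the struct
// does not declare are silently skipped by json.Unmarshal.
func decodeAccount(raw []byte) (account, error) {
	var a account
	err := json.Unmarshal(raw, &a)
	return a, err
}
```

Calling `decodeAccount` on a payload that carries an extra, undeclared key returns the known fields without an error, which matches the point above that new fields only show up after a dependency update.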

phoenixy1 commented 1 year ago

Sorry, I should also follow up that with regard to the retry logic, it's not uncommon for data sources (banks) to be down for hours / days at a time and then come back up, so you should expect that sometimes your API calls just won't work.
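
A rough sketch of what that implies for retry handling: retry a few times with exponential backoff to ride out short blips, then give up and surface the error rather than retrying forever. `retryWithBackoff` and `errInstitutionDown` are hypothetical names for illustration and are not taken from the plugin's `client.go` or from plaid-go.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errInstitutionDown is a hypothetical sentinel standing in for whatever
// error the API client returns while a bank is unreachable.
var errInstitutionDown = errors.New("institution temporarily unavailable")

// retryWithBackoff calls fn up to maxAttempts times, doubling the wait
// after each failure. It returns nil on the first success, or the last
// error wrapped with the attempt count once attempts are exhausted.
func retryWithBackoff(maxAttempts int, initialDelay time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	delay := initialDelay
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	// Short retries cannot outlast an outage of hours or days, so report
	// the failure and let the caller re-run the sync later.
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}
```

For example, `retryWithBackoff(4, time.Second, fetch)` would wait 1s, 2s, and 4s between attempts before returning the wrapped error.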

erezrokah commented 1 year ago

Thanks for the detailed explanation @phoenixy1!

Re: transactions, we currently get all of them, but by default we'll remove the stale ones from previous syncs (a sync is a single execution of the CloudQuery CLI). We also have support for incremental resources (i.e. getting data based on a cursor), but I haven't implemented it yet. So by default consumers of the data will not see any duplicates, and we can add support for getting only new transactions to improve performance.
Re: retry logic, if the API calls don't work, an error will be reported and users can re-run CloudQuery at a later stage.
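
To make the cursor idea concrete, here is a minimal sketch of an incremental fetch loop. All of the names (`Transaction`, `Page`, `FetchPage`, `syncNew`) are hypothetical and chosen for illustration; this is not the plugin's implementation and does not claim to mirror the CloudQuery SDK or the Plaid API shape.

```go
package main

// Transaction is a hypothetical, pared-down record; real transactions
// carry many more fields.
type Transaction struct {
	ID     string
	Amount float64
}

// Page is one batch of results plus the cursor to resume from.
type Page struct {
	Transactions []Transaction
	NextCursor   string
	HasMore      bool
}

// FetchPage stands in for an API call returning items after cursor.
type FetchPage func(cursor string) (Page, error)

// syncNew drains every page after the stored cursor. It returns the new
// transactions and the cursor to persist for the next sync.
func syncNew(start string, fetch FetchPage) ([]Transaction, string, error) {
	var fresh []Transaction
	cursor := start
	for {
		page, err := fetch(cursor)
		if err != nil {
			// Keep the old cursor so the next run resumes from the same point.
			return nil, start, err
		}
		fresh = append(fresh, page.Transactions...)
		cursor = page.NextCursor
		if !page.HasMore {
			return fresh, cursor, nil
		}
	}
}
```

Persisting the returned cursor only after the results are written means a failed sync can simply be re-run without skipping or duplicating transactions.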

I think the main remaining challenge is the access token, and understanding who is the person using CloudQuery: the end user, or the developer setting up the frontend and backend. Would the following make sense (so I can document it):

Developer sets up a production environment with a frontend client and a backend
Users go to the frontend, authenticate with an institution to kick off token exchange flow
Backend saves the access token(s) after the token exchange flow is done
In a separate process/job developer runs CloudQuery using access tokens to get data

Does that summarize your intention?

erezrokah commented 1 year ago

> Does that summarize your intention?

Something along the lines of https://github.com/cloudquery/cq-source-plaid/pull/7/files

phoenixy1 commented 1 year ago

> Would the following make sense (so I can document it):
>
> Developer sets up a production environment with a frontend client and a backend
> Users go to the frontend, authenticate with an institution to kick off token exchange flow
> Backend saves the access token(s) after the token exchange flow is done
> In a separate process/job developer runs CloudQuery using access tokens to get data
>
> Does that summarize your intention?

Yes, exactly! There are some extra nuances for a more robust integration, but I think this describes the MVP / happy path flow pretty well.
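
The backend half of that happy path (exchange the public token, store the access token, hand the stored tokens to a separate sync job) could be sketched as below. `exchangeFunc`, `tokenStore`, and their methods are hypothetical names for illustration; the exchange callback stands in for the real server-side token exchange call, and an in-memory map stands in for what would be an encrypted database in production.

```go
package main

import (
	"errors"
	"sync"
)

// exchangeFunc stands in for the server-side call that trades a
// short-lived public token for a long-lived access token.
type exchangeFunc func(publicToken string) (accessToken string, err error)

// tokenStore keeps access tokens on the backend, keyed by user ID.
type tokenStore struct {
	mu     sync.Mutex
	tokens map[string]string
}

func newTokenStore() *tokenStore {
	return &tokenStore{tokens: make(map[string]string)}
}

// handleLinkSuccess runs on the backend after a user finishes Link in the
// hosted frontend. The access token is stored server-side and is never
// returned to the client.
func (s *tokenStore) handleLinkSuccess(userID, publicToken string, exchange exchangeFunc) error {
	if userID == "" || publicToken == "" {
		return errors.New("userID and publicToken are required")
	}
	accessToken, err := exchange(publicToken)
	if err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[userID] = accessToken
	return nil
}

// accessTokens returns the stored tokens (in no particular order) for the
// separate job that runs the sync.
func (s *tokenStore) accessTokens() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.tokens))
	for _, t := range s.tokens {
		out = append(out, t)
	}
	return out
}
```

The extra nuances mentioned above, such as update mode after a password change, are deliberately left out of this sketch.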

phoenixy1 commented 1 year ago

OK, looking at the latest version -- I haven't tried to run it and I'm not super familiar with CloudQuery Source, but it seems mostly pretty reasonable as far as I can tell? Let me know if there's anything you have specific questions or concerns about.

erezrokah commented 1 year ago

Hi @phoenixy1 thanks for the additional review and follow up. I think we're good on our end if you're happy with the first version.

What we would hope to see next is a backlink and a short blurb from the Plaid docs to CloudQuery. Not sure what would be the best place to put it (https://plaid.com/docs/api/libraries/ or https://plaid.com/docs/api/ or whatever works for you).

Something along the lines of:

[CloudQuery Plaid plugin](https://github.com/cloudquery/cq-source-plaid) extracts data from Plaid and loads it into any supported CloudQuery destination (PostgreSQL, Snowflake, BigQuery, S3...).

Will that work for you?

phoenixy1 commented 1 year ago

Doing some cleanup: I forgot to comment here, but we added the requested linkback a while ago, so I'm closing this ticket!

erezrokah commented 1 year ago

Thanks for the comment @phoenixy1, and for adding the backlink. For reference, for people watching this issue, it's here: https://plaid.com/docs/resources/#third-party-resources

plaid / plaid-go

CloudQuery Source Plugin? #263