twentyhq / twenty

Building a modern alternative to Salesforce, powered by the community.
https://twenty.com

Improve Data Export Performance #6828

Open · BOHEUS opened 2 weeks ago

BOHEUS commented 2 weeks ago

Scenario:

  1. Log in
  2. Generate thousands of records of any object (e.g. Companies)
  3. Click on Options > Export in order to export them

Actual: Exporting thousands of records takes too long (several minutes), and the site is laggy for the end user while generating a CSV of ~520 kB

Expected: Exporting thousands of records should take a few minutes at most, and the site should remain usable as normal (should the export happen in the background on a different thread?)

bachtly commented 1 week ago

Reproducible

Hi @BOHEUS, I can reproduce it with ~3k company records, which take ~60s to process, and the downloaded file is ~400KB.

Investigation

After investigating, I found that the export is slow even for only 3k records because of pagination (limit 30), which results in ~100 API calls. While a single API call is cheap (~50ms), the overhead per call is large (~400ms), and the calls are serial, not parallel: ~100 calls × ~450ms ≈ 45s, which matches the observed ~60s.

Proposed solutions

I've come up with some possible solutions:

  1. Add a config with a larger page size used for export only, maybe around 300-1000?
  2. Add a config for the expected payload size (maybe 512KB-1MB?) and compute the page size from it (see the sketch after this list). For example, the first request may show that 30 records take 30KB, so we simply increase the page size to 512 for the following requests.
  3. Delegate the export operation to the backend, then stream the file to the client somehow. Assuming that for a CRM the user base is not that large and the number of downloads per day is low, this may not consume much server resource. We could then schedule a job to periodically clean up the temp files.
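For concreteness, here is a minimal sketch of how the adaptive page size in (2) could be computed from the first response; `TARGET_PAYLOAD_BYTES`, `MAX_PAGE_SIZE`, and `computeNextPageSize` are hypothetical names I made up for illustration, not anything in the codebase:

```ts
// Hypothetical helper: derive the next page size from the bytes
// observed in the first page of results.
const TARGET_PAYLOAD_BYTES = 512 * 1024; // assumed 512KB target per request
const MAX_PAGE_SIZE = 1000;              // assumed safety cap

function computeNextPageSize(
  firstPageRecordCount: number,
  firstPageBytes: number,
): number {
  // Average size of one serialized record, observed from the first request.
  const bytesPerRecord = firstPageBytes / firstPageRecordCount;
  // How many records fit in the target payload, clamped to a sane range.
  const estimated = Math.floor(TARGET_PAYLOAD_BYTES / bytesPerRecord);
  return Math.max(firstPageRecordCount, Math.min(estimated, MAX_PAGE_SIZE));
}

// Example from above: 30 records weighed in at 30KB,
// so ~512 records fit in a 512KB payload.
console.log(computeNextPageSize(30, 30 * 1024)); // 512
```

With this, a 3k-record export would drop from ~100 requests to fewer than 10.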

I consider solution (3) ideal, but it is quite complex and needs big changes on both the backend and the frontend, e.g. how to display progress and how to stream the CSV file to the client. Solution (1) is quite naive, while solution (2) requires smaller changes and could work out of the box for up to 10^5 records.

Hi @Bonapara, which solution do you prefer? Could I take this?

Bonapara commented 1 week ago

@FelixMalfait will have a better answer for you ;)

FelixMalfait commented 1 week ago

Yes, (3) is ideal and was the long-term plan. The frontend was never the solution for a lot of records; that's why a cap was set.

One way to accelerate this would be to change this: https://github.com/twentyhq/twenty/blob/main/packages/twenty-front/src/modules/object-record/record-action-bar/hooks/useRecordActionBar.tsx#L77 ; this delay was probably useless (it was meant to avoid crashing the app / giving it time to breathe, but it probably wasn't smart)
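For readers following along, here is a hypothetical sketch of the pattern being described (a serial fetch loop that sleeps between pages), not the actual `useRecordActionBar.tsx` code; the names and the delay value are assumptions:

```ts
// Hypothetical sketch of the serial, delayed fetch loop described above.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

type Page = { records: unknown[]; nextCursor?: string };

async function fetchAllRecords(
  fetchPage: (cursor?: string) => Promise<Page>,
): Promise<unknown[]> {
  const all: unknown[] = [];
  let cursor: string | undefined;
  do {
    const { records, nextCursor } = await fetchPage(cursor);
    all.push(...records);
    cursor = nextCursor;
    // The delay in question: pausing between pages to "give the app
    // time to breathe". Removing or shrinking it speeds up the export
    // at the cost of tighter request spacing.
    await sleep(100); // assumed value, for illustration only
  } while (cursor);
  return all;
}
```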

The number of records per page could also be increased once we get rid of pg_graphql.

Parallel API calls are probably not a good idea, as they could DDoS the server quite quickly today.

I'd rather do (3), but this is a complex task indeed: we need to create a dedicated GraphQL resolver, store the file on S3, clean up/delete the file automatically after some time, etc.
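To make the shape of (3) concrete, here is a rough sketch of what such a dedicated resolver could look like in a NestJS-style backend (twenty's server is NestJS); every name here is hypothetical and the bodies are placeholders, not a design decision:

```ts
// Hypothetical shape of the backend-driven export described in (3);
// none of these names exist in the codebase.
import { Args, Mutation, Query, Resolver } from '@nestjs/graphql';

@Resolver()
export class RecordExportResolver {
  // Kick off the export as a background job and return a job id
  // so the frontend can poll for progress.
  @Mutation(() => String)
  async startRecordExport(
    @Args('objectName') objectName: string,
  ): Promise<string> {
    // 1. enqueue a job that pages through all records server-side,
    // 2. write the CSV to S3,
    // 3. record a signed download URL when done.
    return 'job-id'; // placeholder
  }

  // Poll for the signed S3 URL once the CSV is ready; a scheduled
  // job would delete the file after some time, as described above.
  @Query(() => String, { nullable: true })
  async recordExportDownloadUrl(
    @Args('jobId') jobId: string,
  ): Promise<string | null> {
    return null; // placeholder until the job finishes
  }
}
```

The frontend would then only need to call the mutation, poll the query, and open the returned URL, which keeps the heavy pagination entirely on the server.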