Increase DB chunk size for respondent files

instedd / surveda

InSTEDD Surveda

https://instedd.org/technologies/surveda-mobile-surveys/

GNU General Public License v3.0

Increase DB chunk size for respondent files #2360

Closed matiasgarciaisaia closed 3 months ago

matiasgarciaisaia commented 3 months ago

Respondent files are usually large (Interactions files can grow up to 1M rows), and the "low" limit in queries made the DB work much more than needed (we've observed 99% CPU usage in the mysqld process when generating a 1M-row interactions file with 1000 rows per query).

Increasing this limit makes the app issue fewer queries to the DB, effectively driving the CPU usage down to about 30% instead.
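To make the effect of the limit concrete, here is a rough back-of-the-envelope sketch of how many queries a paginated export needs for a given chunk size. It is written in Python purely as an illustration (the app itself is Elixir), the function name `queries_needed` is hypothetical, and the 50k and 100k values are the limits mentioned later in this thread, not necessarily the value that was merged.

```python
def queries_needed(total_rows: int, chunk_size: int) -> int:
    """Number of LIMIT-ed queries needed to read total_rows in chunks.

    Illustrative only: ignores any extra final query an implementation
    may issue to detect that there are no more rows.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return (total_rows + chunk_size - 1) // chunk_size


if __name__ == "__main__":
    total = 1_000_000  # the 1M-row interactions file from the description
    for chunk in (1_000, 50_000, 100_000):
        print(chunk, queries_needed(total, chunk))
```

With 1000 rows per query, a 1M-row file takes 1000 round trips to the DB; a limit in the tens of thousands cuts that to a few dozen at most.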

There's probably more room for improvement (the generation of the file is still CPU-bound instead of network-bound), but that's on the app itself - we should profile the app's code to further improve the performance.

See #2350 See #2359

matiasgarciaisaia commented 3 months ago

We suspect the CSV encoder may be suboptimal.

We've also tested with 100k and 50k limits for the interactions files, but the performance was slightly worse (startup took a bit longer, and the overall speed didn't improve further).

matiasgarciaisaia commented 3 months ago

CC: @ggiraldez in case you want to add anything else.

ggiraldez commented 3 months ago

I know very little of Elixir or Ecto, but you may also want to explore streaming directly from the database. The MySQL driver supports it, although it requires a transaction which may be a deal breaker perhaps? Anyway, see https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2

anaPerezGhiglia commented 3 months ago

In line with @ggiraldez's suggestion, I realized that the incentives file is built using Repo.stream instead of Stream.resource. I think it's worth the effort to try changing the other three files to be built this way and see how the servers behave.

matiasgarciaisaia commented 3 months ago

I tried changing the queries to use Ecto's Repo.stream (instead of manually doing Stream.resource and paginating from the app) but preloads are not supported on streams 🫠
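For anyone less familiar with the pattern, app-side pagination of the kind described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Surveda's Elixir code: `paginate`, `fetch_page`, and `make_fake_table` are hypothetical names, and paging by `id > last_id` is an assumption about how the chunks are delimited.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Row = Dict[str, int]


def paginate(fetch_page: Callable[[int, int], List[Row]],
             chunk_size: int) -> Iterator[Row]:
    """Lazily yield rows, asking for one chunk at a time.

    fetch_page(after_id, limit) stands in for one LIMIT-ed DB query
    returning rows with id > after_id, ordered by id.
    """
    last_id = 0
    while True:
        page = fetch_page(last_id, chunk_size)
        if not page:
            return  # an empty page means the table is exhausted
        yield from page
        last_id = page[-1]["id"]  # resume after the last row seen


def make_fake_table(n: int) -> Tuple[Callable[[int, int], List[Row]], List[int]]:
    """In-memory stand-in for the DB; also records each 'query' made."""
    rows = [{"id": i} for i in range(1, n + 1)]
    calls: List[int] = []

    def fetch_page(after_id: int, limit: int) -> List[Row]:
        calls.append(after_id)
        return [r for r in rows if r["id"] > after_id][:limit]

    return fetch_page, calls


if __name__ == "__main__":
    fetch_page, calls = make_fake_table(10)
    ids = [row["id"] for row in paginate(fetch_page, chunk_size=4)]
    print(ids)
    print(len(calls), "queries issued")
```

Each chunk is a separate query here, so related data can be loaded per chunk, which is the kind of per-batch work a single DB-level stream does not allow.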

There may still be room for doing the CSV streaming straight from the database (i.e., make MySQL output CSV) as @ggiraldez suggested to me, but I'm not sure if that'll work or not - I'll leave this as is, we might explore that optimization if we eventually need it. Given we're about to make the file generation async (in #2350), improving the times won't be that important, either.