plausible / analytics

Simple, open source, lightweight (< 1 KB) and privacy-friendly web analytics alternative to Google Analytics.
https://plausible.io
GNU Affero General Public License v3.0
19.14k stars 1.03k forks source link

Populate site_imports from sites.imported_data in CE #4155

Closed ruslandoga closed 1 month ago

ruslandoga commented 1 month ago

Changes

This PR makes site_imports be automatically populated from sites.imported_data on startup for CE builds. Right now the self-hosters need to do it manually. See https://github.com/plausible/analytics/discussions/4125#discussioncomment-9580118

Tests

Changelog

Documentation

Dark mode

zoldar commented 1 month ago

Unfortunately, this might not work as ClickhouseRepo is most likely not running when this migration is executed. The migration script uses both, Plausible.Repo and Plausible.ClickhouseRepo.

 % MIX_ENV=ce_dev m ecto.migrate

11:21:09.474 [info] == Running 20240528115149 Plausible.Repo.Migrations.MigrateSiteImports.up/0 forward
Backfilling legacy site import across 0 sites (DRY RUN: false)...
Finished backfilling sites.

11:21:09.499 [debug] QUERY OK source="sites" db=0.2ms
SELECT s0."id", s0."domain", s0."timezone", s0."public", s0."locked", s0."stats_start_date", s0."native_stats_start_at", s0."allowed_event_props", s0."conversions_enabled", s0."props_enabled", s0."funnels_enabled", s0."ingest_rate_limit_scale_seconds", s0."ingest_rate_limit_threshold", s0."domain_changed_from", s0."domain_changed_at", s0."imported_data", s0."inserted_at", s0."updated_at", s0."id" FROM "sites" AS s0 WHERE (s0."id" = ANY($1)) [[2, 1]]
Adjusting end dates of 2 site imports (DRY RUN: false)...
Adjusting end date for site import 2 (1/2) (site ID 1, start date: 2020-05-31, end date: 2022-06-01)
** (RuntimeError) could not lookup Ecto repo Plausible.ClickhouseRepo because it was not started or it does not exist
    (ecto 3.11.2) lib/ecto/repo/registry.ex:22: Ecto.Repo.Registry.lookup/1
    (ecto 3.11.2) lib/ecto/repo/supervisor.ex:160: Ecto.Repo.Supervisor.tuplet/2
    (plausible 0.0.1) lib/plausible/clickhouse_repo.ex:2: Plausible.ClickhouseRepo.all/2
    (plausible 0.0.1) lib/plausible/data_migration/site_imports.ex:154: Plausible.DataMigration.SiteImports.imported_stats_end_date/2
    (plausible 0.0.1) lib/plausible/data_migration/site_imports.ex:92: anonymous fn/4 in Plausible.DataMigration.SiteImports.adjust_site_import_end_dates/2
    (elixir 1.16.0) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
    (plausible 0.0.1) lib/plausible/data_migration/site_imports.ex:80: Plausible.DataMigration.SiteImports.adjust_site_import_end_dates/2
    (plausible 0.0.1) lib/plausible/data_migration/site_imports.ex:45: Plausible.DataMigration.SiteImports.run/1

I have used https://hexdocs.pm/ecto_sql/Ecto.Migrator.html#with_repo/3 wrapper function to ensure ClickhouseRepo is running when data migration is executed:

% MIX_ENV=ce_dev m ecto.migrate

11:28:21.138 [info] == Running 20240528115149 Plausible.Repo.Migrations.MigrateSiteImports.up/0 forward
Backfilling legacy site import across 0 sites (DRY RUN: false)...
Finished backfilling sites.

11:28:21.180 [debug] QUERY OK source="sites" db=0.2ms
SELECT s0."id", s0."domain", s0."timezone", s0."public", s0."locked", s0."stats_start_date", s0."native_stats_start_at", s0."allowed_event_props", s0."conversions_enabled", s0."props_enabled", s0."funnels_enabled", s0."ingest_rate_limit_scale_seconds", s0."ingest_rate_limit_threshold", s0."domain_changed_from", s0."domain_changed_at", s0."imported_data", s0."inserted_at", s0."updated_at", s0."id" FROM "sites" AS s0 WHERE (s0."id" = ANY($1)) [[2, 1]]
Adjusting end dates of 2 site imports (DRY RUN: false)...
Adjusting end date for site import 2 (1/2) (site ID 1, start date: 2020-05-31, end date: 2022-06-01)
End date of site import 2 (site ID 1) is left unadjusted.
Adjusting end date for site import 3 (2/2) (site ID 2, start date: 2020-05-31, end date: 2024-05-13)
End date of site import 3 (site ID 2) is left unadjusted.
Finished adjusting end dates of site imports.
Finished

11:28:21.306 [info] == Migrated 20240528115149 in 0.1s

11:28:21.326 [info] Migrations already up
ruslandoga commented 1 month ago

Can the script switch to using Plausible.IngestRepo?

zoldar commented 1 month ago

@ruslandoga Wouldn't Plausible.Repo be missing from running apps then?

ruslandoga commented 1 month ago

Why would it be missing?

zoldar commented 1 month ago

@ruslandoga That's why it's failing here in the first place, to my understanding. Looking at https://github.com/elixir-ecto/ecto_sql/blob/v3.11.2/lib/mix/tasks/ecto.migrate.ex#L151-L158 - the migration task only starts a particular repo for the duration of executing the migrations, not the whole app (or other repos, in parallel). I think the same holds for migrator in releases.

zoldar commented 1 month ago

This fails with migrator run via release because ce?() expands Mix.env() which is in turn used in runtime context here:

$ ./migrate.sh 
Loading plausible..
Starting dependencies..
Starting repos..
Running migrations for Elixir.Plausible.Repo
** (UndefinedFunctionError) function Mix.env/0 is undefined (module Mix is not available)
    Mix.env()
    (plausible 0.0.1) expanding macro: Plausible.ce?/0
    lib/plausible-0.0.1/priv/repo/migrations/20240528115149_migrate_site_imports.exs:6: Plausible.Repo.Migrations.MigrateSiteImports.up/0
    (elixir 1.16.0) expanding macro: Kernel.if/2
    nofile:1: (file)

I think it's better to add an explicit migration step to upgrade instructions instead, like:

docker compose exec plausible bin/plausible rpc "Plausible.DataMigration.SiteImports.run(dry_run: false)"
ruslandoga commented 1 month ago

I think it's easier to work around this than to push it down to self-hosters to perform another manual step.

ce? can be made into a function similar to Plausible.product_name/0, while on_ce and others would stay macros.

PR: https://github.com/plausible/analytics/pull/4158

ruslandoga commented 1 month ago

the migration task only starts a particular repo for the duration of executing the migrations, not the whole app (or other repos, in parallel)

We are not using mix ecto.migrate: https://github.com/plausible/analytics/blob/master/rel/overlays/migrate.sh#L6

eval starts the whole app with all the repos.

zoldar commented 1 month ago

We are not using mix ecto.migrate:

Anyone trying to setup the application locally (for instance, with MIX_ENV=ce|ce_test mix ecto.setup) will have this migration fail on them without both repos up.

P.S. The updated ce?() check works when used at runtime now, confirmed with a test run on staging.

ruslandoga commented 1 month ago

Anyone trying to setup the application locally (for instance, with MIX_ENV=ce|ce_test mix ecto.setup) will have this migration fail on them without both repos up.

MIX_ENV=ce|ce_test mix ecto.setup fails already since seeds use funnels :) And it doesn't seem like ClickhouseRepo is called if there are no sites in the DB (the setup use-case). Something feels off to me about this migrator wrapper.

zoldar commented 1 month ago
ruslandoga commented 1 month ago

eval starts the whole app with all the repos.

I was wrong, only the selected repos and apps are started right now.

https://github.com/plausible/analytics/blob/ebabf75cfe87940f39ab60b5dcd82bcf62fdaa6b/lib/plausible_release.ex#L178-L182

[
  {:ecto_sql, ~c"SQL-based adapters for Ecto and database migrations", ~c"3.11.1"},
  {:ch, ~c"HTTP ClickHouse driver for Elixir", ~c"0.2.5"},
  {:mint, ~c"Small and composable HTTP client.", ~c"1.6.0"},
  {:hpax, ~c"Implementation of the HPACK protocol (RFC 7541) for Elixir", ~c"0.2.0"},
  {:castore, ~c"Up-to-date CA certificate store.", ~c"1.0.7"},
  {:ecto, ~c"A toolkit for data mapping and language integrated query for Elixir", ~c"3.11.2"},
  {:eex, ~c"eex", ~c"1.16.0"},
  {:postgrex, ~c"PostgreSQL driver for Elixir", ~c"0.17.5"},
  {:jason, ~c"A blazing fast JSON parser and generator in pure Elixir.\n", ~c"1.4.1"},
  {:decimal, ~c"Arbitrary precision decimal arithmetic.", ~c"2.1.1"},
  {:db_connection, ~c"Database connection behaviour for database transactions and connection pooling\n", ~c"2.6.0"},
  {:telemetry, ~c"Dynamic dispatching library for metrics and instrumentations", ~c"1.2.1"},
  {:ssl, ~c"Erlang/OTP SSL application", ~c"11.1"},
  {:public_key, ~c"Public key infrastructure", ~c"1.15"},
  {:asn1, ~c"The Erlang ASN1 compiler version 5.2.1", ~c"5.2.1"},
  {:crypto, ~c"CRYPTO", ~c"5.4"},
  {:logger, ~c"logger", ~c"1.16.0"},
  {:elixir, ~c"elixir", ~c"1.16.0"},
  {:compiler, ~c"ERTS  CXC 138 10", ~c"8.4.1"},
  {:stdlib, ~c"ERTS  CXC 138 10", ~c"5.2"},
  {:kernel, ~c"ERTS  CXC 138 10", ~c"9.2"}
]

So this migration still needs a workaround to either:

defp prepare do
  # ...
  Enum.each([Plausible.ClickhouseRepo | repos()], & &1.start_link(pool_size: 2))
end

I've opened a PR with the latter: https://github.com/plausible/analytics/pull/4155