feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null

resolves #2793

preview URL: https://enforce-schema.loculus.org

What does this PR solve?

This PR resolves 2 issues discussed among others in #2793 but recapitulated here:

We were wasting db storage and bandwidth by storing lots of metadata fields with null values in sequence_entries_preprocessed_data.processed_data in the db - in Pathoplexus this wastage might even be 50% or more of total row size, reaching a point that's worth optimizing.
Whenever we wanted to add a new metadata field, we needed to reprocess the data because the backend would rigidly send exactly what it had in sequence_entries_preprocessed_data.processed_data to the get-released-data consumer and SILO's preprocessing doesn't like missing (or extra as well, I think) fields in NDJSON. This created a lot of friction in deployment.

What does this PR do?

This PR introduces the idea of metadata compression, in parallel to the existing sequence compression. This fits very nicely into the existing code base, making the actual code (non-test) changes trivial.

Processed metadata is compressed before being stored in the db by removing all fields that have null values: https://github.com/loculus-project/loculus/blob/67161496adcbbf7d093bda423007ab63ad93124d/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt#L121
Processed metadata is decompressed after being retrieved from the db by filling back null to all the fields that are missing. At the same time, fields that are no longer in the current metadata schema are dropped so as to allow us to remove no-longer-wanted metadata fields: https://github.com/loculus-project/loculus/blob/67161496adcbbf7d093bda423007ab63ad93124d/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt#L89-L96

Aside: should external metadata always be a subset of metadata?

This is kind of a side discussion, only relevant because the decompression code needs to know what all the metadata fields are that should be filled with null.

As part of adapting tests to the new paradigm, I noticed that the backend config is a little unclear regarding external metadata:

should external metadata always be part of metadata? (this is what we do in practice right now via the helm chart), i.e. should external metadata fields be a subset of metadata
or should external metadata be allowed to have fields not in metadata? (this is what the backend test config was until now)

For me, it seems most logical to have metadata be the thing that the backend emits in get-released-data. Whether we allow externalMetadataSubmitter to override or not is a property of a metadata field (in my understanding). For now this difference doesn't matter other than in the way I resolved tests. In the backend tests I use my interpretation of enforcing externalMetadata to be a subset of metadata. We could in the future relax this but that would require test changes.

It might make more sense eventually though to get rid of the "externalMetadata" part of the config and add extra optional fields to the "metadata" fields listing whether or not they are allowed to be overridden by external metadata submitter and if so by which user.

If we want to keep things lose like the tests were right now, we can adapt this PR, it'd just have to then iterate over both metadata and external metadata schemas to get the full set of expected fields. Also: how would we treat cases where field values are identical but types differ?

Screenshot

No more nulls in processed data:

Testing

I've written a comprehensive unit test for the metadata compression/decompression behavior.

It's a bit tricky to write endpoint test cases for this - the relevant behavior to test would be changes of backend config (before and after addition/removal of a metadata field) which I couldn't figure out how to test well.

The changes made themselves seen in a couple of existing tests though. I checked thoroughly that we can now add and remove metadata fields on a persistent db and that things work as expected, e.g. https://github.com/loculus-project/loculus/pull/3232/commits/4e884b61d67bde6522ca6a75af304da0cab0d22f and https://github.com/loculus-project/loculus/pull/3232/commits/59833e14c5f6b3340a015c4370f6d2191a3eb590

PR Checklist

[ ] All necessary documentation has been adapted.
[x] The implemented feature is covered by an appropriate test.

Thanks @corneliusroemer this looks great to me! I just wanted to ask two follow ups - otherwise I'm happy to approve.

Do understand correctly that only newly processed data will be compressed - unless we increase the preprocessing pipeline version and we will then process all data again and it will be compressed?
About the external metadata: Although I wrote external metadata to be potentially disjoint from the metadata now that I understand our code ecosystem better this seems like an unnecessary complication - which we anyways do not use. So I am fine with enforcing external metadata to always be a subset of the full metadata. However, with the current config structure this doesn't have to be the case and we don't check for this. I think we should change the config structure to as you suggested - e.g. add an additional optional field to the metadata fields which specifies a field is external metadata - even better we could have the backend not expect two separate external metadata and metadata dictionaries but one metadata dictionary with this field. Does that sound alright with you?

Re @fengelniederhammer's comments:

How does this play together will data that is already in the database? If I interpret the current code correctly, then it will decompress all processed data. Will this work with uncompressed data in the database?

Compression is not really what's happening here, so it seems to give the wrong idea. We're not compressing metadata, we're just not storing null fields. Everything is null unless it's present with non-null value. But there's no harm in having extra nulls in there, the "decompress" code doesn't care, it just adds nulls back when necessary.

As a result, all not previously compressed data already in the db is perfectly valid and can be "uncompressed" without problem or loss of anything.

How much space does this actually save? Metadata isn't that large in terms of memory demand.

Saving storage (and amount of data to be streamed by db which is just as much a bottleneck) isn't the main and only reason for this, but it's a nice extra at no cost - well the cost is this line which is negligible I hope you agree: https://github.com/loculus-project/loculus/blob/d78c8b868de59248f23a69a93cac1dd465d6b953/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt#L121

I find it hard to argue why we should keep those nulls around if it's so simple to get rid of them.

I wonder whether the overhead (in code complexity and runtime) is worth the trouble?

If we want to be able to easily add metadata fields without having to reprocess then we need the "decompression" step, whether we keep nulls around or not. So that slight complexity is required: https://github.com/loculus-project/loculus/blob/67161496adcbbf7d093bda423007ab63ad93124d/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt#L89-L96

But are we really arguing over 4 lines as complexity - I'm actually quite proud of myself for the lack of complexity of the solution (it's mostly because you've set things up so nicely with the compression service, so full credit to you @fengelniederhammer)? For runtime, I'm very sure that this is almost entirely negligible. And we agreed that we would do this in #2793 and related PRs (where you said that you were happy for the backend to always emit what the schema defines ;) )

Re @anna-parker's comments:

Do understand correctly that only newly processed data will be compressed - unless we increase the preprocessing pipeline version and we will then process all data again and it will be compressed?

That's correct - compression is really not the right word though for metadata. It's more dehydration, throwing out things that are not required (but also don't do harm). Compression is entirely optional, has slight benefits of saving space and is otherwise irrelevant. But it just takes one line of code, so why not.

Even better we could have the backend not expect two separate external metadata and metadata dictionaries but one metadata dictionary with this field.

Excellent, that's exactly what I was thinking of! Make externality an optional extra of a metadata fields. We should probably do this in a separate PR - it should be pretty straightforward!

Compression is not really what's happening here, so it seems to give the wrong idea.

Oh I see. I misread the code (just had a quick look over it). I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

Re how much storage is saved, running this script from ChatGPT

``` SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size, pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size, pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS indexes_size, pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename) - pg_indexes_size(schemaname || '.' || tablename)) AS toast_size FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC; ```

on this branch vs main gives:

This branch (compressed)

loculus-#     pg_total_relation_size(schemaname || '.' || tablename) DESC;
      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 74 MB      | 28 MB      | 2512 kB      | 43 MB
 public                | sequence_entries                   | 57 MB      | 37 MB      | 2120 kB      | 17 MB
... (truncating small/irrelevant tables)

main

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 104 MB     | 35 MB      | 3160 kB      | 66 MB
 public                | sequence_entries                   | 56 MB      | 37 MB      | 2144 kB      | 17 MB
... (truncating small/irrelevant tables)

So we're shaving off 30% in total size, 20% in table size and 30% in toast size for our largest table which is the preprocessed data. In an actual deployment that table is even more the largest component.

This is a cheap substantial perf win, do you agree now @fengelniederhammer :)

Just out of curiosity, here's our current prod pathoplexus result, preprocessed data is almost everything (we should truncate to speed things up again, we don't need old versions of processing lying around)

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 1127 MB    | 218 MB     | 21 MB        | 887 MB
 public                | sequence_entries                   | 112 MB     | 57 MB      | 2704 kB      | 52 MB
 ena_deposition_schema | submission_table                   | 5264 kB    | 120 kB     | 48 kB        | 5096 kB
 public                | audit_log                          | 4336 kB    | 3896 kB    | 88 kB        | 352 kB
 public                | data_use_terms_table               | 3272 kB    | 2304 kB    | 960 kB       | 8192 bytes
 ena_deposition_schema | assembly_table                     | 224 kB     | 168 kB     | 48 kB        | 8192 bytes
 public                | external_metadata                  | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 ena_deposition_schema | sample_table                       | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 public                | user_groups_table                  | 64 kB      | 8192 bytes | 48 kB        | 8192 bytes
 ena_deposition_schema | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 public                | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 ena_deposition_schema | project_table                      | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | sequence_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_records                     | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | groups_table                       | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | metadata_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | table_update_tracker               | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqsets                            | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_to_records                  | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | current_processing_pipeline        | 24 kB      | 8192 bytes | 16 kB        | 0 bytes

I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

I potentially disagree, I think this is a great place, note I renamed the function to no longer mention sequences as we now compress sequences and metadata. I agree that it might be confusing to you as when you wrote this you only compressed sequences, but that's just because you wrote the code.

Maybe we could rename the service to something more general than compression. I wouldn't move it somewhere else, that would just complicate logic - the fact the required changes are so small is a good sign that this is the right place to make those changes.

I first had a separate service, but that would then have to be called in a number of places per endpoint (the way I had set it up) - now this is much cleaner.

The current location is good because we want this hydration/dehydration (or compression or whatever) to happen in one place before things go to db and when they come out of db. And it's natural to do it in the same place that we compress sequences.

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

-> #3241

loculus-project / loculus