loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0
37 stars 2 forks source link

sketch: do not hide metadata processing in sequence compression function #3241

Closed fengelniederhammer closed 2 days ago

fengelniederhammer commented 2 days ago

Summary

https://github.com/loculus-project/loculus/pull/3232#issuecomment-2485795043

It's missing tests and probably some thought over how to embed this into the rest of the code (e.g. the CompressionService is still used separately), but this sketches how we could separate the concerns (metadata postprocessing vs. sequence compression). Separation of concerns is also my main motivation for this.

Screenshot

PR Checklist

corneliusroemer commented 2 days ago

Thanks, this is good, I see what you mean now, the structure is clear at the expense of more layering/intermediation.

I have no objections to merging this in if it works and the tests of #3232 are adapted to work with this but I don't have the bandwidth to do this myself right now.

There might be a slight perf hit to doing it this way here as opposed to the original due to extra copying but probably negligible.

corneliusroemer commented 2 days ago

Ah this is actually super easy to unit test now, easier than before - one extra advantage of this new organization

corneliusroemer commented 2 days ago

@fengelniederhammer I've adapted the unit tests and made a function name more precise (it also filters out extra fields that are not in the schema but in the metadata) - how does this look to you? Happy for you to merge this in.

See: https://github.com/loculus-project/loculus/pull/3241/commits/1d90b9ed5449a29db76bb7d038e1764a1a00e6c5

fengelniederhammer commented 2 days ago

I added a commit to use Hamcrest matchers, because they provide better assertion errors than assertTrue. Looks good to me :+1:

fengelniederhammer commented 2 days ago

There might be a slight perf hit to doing it this way here as opposed to the original due to extra copying but probably negligible.

According to the docs, copy does not copy in the sense of "copying memory", so I think there is actually no performance impact. (The method name is misleading if you're used to Rust or C++)

Use the copy() function to copy an object, allowing you to alter some of its properties while keeping the rest unchanged.

https://kotlinlang.org/docs/data-classes.html#copying

corneliusroemer commented 2 days ago

Very nice, thanks! I see, makes sense that it's a CreateOnWrite (cow) under the hood or something like that.