loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0
34 stars 1 forks source link

Group name in `/get-released-data` should not come from preprocessed data but from join on the db table #2819

Closed corneliusroemer closed 1 day ago

corneliusroemer commented 2 days ago

We currently seem to populate the group name in /get-released-data from what's in preprocessed data (we pass the group info to preprocessing).

That's not great since it means updates to group name won't be reflected until there's a reprocessing.

We should pull in the group name from the appropriate db table instead.

Related to #2815

chaoran-chen commented 1 day ago

@corneliusroemer, are you sure? Looking at the code that Fabian pointed to in the #2815, I see that the data are from submissionDatabaseService.streamReleasedSubmissions() and in that function, the group ID and name are taken from the view in lines 604-605.

Sorry if I am reading this wrongly, did you try it out (e.g., by changing the group name manually in the database) and observe that the group name does not update in the new released data?

corneliusroemer commented 1 day ago

Thanks @chaoran-chen, the lines you point to look convincing and suggest my inference was wrong. I had neither tested nor investigated deeply. I didn't follow the code paths and assumed that rawProcessedData means it's from processing pipeline when in fact as you show we do seem to use the group table somewhere.

Why do we have that line that Fabian pointed to if we then don't use that node at all? I'm not quite sure what's going on there, is it unpacking the JSON received by the database?

I think this goes to show that my issue re investigation was justified as it's not immediately obvious where the group name comes from and the line that Fabian pointed to didn't fully resolve it. Maybe we closed that issue prematurely. 🫣

corneliusroemer commented 1 day ago

I'll close this here as not planned as I verified your conclusions - indeed we use the group entity here.

We should do the merge in sequence entries view though I think, it's confusing that we merge some things in the view and other things in backend code. Would be best to be consistent. Or is there something I'm missing why we should not do the merge in the view?

chaoran-chen commented 1 day ago

Ah, I see the confusion - the variable name might not be clear enough. Please take a look at these two lines:

https://github.com/loculus-project/loculus/blob/cca00bbc9174e3080e7507802ce838f186babc22/backend/src/main/kotlin/org/loculus/backend/model/ReleasedDataModel.kt#L59-L60

The data processed by the pipeline is in rawProcessedData.processedData and when you look at the group name

https://github.com/loculus-project/loculus/blob/cca00bbc9174e3080e7507802ce838f186babc22/backend/src/main/kotlin/org/loculus/backend/model/ReleasedDataModel.kt#L67

it is not coming from processedData.

chaoran-chen commented 1 day ago

How would you do it in the view? For the database and view, the metadata returned by the pipeline is just a JSON, how would you like to unpack it in a view? Further, the sequences are compressed, so it's even harder to unpack in a view and I don't see the reason and the problem with this implementation. Could you please elaborate a bit more?

chaoran-chen commented 1 day ago

Ah, do you mean joining the group name into the view?

corneliusroemer commented 1 day ago

Yes exactly I don't see why we do some things in a view and then we join the group name not in the view when we could do it there afaict.

chaoran-chen commented 1 day ago

Good point, yes I think that's a good idea.

chaoran-chen commented 1 day ago

Although.. The view currently is designed to rather normalized (i.e., with minimal redundancy) and it is not designed for a particular use case. Adding group name would create a quite strong redundancy and I'm not sure that this is the best code design. We should think a bit more about how we see the role of the view.

corneliusroemer commented 1 day ago

I'm not sure I understand, what's the problem of joining the group name there? That one might not need the name except for get release?

We already only join the latest processed data and the latest data use terms etc.

Ah do you mean that group name would be redundant across rows whereas the other columns wouldn't be because everything we are joining is unique at least at the accession level (though not necessarily version)