Closed corneliusroemer closed 1 day ago
@corneliusroemer, are you sure? Looking at the code that Fabian pointed to in the #2815, I see that the data are from submissionDatabaseService.streamReleasedSubmissions()
and in that function, the group ID and name are taken from the view in lines 604-605.
Sorry if I am reading this wrongly, did you try it out (e.g., by changing the group name manually in the database) and observe that the group name does not update in the new released data?
Thanks @chaoran-chen, the lines you point to look convincing and suggest my inference was wrong. I had neither tested nor investigated deeply. I didn't follow the code paths and assumed that rawProcessedData means it's from processing pipeline when in fact as you show we do seem to use the group table somewhere.
Why do we have that line that Fabian pointed to if we then don't use that node at all? I'm not quite sure what's going on there, is it unpacking the JSON received by the database?
I think this goes to show that my issue re investigation was justified as it's not immediately obvious where the group name comes from and the line that Fabian pointed to didn't fully resolve it. Maybe we closed that issue prematurely. 🫣
I'll close this here as not planned as I verified your conclusions - indeed we use the group entity here.
We should do the merge in sequence entries view though I think, it's confusing that we merge some things in the view and other things in backend code. Would be best to be consistent. Or is there something I'm missing why we should not do the merge in the view?
Ah, I see the confusion - the variable name might not be clear enough. Please take a look at these two lines:
The data processed by the pipeline is in rawProcessedData.processedData
and when you look at the group name
it is not coming from processedData
.
How would you do it in the view? For the database and view, the metadata returned by the pipeline is just a JSON, how would you like to unpack it in a view? Further, the sequences are compressed, so it's even harder to unpack in a view and I don't see the reason and the problem with this implementation. Could you please elaborate a bit more?
Ah, do you mean joining the group name into the view?
Yes exactly I don't see why we do some things in a view and then we join the group name not in the view when we could do it there afaict.
Good point, yes I think that's a good idea.
Although.. The view currently is designed to rather normalized (i.e., with minimal redundancy) and it is not designed for a particular use case. Adding group name would create a quite strong redundancy and I'm not sure that this is the best code design. We should think a bit more about how we see the role of the view.
I'm not sure I understand, what's the problem of joining the group name there? That one might not need the name except for get release?
We already only join the latest processed data and the latest data use terms etc.
Ah do you mean that group name would be redundant across rows whereas the other columns wouldn't be because everything we are joining is unique at least at the accession level (though not necessarily version)
We currently seem to populate the group name in
/get-released-data
from what's in preprocessed data (we pass the group info to preprocessing).That's not great since it means updates to group name won't be reflected until there's a reprocessing.
We should pull in the group name from the appropriate db table instead.
Related to #2815