aDNA metadata calculations

humlab-sead / sead_browser_client

Online browser client for the SEAD database


aDNA metadata calculations #374

Open MattiasSealander opened 1 week ago

MattiasSealander commented 1 week ago

Among the data supplied by SciLifeLab is a count of the number of libraries connected to each sample. Since the data for each library is given in the library data sheet, this count could be obtained on the fly by counting the library IDs for a sample, rather than storing it as a value in the DB.

I have contacted SciLifeLab to confirm that we will be getting all libraries for a sample in every case.

Should we count this on the fly, or is it better to store it as a value? It is metadata rather than a result, although I guess it is something aDNA specialists consider when evaluating the results. This ties into a bigger question of what should be calculated on the fly and what should be stored as values in the DB.

For clarification, we will be storing results for samples as well as libraries. Currently, sample results and libraries would be different datasets, unless this needs to be reconsidered for some reason.
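To make the "count on the fly" option concrete, here is a minimal sketch of deriving the per-sample library count from the library rows themselves. The `LibraryRow` shape and its field names are assumptions for illustration, not the actual SciLifeLab sheet columns or SEAD schema.

```typescript
// Hypothetical shape of one row from the library data sheet.
// Field names are assumptions, not real SEAD or SciLifeLab identifiers.
interface LibraryRow {
  libraryId: string;
  sampleId: string;
}

// Count distinct library IDs per sample at read time,
// instead of storing the count as a value in the DB.
function countLibrariesPerSample(rows: LibraryRow[]): Map<string, number> {
  const idsBySample = new Map<string, Set<string>>();
  for (const row of rows) {
    let ids = idsBySample.get(row.sampleId);
    if (ids === undefined) {
      ids = new Set<string>();
      idsBySample.set(row.sampleId, ids);
    }
    // A Set ignores repeated library IDs, so duplicate rows are not double counted.
    ids.add(row.libraryId);
  }

  const counts = new Map<string, number>();
  idsBySample.forEach((ids, sampleId) => {
    counts.set(sampleId, ids.size);
  });
  return counts;
}
```

The derived count is only as complete as the set of libraries delivered, which is why the confirmation from SciLifeLab matters: if some libraries were omitted, the computed number would differ from the supplied one.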

johanvonboer commented 1 week ago

It is an interesting question. I'm not sure what is meant by 'library' in this context, but in general I would say that if the libraries themselves are something that I would need to send to the client, then it makes sense to just store the link (and not the count) between the data/value and the libraries.

A similar example is how we handle feature_types. We don't store that a physical sample has X number of a certain feature_type; instead we store the link (an array of feature_type IDs) in the physical_sample, and the definition of each feature_type is sent to the browser along with the physical_sample data. When the website is rendered, the number of feature_types for each sample and their definitions are looked up in real time from the data that then exists in the browser.
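The link-based pattern described above could look roughly like the following sketch. The interfaces and field names (`featureTypeIds`, `featureTypeId`, `name`) are hypothetical stand-ins, not the actual payload the SEAD server sends to the browser.

```typescript
// Hypothetical client-side shapes; the real SEAD payload may differ.
interface FeatureType {
  featureTypeId: number;
  name: string;
}

interface PhysicalSample {
  physicalSampleId: number;
  featureTypeIds: number[]; // the stored link, not a stored count
}

// Resolve a sample's feature types at render time from the definitions
// that were delivered alongside the sample data.
function resolveFeatureTypes(
  sample: PhysicalSample,
  definitions: FeatureType[]
): { count: number; names: string[] } {
  const byId = new Map<number, FeatureType>();
  for (const def of definitions) {
    byId.set(def.featureTypeId, def);
  }

  const names: string[] = [];
  for (const id of sample.featureTypeIds) {
    const def = byId.get(id);
    // Skip IDs whose definition was not delivered to the client.
    if (def !== undefined) {
      names.push(def.name);
    }
  }
  // The count is derived from the link array, never read from a stored value.
  return { count: sample.featureTypeIds.length, names };
}
```

If libraries are needed on the client anyway, the same approach would apply to them: store the sample-to-library links and let the count fall out of the array length.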

Would it make sense to handle this in a similar way or am I misunderstanding what 'libraries' are here?

MattiasSealander commented 1 week ago

Yes, it sounds like handling it the same way would work. TJ has answered that we will be getting all libraries (incl. those that didn't work well). A library is similar to a sample in that it has a bunch of variables with results (some categories overlapping with the sample category). You could say they separate the results into "sample results" and "library results", and one sample can have multiple libraries, as far as I understand it.

Right now, it seems that separating samples and libraries into different datasets is the best way, potentially using data_type_id to distinguish between the two.