poseidon-framework / poseidon-schema

An archaeogenetic genotype data organisation file format
0 stars 1 forks source link

SSF: `sample_accession` and `secondary_sample_accession` should not have a *unique* constraint #61

Closed nevrome closed 1 year ago

nevrome commented 1 year ago

I encountered an example (https://github.com/poseidon-framework/published_data/blob/6025ab3fcccfb8f209869d6ff89eb90b168447ce/2018_VeeramahPNAS/ENAtable.ssf) where this condition does not hold, asked @TCLamnidis for input and thus learned that the condition should probably not be there in the first place.

What do you think, @stschiff? Could you explain why you specified that all sample_accessions in a .ssf file have to be unique?

@TCLamnidis suggested instead, that the run_accession field may have this constraint.

If we decided to change this, then we could add it to #60 for the next Poseidon release.

nevrome commented 1 year ago

I just realized that this is a pretty important question, because it boils down to which variable can act as a primary key to identify entries in a .ssf file. We don't necessarily need one, in fact, but it's generally better. Not least for error messages.

stschiff commented 1 year ago

Yes, this seems to have been an obvious mistake. Samples should not be unique, of course. There is a clear hierarchy, for example spelled out here in plain english

Briefly, you add Samples to a Study, Experiments to Samples and Runs to Experiments.

So, indeed, it is the run accession that should be unique, as it seems to be the last step in the hierarchy.

nevrome commented 1 year ago

Good - let's make this part of the schema and then our software tools asap.

Could you add a commit to #60, @stschiff? Then I will try to handle the rest in our software tools.

Let's not merge #60 yet, in case more issues are unveiled thanks to Dhana's PR.

stschiff commented 1 year ago

I think this has been resolved now in #60, can be closed when #60 is merged then.