[Review recommendation] Integrate universal sample IDs provided by the BioSamples database

nevrome commented 4 months ago

This recommendation was raised in the review of the Poseidon paper.

The BioSamples database (https://www.ebi.ac.uk/biosamples/) is an attempt to provide universal sample IDs across the life sciences and is used by the archives for sequence reads (ENA/SRA/DDBJ). Essentially every published ancient sample already has a BioSample accession, because this is required for the submission of sequence reads to ENA/SRA/DDBJ. It would thus have seemed natural to make BioSamples IDs a central component of Poseidon metadata, so as to anchor Poseidon to the mainstream infrastructure, but this is not really done. There are some links being made to ENA in the .ssf "sequence source" files used by the Poseidon package, including sample accessions, but this seems more ad-hoc.

nevrome commented 4 months ago

The BioSamples FAQ states:

What pattern do BioSamples accessions follow?

BioSample accessions always begin with SAM. The next letter is either E or N or D depending if the sample information was originally submitted to EMBL-EBI or NCBI or DDBJ respectively. After that, there may be an A or a G to denote an Assay sample or a Group of samples. Finally there is a numeric component that may or may not be zero-padded.

This seems to match the sample_accession field in the .ssf file, which identifies sequencing entities, not "samples" in the Poseidon sense. Is this correct? If we already have this covered in the .ssf file, then maybe we should not add it to the .janno file as well.
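
For reference, the accession pattern quoted from the FAQ can be written as a regular expression. This is only a rough sketch built from that FAQ wording, not an official BioSamples or Poseidon validator; the helper names `is_biosample_accession` and `submission_archive` are hypothetical and exist only for this illustration.

```python
import re
from typing import Optional

# Built only from the FAQ description quoted above:
# "SAM", then E, N or D, then an optional A or G, then a numeric part
# (which may or may not be zero-padded).
BIOSAMPLE_ACCESSION = re.compile(r"SAM([END])([AG])?([0-9]+)")

# Mapping of the fourth letter to the archive named in the FAQ.
ARCHIVES = {"E": "EMBL-EBI", "N": "NCBI", "D": "DDBJ"}


def is_biosample_accession(value: str) -> bool:
    """Return True if the whole string matches the FAQ pattern."""
    return BIOSAMPLE_ACCESSION.fullmatch(value) is not None


def submission_archive(value: str) -> Optional[str]:
    """Return the archive the accession was originally submitted to, or None."""
    match = BIOSAMPLE_ACCESSION.fullmatch(value)
    if match is None:
        return None
    return ARCHIVES[match.group(1)]
```

A check like this could be applied to the sample_accession values in an .ssf file to see whether they are indeed BioSample accessions; whether every value there passes is an assumption that would need to be tested against real packages.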

stschiff commented 4 months ago

Yes, I think so too. Plus, we have Genetic_Source_Accession_IDs in the janno, which allows specifying the ENA sample ID as well.

poseidon-framework / poseidon-schema

[Review recommendation] Integrate universal sample IDs provided by the BioSamples database #78