[Review recommendation] Document the reference genome used for each sample

nevrome commented 3 months ago

This recommendation was raised in the review of the Poseidon paper.

> I could not find any discussion of reference genomes: knowing the reference genome coordinate system is essential to using any genotype file. For comparison, in the EVA archive, every VCF dataset has a "Genome Assembly" metadata field specifying the accession number of the reference genome used. It would seem to me like a reference genome field should be part of a Poseidon package too. In practice, the authors likely use some variant of the hg19 / GRCh37 human reference, which is still widely used in ancient genomics despite being over a decade out of date. Insisting on using an outdated reference genome is one way in which the ancient genomics community is disaligning itself from the mainstream, and it complicates comparisons to data from other sub-fields of genomics.

nevrome commented 3 months ago

This sounds straightforward to me. We can just add another .janno column, Reference_Genome_Assembly, that holds a single accession number. Is GCA_000001405.14 the relevant one for most of the data in the public archives?
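A minimal sketch of how such a column could be checked, assuming the usual accession shape of a `GCA_` (GenBank) or `GCF_` (RefSeq) prefix, nine digits, a dot, and a version number. The helper name `is_assembly_accession` is hypothetical and not part of any existing Poseidon tooling; the check covers format only, not whether the accession exists in an archive.

```python
import re

# Assembly accession shape: "GCA_" or "GCF_", nine digits, ".", version number.
ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_assembly_accession(value: str) -> bool:
    """Return True if `value` is shaped like an assembly accession (hypothetical helper)."""
    return ASSEMBLY_ACCESSION.fullmatch(value) is not None


# Accessions mentioned in this thread pass; informal names such as "hg19" do not.
for candidate in ["GCA_000001405.14", "GCA_000298735.1", "hg19"]:
    print(candidate, is_assembly_accession(candidate))
```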

stschiff commented 3 months ago

Yes, in principle this one is easy and definitely useful. The devil is in the details though. In many cases, people simply don't know the exact reference, at least not by the ID used by ENA. Most people here, for example, use "hs37d5" or "hg19", and would find it hard to determine which exact assembly ID that corresponds to. I think we might have to make this a free-text field and then come up with some policy for the Archive and curate these things upon submission. @TCLamnidis what do you think?

TCLamnidis commented 3 months ago

afaik, hg19 and hs37d5 are identical in chromosomes 1-22, X and Y, and differ only in the mtDNA and the added contigs. I think the real question is whether we want to make Poseidon species-agnostic, or specific to humans/hominins. I generally think that it is better to add fields that get validated, as I am not a fan of free text. Imo, a free-text field doesn't add any functionality that is not already there implicitly. If we go down the human-specific route, I think this should be a validated choice field allowing specific values for GRCh37 and GRCh38 for now. Any species-agnostic solution would have to be a free-text field ofc.

stschiff commented 3 months ago

OK. Regarding the species question, I would simply specify in the schema that if it's human, then the assembly name must follow a certain naming scheme. If it's not human, it can be anything? That could be validated using a new Species field.
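The species-dependent rule could be sketched roughly as follows. Everything concrete here is an assumption for illustration: the species value "Homo sapiens", the `GRCh<major>[.p<patch>]` pattern standing in for "a certain naming scheme", and the function name `reference_is_valid`. None of it is taken from the Poseidon schema.

```python
import re

# Assumed naming scheme for human assemblies: major release with an
# optional patch suffix, e.g. "GRCh37" or "GRCh38.p14".
HUMAN_ASSEMBLY_NAME = re.compile(r"GRCh\d+(\.p\d+)?")


def reference_is_valid(species: str, assembly: str) -> bool:
    """Strict check for human samples, any non-empty text otherwise (hypothetical)."""
    if not assembly.strip():
        return False  # an empty value is never acceptable
    if species == "Homo sapiens":
        return HUMAN_ASSEMBLY_NAME.fullmatch(assembly) is not None
    return True  # non-human: any non-empty value is accepted


print(reference_is_valid("Homo sapiens", "GRCh38.p14"))
print(reference_is_valid("Homo sapiens", "hg19"))
print(reference_is_valid("Ovis aries", "some sheep assembly"))
```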

Regarding the assembly name. @TCLamnidis the issue is that there isn't a single "GRCh38" or the like. There are patch-versions, too, as you can see in the "Revision history" (bottom of this webpage): https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/

So we would really have to be quite specific and ask for the exact patch version, something like "GRCh38.p14". But this is what I meant: most people simply don't know which version they used, and they may never find out, either because their mapping command lines are in limbo or because, even if they still have all the Eager runs, the FASTA files they referenced may be lost.

It's really a bit of a tricky question.

nevrome commented 3 months ago

I'm against a free text field. A clear accession number adds much more value and if "in the EVA archive, every VCF dataset has a 'Genome Assembly' metadata field specifying the accession number of the reference genome used", as the reviewer writes, then I don't think we should expect less.

But I also understand Stephan's practical concern about people not knowing what they used for samples published in the past. For this case we could recommend the patch releases most representative of a given main release. Or allow a range of releases. Or add not one but two columns, Reference_Genome_Assembly and Reference_Genome_Assembly_Accession, and specify some categories or a schema for the former. Anything but a free-text field, really.
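As a rough illustration of the two-column variant: the column names come from this thread, while the category set (GRCh37 and GRCh38, as suggested above for a choice field), the optional accession, and the function `check_row` are assumptions. The sketch does not verify that the accession actually belongs to the stated assembly.

```python
import re
from typing import List, Optional

# Assumed closed set of categories for Reference_Genome_Assembly.
ALLOWED_ASSEMBLIES = {"GRCh37", "GRCh38"}
# Accession shape for Reference_Genome_Assembly_Accession.
ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def check_row(assembly: str, accession: Optional[str] = None) -> List[str]:
    """Return a list of problems for one .janno row (hypothetical helper)."""
    problems = []
    if assembly not in ALLOWED_ASSEMBLIES:
        problems.append(f"unknown Reference_Genome_Assembly: {assembly!r}")
    # The accession may be omitted when the exact patch release is unknown.
    if accession is not None and ACCESSION.fullmatch(accession) is None:
        problems.append(f"malformed Reference_Genome_Assembly_Accession: {accession!r}")
    return problems


print(check_row("GRCh37", "GCA_000001405.14"))
print(check_row("hg19"))
```

Keeping the accession optional is one way to accommodate older datasets for which only the main release is known.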

> Any Species-agnostic solution would have to be a freetext field ofc.

Why are there no accession IDs for animal reference genomes? You know my ignorance on these topics, but here's an accession number of a sheep reference genome: GCA_000298735.1 by the "International Sheep Genome Consortium".

stschiff commented 3 months ago

I like these ideas. You're right, ultimately expecting a concrete assembly ID is the right thing to do. Perhaps we can find out what our ominous "hg19" or "hs37d5" genomes actually are, then simply add these assemblies for past datasets and spread the word.

And I agree this could simply extend towards non-human species, I think!

poseidon-framework / poseidon-schema #79