poseidon-framework / poseidon-schema

An archaeogenetic genotype data organisation file format

How to support multiple SNP-sets #77

Open stschiff opened 4 months ago

stschiff commented 4 months ago

Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are of course not. All public archives currently support only the 1240K format, or the HO subset of it.

It would be desirable in the future to support more call-sets, in particular in light of the increasing amount of shotgun sequencing being done. Since our Poseidon IDs are supposed to be unique per archive, which excludes the option of placing multiple packages with different call-sets into the same archive, there are two basic options to consider:

Option 1: Split archives to allow call-sets other than 1240K, for example a "Community Archive 1000G calls" or similar. This does not require any change in the schema and would be straightforward to do with the current infrastructure.

Pros: Simple, non-breaking, and in principle immediately doable.
Cons: Metadata will be duplicated across archives, which causes redundancy and may require complex syncing infrastructure to keep Janno files updated across archives.

Option 2: Extend the schema to allow multiple genotype datasets within one package. This is currently not possible, but is in principle not hard to implement: we would simply allow the YAML schema to list multiple genotype datasets, each with its own snpSet and separate genotype files. There is one catch: the Janno file contains several columns that are specific to the call-set (Genotype_Ploidy, Data_Preparation_Pipeline_URL, Nr_SNPs, Coverage_on_Target_SNPs). These could easily be turned into list columns, of course, which would be non-breaking.

Pros: A non-redundant solution with respect to package metadata and Janno files, as these would not be duplicated.
Cons: Would require some additional implementation in the server and in trident's list and forge functionality. A minimal solution would be for trident to ignore any call-set after the first, and to add support for secondary call-sets later on. However, at least with fetch, large-scale adoption of hosting multiple call-sets would result in much larger package downloads. So perhaps the server software should be changed to offer multiple zip files for download, with or without the secondary call-sets.
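To illustrate what Option 2 might look like, here is a hypothetical sketch of a POSEIDON.yml where genotypeData is a list instead of a single entry. The field names follow the current schema, but the list form and the per-entry `name` key are assumptions for illustration, not an agreed spec:

```yaml
# Hypothetical sketch: genotypeData as a list of call-sets.
# The list form and the "name" key are not part of the current schema.
poseidonVersion: 2.7.1
title: Example_Package
genotypeData:
  - name: primary_1240K        # hypothetical label for the call-set
    format: PLINK
    genoFile: Example.bed
    snpFile: Example.bim
    indFile: Example.fam
    snpSet: 1240K
  - name: shotgun_1000G        # hypothetical secondary call-set
    format: PLINK
    genoFile: Example_1000G.bed
    snpFile: Example_1000G.bim
    indFile: Example_1000G.fam
    snpSet: Other
```

Correspondingly, the call-set-specific Janno columns mentioned above (e.g. Nr_SNPs) could hold one value per call-set, assuming the semicolon-separated list convention already used for multi-value Janno fields.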

stschiff commented 1 month ago

OK, we briefly discussed this in our meeting on September 13. Given our limited dev resources, Option 1 is the more likely path.

One compromise would be to expand Minotaur once Eager3 is out and create additional pull-downs, perhaps even with imputation, and then release a compromise archive with an SNP set of perhaps something like 5 million common SNPs, taken as a subset from the 1000 Genomes Project.