legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Should we have marker FASTA collections? #24

Open sammyjava opened 2 years ago

sammyjava commented 2 years ago

@adf-ncgr said: OK, I haven't actually gotten any farther on this, but it reminds me of an issue I wanted to raise in general. Our current marker specification is all about the markers in the context of a genome. This is certainly useful, but may be problematic in cases where markers are specified as flanking sequences and some of those sequences have not yet been anchored uniquely in a genome. For example (these are the ones that got me thinking about this when a user asked about alfalfa marker sequences):

https://alfalfatoolbox.org/filebrowser/download/175
https://alfalfatoolbox.org/filebrowser/download/176

Since mapping marker sequences directly to a genome "B" may not produce the same result as mapping them to genome "A" and then projecting them to "B" by means of aligning A->B (especially for those that don't map to A in the first place), it may be useful to formalize a marker sequence fasta convention as well as the gff3 representation. We may already have some non-formal examples of this, such as https://data.legumeinfo.org/Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/

FWIW, I know that the cowpea group has been assiduously reviewing the mappings represented in https://data.legumeinfo.org/Vigna/unguiculata/markers/IT97K-499-35.gnm1.mrk.Cowpea1MSelectedSNPs/ using the flanking sequences from the chip design and will likely publish an updated version (though I'm pretty sure the ones I failed to find in the current mapping will still not be present, since I think they were not from the chip).

to which @cann0010 replied: About formalizing "a marker sequence fasta convention as well as the gff3 representation": I agree that these should be accommodated, though I imagine this as an optional extra file type -- probably either just a standard fasta file with marker names as the fasta IDs, or with alleles specified with e.g. "[A/G]" at the variant site. I believe we have one such marker file in the DS currently (I am surprised we don't have more):

Phaseolus/vulgaris/genetic/mixed.gen.Blair_Cortés_2018/phavu.mixed.gen.Blair_Cortés_2018.flanking_seq.fna.gz
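As an illustration of the second option mentioned above, here is a minimal Python sketch of how a flanking sequence with a bracketed variant site such as "[A/G]" might be split into its parts. The function name `split_flanking`, the restriction to A/C/G/T/N flank characters, and the single-variant-site assumption are all hypothetical choices for this example, not part of any agreed specification.

```python
import re

# Assumption: exactly one bracketed variant site per sequence, with alleles
# separated by "/", and flanks made up of A/C/G/T/N only (either case).
ALLELE_SITE = re.compile(r"^([ACGTNacgtn]*)\[([^\[\]]+)\]([ACGTNacgtn]*)$")


def split_flanking(seq):
    """Split e.g. 'ACGT[A/G]TTGA' into (left flank, alleles, right flank)."""
    match = ALLELE_SITE.match(seq)
    if match is None:
        raise ValueError(f"no single bracketed variant site found in: {seq!r}")
    left, alleles, right = match.groups()
    return left, alleles.split("/"), right


print(split_flanking("ACGT[A/G]TTGA"))  # ('ACGT', ['A', 'G'], 'TTGA')
```

A plain FASTA file with marker names as IDs would not need this step; the sketch only applies to the bracketed-allele variant.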

to which @adf-ncgr replied: Thanks @cann0010 ! I was also a bit surprised we don't have more, although I did find a couple of others using some find-based guesswork:

./Cajanus/cajan/markers/mixed.mrk.1drZ/cajca.mixed.mrk.1drZ.cajan_v2_primers.txt.gz
./Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/arachis.mixed.mrk.Axiom_Arachis_58K_SNP.flank_seq.tsv.gz

Note that the former is primer sequence pairs, similar to one of the alfalfa examples, whereas the arachis one is more like [A/G] at the variant site (although not as fasta).

Anyway, because this representation would be genome-independent, I'm imagining it would live separately from the marker gffs (as in the above examples, but probably under markers rather than genetic?); gff marker mapping files derived from it would then be as we currently have them, probably with some explicit reference to the sequence collection in their README. I think we'll be getting some more alfalfa markers from the Breeding Insight group and would like to handle them in a similar way, so we can figure out a good protocol for dealing with mapping to the growing number of autotetraploid genomes.

to which @sammyjava replied: Yeah, presumably under /markers/. Since these would be marker-only data, not specific to a strain or genome assembly, I'd think the collections would have a name like mixed.mrk.Blair_Cortés_2018 or mixed.mrk.Axiom_Arachis_58K.

adf-ncgr commented 2 years ago

@sammyjava your proposal seems sensible, but it would mean that we'd have two distinct types of collection under markers, to be distinguished I guess by the number of dot-separated components (3 for unmapped marker sequences, 4 for marker mappings). Is that copacetic with everyone?
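To make the component-count idea concrete, here is a minimal Python sketch of distinguishing the two kinds of collection under markers by the number of dot-separated components. The function name `classify_marker_collection` and the returned labels are hypothetical, and the sketch assumes that no individual component (strain name, key, etc.) itself contains a dot.

```python
def classify_marker_collection(name):
    """Classify a /markers/ collection name by its dot-separated component count."""
    parts = name.split(".")
    if len(parts) == 3:
        # e.g. mixed.mrk.Blair_Cortés_2018
        return "unmapped marker sequences"
    if len(parts) == 4:
        # e.g. IT97K-499-35.gnm1.mrk.Cowpea1MSelectedSNPs
        return "marker mappings"
    raise ValueError(f"unexpected number of components in: {name!r}")


for name in ("mixed.mrk.Blair_Cortés_2018",
             "IT97K-499-35.gnm1.mrk.Cowpea1MSelectedSNPs"):
    print(name, "->", classify_marker_collection(name))
```

This shows the concern in practice: the same position (for example the second component) carries a different meaning depending on the total count.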

sammyjava commented 2 years ago

Yeah I thought of that and can't really come up with a reason for not using the directory name as informational. In fact, we already do it under /genetic/: pop.gen.Author1_Author2_year versus pop.gwas.Author1_Author2_year. (I think @cann0010 and I were a bit proud of ourselves when we realized that the middle .chunk. could be used to differentiate content, and it's been really helpful for me in building the /genetic/ collections since tree *.gwas.* is a very handy command.)

So maybe in keeping with that concept, we'd use a name like mixed.seq.Blair_Cortés_2018 under /markers/ for the FASTAs.
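For illustration, the "use the middle chunk to differentiate content" idea can be sketched in Python as below. In the collection names quoted in this thread, the content-type token (gen, gwas, mrk, and the proposed seq) happens to be the second-to-last dot-separated component in both three- and four-component names; the function name `collection_type` and that positional rule are assumptions made for this example, not a stated datastore rule.

```python
def collection_type(name):
    """Return the content-type token, taken here as the second-to-last component."""
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError(f"too few components in: {name!r}")
    return parts[-2]


examples = [
    "pop.gen.Author1_Author2_year",
    "pop.gwas.Author1_Author2_year",
    "mixed.seq.Blair_Cortés_2018",
    "IT97K-499-35.gnm1.mrk.Cowpea1MSelectedSNPs",
]
for name in examples:
    print(name, "->", collection_type(name))
```

Reading the type from a fixed position relative to the end would sidestep the question of how many components precede it, provided the final key never contains a dot.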

adf-ncgr commented 2 years ago

It was really the fact that the number of components would be different and positional components would not have the same meaning that I was concerned might be problematic. If it isn't problematic then I'd propose we drop "mixed" from the front of these, since it seems to only be there as a placeholder pretending that we need a genotype designation for everything; unless we think some would get a different value than "mixed"?

sammyjava commented 2 years ago

Ah yeah, mixed is a population, which I guess isn't relevant. We do have varying numbers of components across collections: three and four (not five that I recall). I'm not sure what we'd prefix ".seq." with; I think it should be something, and it can't be gensp, per the rule for collection names. It's a new problem having something that's just totally standalone like marker sequences.