Closed adf-ncgr closed 3 years ago
Genomes must be loaded as gensp.strain.gnm.ann.foo files and located under a v2/Genus/species folder. Taxon ID must uniquely identify the species. So:
In that case, one loads the synthetic strain genome just as one loads real strains. @cann0010
I guess the thing that prompted this was really the markers dataset: https://legumeinfo.org/data/v2/Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/ which as I understand it is mapped to a synthetic combination of araip and aradu genomes (which itself doesn't actually "exist" as a genome in the datastore).
Yeah, that doesn't even seem like something worth having in the DS. Just create Axiom_Arachis_58K_SNP GFFs for each actual genome. I don't think there's any extra information in this file. The flanking sequences can be put into GFF attributes:
aradu.V14167.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
araip.K30076.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
araca.STRAIN.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
arast.STRAIN.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
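For illustration only, here is a minimal sketch of how per-genome marker GFFs with flanking sequences in the attributes column might be produced. The attribute keys `left_flank` and `right_flank`, the `genetic_marker` feature type, and the example sequence and marker IDs are assumptions for this sketch, not an agreed Data Store convention.

```python
from urllib.parse import quote


def marker_gff_name(gensp, strain, gnm, marker_set):
    """Build a per-genome marker GFF file name following the pattern listed above."""
    return f"{gensp}.{strain}.{gnm}.mrk.{marker_set}.gff3.gz"


def marker_gff_line(seqid, pos, marker_id, left_flank, right_flank,
                    source="Axiom_Arachis_58K_SNP"):
    """Format one 9-column GFF3 record for a SNP marker.

    The left_flank/right_flank attribute keys are hypothetical; lowercase
    keys are used because GFF3 reserves capitalized keys for its own tags.
    """
    attrs = {"ID": marker_id, "left_flank": left_flank, "right_flank": right_flank}
    # Percent-encode values so ';', '=', ',' and tabs cannot break the column.
    attr_str = ";".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())
    return "\t".join(
        [seqid, source, "genetic_marker", str(pos), str(pos), ".", ".", ".", attr_str]
    )


if __name__ == "__main__":
    print(marker_gff_name("aradu", "V14167", "gnm1", "Axiom_Arachis_58K_SNP"))
    # Placeholder sequence ID, position, marker ID, and flanks.
    print(marker_gff_line("chr_placeholder", 12345, "marker_placeholder", "ACGTAC", "TTGACA"))
```

The same marker set would be written once per real genome, so the synthetic combined genome never needs to exist in the datastore.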
Works for me, but it's not my file of markers. Were there things that needed to be discussed about the interspecific maps? Is there a reason those shouldn't reference the "genus" as their "organism"?
InterMine Organism is at the species level. It has a genus and species, neither of which can be null, nor can the taxonId be null.
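A rough sketch of that constraint, to show why a genus-only record cannot be stored as an Organism. This is not InterMine's actual model class; the `Organism` dataclass, its field names, and the taxon ID value are placeholders for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Organism:
    """Toy stand-in for a species-level organism record (hypothetical class)."""
    taxon_id: Optional[int]
    genus: Optional[str]
    species: Optional[str]

    def __post_init__(self):
        # None of the three fields may be null, mirroring the rule stated above.
        for field_name in ("taxon_id", "genus", "species"):
            if getattr(self, field_name) is None:
                raise ValueError(f"Organism.{field_name} must not be null")


if __name__ == "__main__":
    # 12345 is a placeholder, not a real taxon ID.
    print(Organism(taxon_id=12345, genus="Arachis", species="hypogaea"))
    try:
        Organism(taxon_id=12345, genus="Arachis", species=None)
    except ValueError as err:
        print("rejected genus-level record:", err)
```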
The interspecific maps are a problem. One solution would be to yank the organism reference from GeneticMap; I don't think there's any strong functional reason to have GeneticMap.organism. (Note that README.genotypes get loaded as populations: for historical reasons, there is a very simple Population class that just holds the string.)
[ ] Genetic Map
I guess that would also mean dropping QTL.organism. Marker.organism can be left empty with these loads, which will get populated when the marker GFF gets loaded for the particular genome, as long as there is only one. Otherwise, we'll hit a merge collision on that marker. Probably best to always use marker maps to hypogaea in the Arachis case, or give them different names for the different GFFs.
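To make the collision concrete, here is a toy model of the behaviour described above. It is not InterMine's integration code; the `load_markers` function and the record layout are invented for this sketch. A marker loaded from a map has no organism, the first marker GFF fills it in, and a second GFF with the same marker name but a different organism collides.

```python
def load_markers(records):
    """Merge (marker_name, organism) records keyed by marker name.

    organism may be None (e.g. a marker loaded from a genetic map).
    Raises ValueError when one name arrives with two different organisms.
    """
    merged = {}
    for name, organism in records:
        existing = merged.get(name)
        if organism is None:
            # Map-only load: keep whatever organism is already known.
            merged.setdefault(name, None)
        elif existing is None:
            merged[name] = organism
        elif existing != organism:
            raise ValueError(
                f"merge collision on marker {name!r}: {existing} vs {organism}"
            )
    return merged


if __name__ == "__main__":
    # Map load first, then a single genome's GFF: organism gets populated.
    print(load_markers([("m1", None), ("m1", "aradu")]))
    # Distinct names per GFF avoid the collision (naming scheme is hypothetical).
    print(load_markers([("aradu.m1", "aradu"), ("araip.m1", "araip")]))
    try:
        load_markers([("m1", "aradu"), ("m1", "araip")])
    except ValueError as err:
        print(err)
```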
I think it's not inappropriate (conceptually) to have sequences that belong to a genus. Both repeat sequences and genetic markers are valid examples. Sure, we could assign them, post-hoc, a particular accession; but they should be applicable across most everything in the genus, and may have been identified as common elements from many accessions. I don't want to exclude them from the Data Store; the repeats, in particular, are quite important for annotation projects.
They could be excluded from the Mine ... or maybe your solution, Sam (put them in a non-SequenceFeature class).
@cann0010 agreed that having certain things assigned at a genus level is not conceptually inappropriate (probably no more so than species, anyway; species don't have genomes, individuals have genomes, etc.).
I would like to better understand the plan for representing repeat data. The current situation is a little confusing to me; e.g. we have repeat fastas under both GENUS/repeats and hypogaea/repeats, but from the sequence names it's not clear the latter are really species-specific, since they have both Ah and Ad named elements; also the former seem to have a sort of full-yuck applied but the latter do not. Per a peanutbase user request I just posted a GFF file of RepeatMasker output to annotations, but it would probably be good for us to agree that this is indeed the right place for it (and not under the repeats folder, for example). Perhaps this is worth its own issue.
Incidentally, the new RepeatModeler2 tool seems to be doing a fair job and is relatively easy to run on genomes, so I have been considering the value of doing something a bit more systematic than we currently have. But I don't necessarily see all the ramifications of such a plan. An agenda item for another week...
Yes, clearly it is a separate issue. Please make it so. (You'll find Reference in new issue in the three-dot menu.)
I'm closing this for now. I don't see any actionable "solution" to an issue. Reopen with a specific issue and assignee if appropriate.
@sammyjava says we need to clean this up; for example, the current taxid given is actually for hypogaea, and the current species name is "sp" which doesn't match the folder name GENUS. I thought the latter was an accepted special case but perhaps not.