legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

information under Arachis/GENUS needs review in order to pass muster for loading into PeanutMine #17

Closed adf-ncgr closed 3 years ago

adf-ncgr commented 3 years ago

@sammyjava says we need to clean this up; for example, the current taxid given is actually for hypogaea, and the current species name is "sp" which doesn't match the folder name GENUS. I thought the latter was an accepted special case but perhaps not.

sammyjava commented 3 years ago

Genomes must be loaded as gensp.strain.gnm.ann.foo files and located under a v2/Genus/species folder. Taxon ID must uniquely identify the species. So:

  1. a species name must be invented (e.g. "sp", "synth", etc.) which must lead to a unique gensp
  2. a strain name must be invented (because strain assemblies/annotations are loaded into the mines)
  3. a taxonomy ID that is not currently in NCBI must be invented (NOT the ID for the genus, that has an existing meaning and should not be used for a synthetic strain)

In that case, one loads the synthetic strain genome just as one loads real strains. @cann0010
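For illustration, the three rules above can be sketched as a small check. This is only a sketch under stated assumptions: the `gensp` key is assumed to be the first three letters of the genus plus the first two of the species (inferred from names like `aradu` and `araip`), the regular expression is a guess at the `gensp.strain.gnm...` layout, and all function names are hypothetical rather than part of any datastore tooling.

```python
import re

# Assumed layout: gensp.strain.gnmN.<rest>; not an official specification.
FILENAME_RE = re.compile(
    r"^(?P<gensp>[a-z]{5})\."   # e.g. "aradu"
    r"(?P<strain>[^.]+)\."      # e.g. "V14167"
    r"(?P<gnm>gnm\d+)\."        # e.g. "gnm1"
    r"(?P<rest>.+)$"            # annotation / type / extension parts
)


def gensp(genus, species):
    """Build the assumed 5-letter key: 3 letters of genus + 2 of species."""
    return (genus[:3] + species[:2]).lower()


def check_placement(genus, species, filename):
    """True if filename parses and its gensp prefix matches the folder."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return False
    return m.group("gensp") == gensp(genus, species)


def find_shared_taxon_ids(taxon_id_by_gensp):
    """Return taxon IDs that are claimed by more than one gensp key."""
    seen = {}
    for key, taxon_id in taxon_id_by_gensp.items():
        seen.setdefault(taxon_id, []).append(key)
    return {t: keys for t, keys in seen.items() if len(keys) > 1}
```

With this sketch, a file such as `mixed.mrk.Axiom_Arachis_58K_SNP` under `Arachis/GENUS` fails `check_placement`, and a synthetic species reusing the taxon ID of an existing species shows up in `find_shared_taxon_ids`.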

adf-ncgr commented 3 years ago

I guess the thing that prompted this was really the markers dataset: https://legumeinfo.org/data/v2/Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/ which as I understand it is mapped to a synthetic combination of araip and aradu genomes (which itself doesn't actually "exist" as a genome in the datastore).

sammyjava commented 3 years ago

Yeah, that doesn't even seem like something worth having in the DS. Just create Axiom_Arachis_58K_SNP GFFs for each actual genome. I don't think there's any extra information in this file. The flanking sequences can be put into GFF attributes:

aradu.V14167.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
araip.K30076.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
araca.STRAIN.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
arast.STRAIN.gnm1.mrk.Axiom_Arachis_58K_SNP.gff3.gz
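As a rough sketch of what "flanking sequences in GFF attributes" could look like, the snippet below formats one marker as a GFF3 line. The attribute names `left_flank` and `right_flank`, the feature type `genetic_marker`, and the function name are assumptions made for illustration; the thread does not specify them.

```python
from urllib.parse import quote


def marker_gff3_line(seqid, pos, name, left_flank, right_flank,
                     source="Axiom_Arachis_58K_SNP"):
    """Format a single-base marker as one tab-separated GFF3 line.

    Attribute names for the flanking sequences are hypothetical;
    lowercase names are used because GFF3 reserves capitalized tags.
    """
    attrs = {
        "ID": name,
        "Name": name,
        "left_flank": left_flank,
        "right_flank": right_flank,
    }
    # Percent-encode values so ';', '=', ',' and tabs cannot break the column.
    attr_str = ";".join(f"{k}={quote(str(v), safe='')}" for k, v in attrs.items())
    cols = [seqid, source, "genetic_marker", str(pos), str(pos),
            ".", ".", ".", attr_str]
    return "\t".join(cols)
```

The same marker would be written once per genome file, with `seqid` and `pos` taken from that genome's coordinates.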

adf-ncgr commented 3 years ago

works for me, but it's not my file of markers. Were there things that needed to be discussed about the interspecific maps? Is there a reason those shouldn't reference the "genus" as their "organism"?

sammyjava commented 3 years ago

InterMine Organism is at the species level. It has a genus and species, neither of which can be null, nor can the taxonId be null.
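To make the constraint concrete, here is a Python stand-in for it. This is not InterMine code or its API, only a sketch that assumes the three fields named above and rejects any record where one of them is missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    """Illustrative stand-in for a species-level organism record."""
    genus: str
    species: str
    taxon_id: str

    def __post_init__(self):
        # Mirror the "cannot be null" rule: every field must be non-empty.
        for field_name in ("genus", "species", "taxon_id"):
            if not getattr(self, field_name):
                raise ValueError(f"Organism.{field_name} must not be empty")
```

Under this model a genus-only record, with no species, cannot be constructed, which is why genus-level maps have nothing valid to reference.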

sammyjava commented 3 years ago

The interspecific maps are a problem. One solution would be to remove the organism reference from GeneticMap; I don't think there's any strong functional reason to have GeneticMap.organism. (Note that README.genotypes get loaded as populations. For historical reasons, there is a very simple Population class that just holds the string.)

[ ] Genetic Map

sammyjava commented 3 years ago

I guess that would also mean dropping QTL.organism. Marker.organism can be left empty with these loads; it will get populated when the marker GFF gets loaded for the particular genome, as long as there is only one such GFF. Otherwise, we'll hit a merge collision on that marker. Probably best to always use marker maps to hypogaea in the Arachis case, or give the markers different names for the different GFFs.
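The collision described above can be sketched as follows. This is a simplified model, not the actual InterMine merge logic: it assumes markers are keyed by name alone and that each GFF load tries to set a single organism on the marker. The function names and the suffix scheme in `disambiguate` are hypothetical.

```python
def merge_markers(loads):
    """Assign an organism to each marker name; report conflicting loads.

    loads: iterable of (genome_key, marker_name) pairs, one per GFF record.
    Returns (organism_for, collisions), where collisions maps a marker
    name to the set of genomes that tried to claim it.
    """
    organism_for = {}
    collisions = {}
    for genome, marker in loads:
        # The first load wins; a later load from another genome collides.
        previous = organism_for.setdefault(marker, genome)
        if previous != genome:
            collisions.setdefault(marker, {previous}).add(genome)
    return organism_for, collisions


def disambiguate(genome, marker):
    """One hypothetical way to give per-genome markers distinct names."""
    return f"{marker}.{genome}"
```

Loading the same marker name from both the `aradu` and `araip` GFFs reports a collision, while renaming with `disambiguate` before loading keeps every key unique.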

StevenCannon-USDA commented 3 years ago

I think it's not inappropriate (conceptually) to have sequences that belong to a genus. Both repeat sequences and genetic markers are valid examples. Sure, we could assign them, post-hoc, a particular accession; but they should be applicable across most everything in the genus, and may have been identified as common elements from many accessions. I don't want to exclude them from the Data Store; the repeats, in particular, are quite important for annotation projects.

They could be excluded from the Mine ... or maybe your solution, Sam (put them in a non-SequenceFeature class).

adf-ncgr commented 3 years ago

@cann0010 agreed that having certain things assigned at a genus level is not conceptually inappropriate (probably no more so than at the species level, anyway; species don't have genomes, individuals have genomes, etc.).

I would like to better understand the plan for representing repeat data. The current situation is a little confusing to me; e.g. we have repeat fastas under both GENUS/repeats and hypogaea/repeats but from the sequence names it's not clear the latter are really species-specific since they have both Ah and Ad named elements; also the former seem to have a sort of full-yuck applied but the latter do not. Per a peanutbase user request I just posted a gff file of repeatmasker output which I posted to annotations, but it would probably be good for us to agree that this is indeed the right place for it (and not under repeats folder for example). Perhaps this is worth its own issue.

Incidentally, the new RepeatModeler2 tool seems to be doing a fair job and is relatively easy to run on genomes, so I have been considering the value of doing something a bit more systematic than we currently have. But I don't necessarily see all the ramifications of such a plan. An agenda item for another week...

sammyjava commented 3 years ago

> I would like to better understand the plan for representing repeat data. The current situation is a little confusing to me; e.g. we have repeat fastas under both GENUS/repeats and hypogaea/repeats but from the sequence names it's not clear the latter are really species-specific since they have both Ah and Ad named elements; also the former seem to have a sort of full-yuck applied but the latter do not. Per a peanutbase user request I just posted a gff file of repeatmasker output which I posted to annotations, but it would probably be good for us to agree that this is indeed the right place for it (and not under repeats folder for example). Perhaps this is worth its own issue.

Yes, clearly it is a separate issue. Please make it so. (You'll find "Reference in new issue" in the three-dot menu.)

sammyjava commented 3 years ago

I'm closing this for now. I don't see any actionable "solution" to an issue. Reopen with a specific issue and assignee if appropriate.