legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: associated gene collections (e.g. expression) may contain ONLY genes present in gene_models_main #43

Open sammyjava opened 9 months ago

sammyjava commented 9 months ago

This seems like an obvious requirement, but I thought I'd make it visible in this RFO. It came up recently because @adf-ncgr analyzed some arahy.Tifrunner.gnm2 expression data that covers not only genes contained in the gene_models_main GFF but also "low quality" genes not present in that GFF.

Since the mines load only genes present in gene_models_main from annotations, it makes sense for a Datastore expression set to contain only those genes. Otherwise the expression loader will create orphaned genes in the mine ("low quality" ones, in this case), which I think is a bad idea in general. If a gene isn't "good enough" to be present in gene_models_main, it isn't "good enough" to be contained in other main Datastore collections.

One can always toss a file into the "annex" containing other genes if one wants to.
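
For illustration, here is a minimal sketch of how such a check might look before a collection goes into the Datastore. It is not an existing LIS tool: the function names `gene_ids_from_gff3` and `split_by_main` are hypothetical, and it assumes that gene_models_main is a standard GFF3 file whose `gene` features carry an `ID` attribute and that the expression set's gene IDs have already been read into a list.

```python
from typing import Iterable, List, Set, Tuple


def gene_ids_from_gff3(lines: Iterable[str]) -> Set[str]:
    """Collect the ID attribute of every 'gene' feature in GFF3 text.

    Assumption: gene_models_main is standard 9-column GFF3 and gene
    records have the feature type 'gene' in column 3.
    """
    ids: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key.strip() == "ID" and value.strip():
                ids.add(value.strip())
                break
    return ids


def split_by_main(
    expression_ids: Iterable[str], main_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split gene IDs into (present in main, absent from main), keeping order."""
    in_main: List[str] = []
    not_in_main: List[str] = []
    for gene_id in expression_ids:
        if gene_id in main_ids:
            in_main.append(gene_id)
        else:
            not_in_main.append(gene_id)
    return in_main, not_in_main


# Hypothetical data, for illustration only (not real arahy.Tifrunner.gnm2 IDs).
GFF_SAMPLE = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=geneA;Name=A",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=geneB",
]

main_ids = gene_ids_from_gff3(GFF_SAMPLE)
keep, orphans = split_by_main(["geneA", "geneC", "geneB"], main_ids)
```

With the sample data, `keep` holds the genes that could stay in the main expression collection and `orphans` holds those that would have to be dropped or moved to the annex. A non-empty `orphans` list is the condition this RFO proposes to disallow.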