5.1.0.4 updates - Githubissues

legumeinfo / mine-issues

Report ALL issues on LIS mines here! Regardless of which mine you found it on!


5.1.0.4 updates #128

Open sammyjava opened 11 months ago

sammyjava commented 11 months ago

[x] Gene.ensemblName added by lis-annotation loader using munging wizardry in DatastoreUtils.java
[x] lis-create-pathways post-processor using above ensemblName attribute
[x] update Pathway attribute linkouts to Plant Reactome (stableIdentifier --> primaryIdentifier) (checklist below)
[x] add Gene Ensembl Plants linkouts (checklist below)
[x] remove pathway file validation
[x] remove Pathway.stableIdentifier
[x] remove pathway loading in lis-annotation
[x] add Trait.organism and implement in lis-gwas and lis-qtl loaders.
[x] add Transcript.panGeneSets and load only Transcripts in lis-pangene since that is what is actually in the collection
[x] assign Gene.panGeneSets and Protein.panGeneSets in a post-processor
[x] remove CDSRegion in favor of loading CDS.locations as per mine standard (see below)
[x] reduce genomic model to SO spec (remove some extra SequenceFeature collections, references; see below)
[x] set SequenceFeature.name to value of secondaryIdentifier if Name attribute not supplied in GFF
[x] add Annotatable.dataSets since every Annotatable has them!
[x] remove BioEntity.dataSets (see above)
[x] DataSet.entities replaces DataSet.bioEntities (see above)
[x] SyntenyBlock extends Annotatable (it already had publications and dataSets)
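The SequenceFeature.name fallback in the checklist above can be sketched roughly as follows. This is only an illustration in Python, not the loader's actual Java code; the function names are hypothetical, and the secondaryIdentifier is passed in because how it is derived is not described here.

```python
def parse_gff_attributes(column9):
    """Parse a GFF3 attributes column ('ID=...;Name=...') into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def feature_name(attributes, secondary_identifier):
    """Use the GFF Name attribute if supplied, else fall back to secondaryIdentifier."""
    name = attributes.get("Name")
    return name if name else secondary_identifier
```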

sammyjava commented 11 months ago

Update Plant Reactome and add Ensembl attribute links to web.properties using Pathway.primaryIdentifier and Gene.ensemblName

[x] minimine
[x] aeschynomenemine
[x] arachismine
[x] cajanusmine
[x] cicermine
[x] glycinemine
[x] legumemine
[x] lensmine
[x] lupinusmine
[x] medicagomine
[x] minimine
[x] phapanmine
[x] phaseolusmine
[x] vignamine

sammyjava commented 9 months ago

Note to world: I need to load the transcripts in the pan-gene sets as MRNA objects because that's how they're loaded from the GFFs (with the mRNA type identifier). Otherwise there are loading conflicts.

sammyjava commented 9 months ago

Note: the post-processor only assigns Gene.panGeneSets and Protein.panGeneSets for genes and proteins that have been loaded from an annotation collection. That is obviously required for genes (they need the GFF); it is less obvious for proteins (same identifier), but that's how I think it should be done. Orphaned transcripts loaded by lis-pangene will remain orphaned.
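A minimal sketch of that propagation, in Python rather than the post-processor's Java, assuming each loaded transcript already knows its pan-gene sets and (if it came from an annotation collection) its gene and protein. The Transcript class and assign_pan_gene_sets function are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Transcript:
    identifier: str
    pan_gene_sets: Set[str]
    gene: Optional[str] = None      # None for an orphaned transcript
    protein: Optional[str] = None   # None for an orphaned transcript


def assign_pan_gene_sets(transcripts):
    """Copy each transcript's pan-gene sets onto its gene and protein, if present."""
    gene_sets = {}
    protein_sets = {}
    for t in transcripts:
        if t.gene is not None:
            gene_sets.setdefault(t.gene, set()).update(t.pan_gene_sets)
        if t.protein is not None:
            protein_sets.setdefault(t.protein, set()).update(t.pan_gene_sets)
    return gene_sets, protein_sets
```

Transcripts with no gene or protein contribute nothing, which matches the orphans staying orphaned.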

sammyjava commented 9 months ago

lis-populate-pangeneset-genes-proteins is written and tested on dev ArachisMine 5.1.0.4. It took about two hours. An example with some extra non-mine transcripts is https://mines.dev.lis.ncgr.org/arachismine/pangeneset:Arachis.pan2.pan00009

@adf-ncgr @StevenCannon-USDA this makes me think about how we're loading pan-gene sets versus our GFA files for gene family assignments. I'd prefer a similar file in the annotation collection (pgsa.tsv) providing assignments of those annotated genes to pan-gene sets. The way it is now, we load the mine with a lot of stray transcripts (47,956 in ArachisMine, over 10% of the total) that have no information other than an identifier, because they're in the pan-gene set but not in the annotations for that genus. I don't like having empty objects in the mines, and it can throw off queries when you do have them because a null attribute can kill a query.

sammyjava commented 9 months ago

Go back to loading CDS locations from the GFF and the full sequence from the FASTA, merging on the ID from the parent identifier. Drop CDSRegion from the model.

GFF loader procedure:

Get the CDS primaryIdentifier from the Parent attribute.
Create or update the CDS object by:
- adding this range to the locations collection
- updating chromosomeLocation/supercontigLocation as the full range expands
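The procedure above can be sketched like this. It is a rough Python illustration, not the loader's Java code: the CDS class and load_cds_rows function are hypothetical, the Parent value is used directly as the identifier (the real loader may transform it), and the FASTA merge is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CDS:
    primary_identifier: str
    contig: str
    strand: str
    locations: List[Tuple[int, int]] = field(default_factory=list)
    start: Optional[int] = None   # full-range start (chromosomeLocation/supercontigLocation)
    end: Optional[int] = None     # full-range end


def load_cds_rows(gff_lines):
    """Merge CDS rows that share a Parent into one CDS object per parent."""
    cds_by_id = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        cds = cds_by_id.get(parent)
        if cds is None:
            cds = CDS(parent, cols[0], cols[6])
            cds_by_id[parent] = cds
        # Add this range to the locations collection.
        cds.locations.append((start, end))
        # Expand the full range as new rows arrive.
        cds.start = start if cds.start is None else min(cds.start, start)
        cds.end = end if cds.end is None else max(cds.end, end)
    return cds_by_id
```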

sammyjava commented 9 months ago

Model Update

There are a number of references and collections defined in genomic-additions.xml that are, I think, legacy from the chado database model. I'd like to get things more in line with the default SO model. LIS deviations are noted. References and collections which we don't load (they can be retrieved with a chained path query) are noted.

CDS (SequenceFeature)

gene reference
protein reference
transcript reference

Exon (SequenceFeature)

gene reference
transcripts collection

Gene (SequenceFeature)

upstream/downstreamIntergenicRegion reference
proteins collection
CDSs collection
exons collection
pathways collection
transcripts collection
introns collection
flankingRegions collection
proteinDomains collection
geneFamilyAssignments collection LIS
panGeneSets collection LIS

Protein (BioEntity)

CDSs collection
genes collection
pathways collection
transcript reference, in place of the transcripts collection (LIS deviation)
geneFamilyAssignments collection LIS
panGeneSets collection LIS
proteinMatches collection LIS

Transcript (SequenceFeature)

gene reference
protein reference
exons collection
CDSs collection
UTRs collection
introns collection
panGeneSets collection LIS

UTR (SequenceFeature), FivePrimeUTR, ThreePrimeUTR

gene reference
transcripts collection

sammyjava commented 8 months ago

Wrote a completely new lis-create-intergenic-region-features post-processor to be run instead of the main mine one, which was really clunky. This one processes supercontigs, uses a parallel stream to build the IntergenicRegion objects, and runs pretty fast. Oh, it also stores the regions with assemblyVersion and annotationVersion attributes, and the primaryIdentifier is a concatenation of the adjacent gene identifiers, so two annotations on an assembly will get two sets of intergenic regions.
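A rough sketch of the idea, in Python and run sequentially for clarity (the post-processor itself uses a Java parallel stream). The function name, the tuple layout for genes, and the "|" separator in the identifier are assumptions made for illustration; only the concatenated identifier and the two version attributes come from the description above.

```python
def intergenic_regions(genes, assembly_version, annotation_version, separator="|"):
    """Build intergenic regions between adjacent genes on each contig.

    genes: iterable of (identifier, contig, start, end) tuples, 1-based inclusive.
    """
    by_contig = {}
    for gene in genes:
        by_contig.setdefault(gene[1], []).append(gene)

    regions = []
    for contig, contig_genes in by_contig.items():
        contig_genes.sort(key=lambda g: (g[2], g[3]))
        prev = contig_genes[0]
        for nxt in contig_genes[1:]:
            # Only emit a region when there is a real gap between the genes.
            if nxt[2] > prev[3] + 1:
                regions.append({
                    "primaryIdentifier": prev[0] + separator + nxt[0],
                    "contig": contig,
                    "start": prev[3] + 1,
                    "end": nxt[2] - 1,
                    "assemblyVersion": assembly_version,
                    "annotationVersion": annotation_version,
                })
            # Keep whichever gene reaches furthest, so overlaps are handled.
            if nxt[3] > prev[3]:
                prev = nxt
    return regions
```

Running this once per annotation on the same assembly yields one set of regions per annotation, since the gene identifiers and annotationVersion differ.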

sammyjava commented 8 months ago

Updated main create-gene-flanking-features to use parallel stream when creating the features.
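For illustration, the same pattern in Python, with a thread pool standing in for the Java parallel stream. The function names, the tuple layout, and the flank distances are assumptions, and clipping at the far end of the contig is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def flanking_regions(gene, distances=(500, 1000)):
    """Return upstream/downstream flanks for one gene, strand-aware.

    gene: (identifier, start, end, strand) with 1-based inclusive coordinates.
    """
    identifier, start, end, strand = gene
    regions = []
    for d in distances:
        left = (max(1, start - d), start - 1)
        right = (end + 1, end + d)
        upstream, downstream = (left, right) if strand == "+" else (right, left)
        for direction, (lo, hi) in (("upstream", upstream), ("downstream", downstream)):
            if lo <= hi:  # skip empty flanks at the start of a contig
                regions.append((identifier, direction, d, lo, hi))
    return regions


def all_flanking_regions(genes, distances=(500, 1000)):
    """Create flanking regions for all genes, one task per gene."""
    with ThreadPoolExecutor() as pool:
        per_gene = pool.map(lambda g: flanking_regions(g, distances), genes)
        return [region for regions in per_gene for region in regions]
```

Each gene's flanks are independent of every other gene's, which is what makes the creation step easy to parallelize.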