legumeinfo / mine-issues

Report ALL issues on LIS mines here! Regardless of which mine you found it on!

5.1.0.4 updates #128

Open sammyjava opened 11 months ago

sammyjava commented 11 months ago

Update Plant Reactome and add Ensembl attribute links to web.properties using Pathway.primaryIdentifier and Gene.ensemblName

sammyjava commented 9 months ago

Note to world: I need to load the transcripts in the pan-gene sets as MRNA objects because that's how they're loaded from the GFFs (with the mRNA type identifier). Otherwise there are loading conflicts.

sammyjava commented 9 months ago

Note: the post-processor will only assign Gene.panGeneSets and Protein.panGeneSets for genes and proteins that have been loaded from an annotation collection. This is obvious for genes (they need the GFF) but less obvious for proteins (same identifier); still, that's how I think it should be done. Orphaned transcripts loaded by lis-pangene will remain orphaned.

sammyjava commented 9 months ago

lis-populate-pangeneset-genes-proteins is written and tested on dev ArachisMine 5.1.0.4. It took about two hours. An example with some extra non-mine transcripts is https://mines.dev.lis.ncgr.org/arachismine/pangeneset:Arachis.pan2.pan00009

@adf-ncgr @StevenCannon-USDA this makes me think about how we're loading pan-gene sets versus our GFA files for gene family assignments. I'd prefer a similar file in the annotation collection (pgsa.tsv) providing assignments of those annotated genes to pan-gene sets. The way it is now, we load the mine with a lot of stray transcripts (47,956 in ArachisMine, over 10% of the total) that have no information other than an identifier, because they're in the pan-gene set but not in the annotations for that genus. I don't like having empty objects in the mines, and they can throw off queries, because a null attribute can kill a query.

sammyjava commented 9 months ago

Go back to loading CDS locations from the GFF and the full sequence from the FASTA, merging on the ID from the parent identifier. Drop CDSRegion from the model.
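
A rough sketch of the merge described above, not the actual loader code. It assumes (hypothetically) that each GFF CDS line carries a Parent value and that the CDS FASTA record ID equals that value; the class, record, and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CdsMergeSketch {
    /** One CDS line from a GFF: the Parent value plus its coordinates (hypothetical shape). */
    record CdsLocation(String parentId, String seqid, int start, int end, char strand) {}

    /** A merged CDS: every GFF location for one parent plus the full FASTA sequence. */
    record Cds(String id, List<CdsLocation> locations, String sequence) {}

    /**
     * Groups GFF CDS locations by parent ID, then attaches the FASTA sequence
     * stored under the same ID. The sequence is null when the FASTA has no such record.
     */
    static Map<String, Cds> merge(List<CdsLocation> gffLocations,
                                  Map<String, String> fastaSequences) {
        Map<String, List<CdsLocation>> byParent = new LinkedHashMap<>();
        for (CdsLocation loc : gffLocations) {
            byParent.computeIfAbsent(loc.parentId(), k -> new ArrayList<>()).add(loc);
        }
        Map<String, Cds> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<CdsLocation>> e : byParent.entrySet()) {
            String sequence = fastaSequences.get(e.getKey());
            merged.put(e.getKey(), new Cds(e.getKey(), e.getValue(), sequence));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up identifiers and coordinates, for illustration only.
        List<CdsLocation> locations = List.of(
            new CdsLocation("mrna1", "chr1", 10, 20, '+'),
            new CdsLocation("mrna1", "chr1", 30, 40, '+'),
            new CdsLocation("mrna2", "chr1", 100, 150, '-'));
        Map<String, String> fasta = Map.of("mrna1", "ATGAAATAG");
        for (Cds cds : merge(locations, fasta).values()) {
            System.out.println(cds.id() + " locations=" + cds.locations().size()
                + " hasSequence=" + (cds.sequence() != null));
        }
    }
}
```

With one merged CDS object per parent, the separate per-segment class is no longer needed for this data, which is consistent with dropping CDSRegion.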

GFF loader procedure:

sammyjava commented 9 months ago

Model Update

There are a number of references and collections defined in genomic-additions.xml that are, I think, legacy from the chado database model. I'd like to get things more in line with the default SO model. LIS deviations are noted. References and collections which we don't load (they can be retrieved with a chained path query) are noted.

CDS (SequenceFeature)

Exon (SequenceFeature)

Gene (SequenceFeature)

Protein (BioEntity)

Transcript (SequenceFeature)

UTR (SequenceFeature) FivePrimeUTR ThreePrimeUTR

sammyjava commented 8 months ago

Wrote a completely new lis-create-intergenic-region-features post-processor to be run instead of the main mine one, which was really clunky. This one processes supercontigs, uses a parallel stream to build the IntergenicRegion objects, and runs pretty fast. Oh, it also stores the regions with assemblyVersion and annotationVersion attributes, and the primaryIdentifier is a concatenation of the adjacent gene identifiers, so two annotations on an assembly will get two sets of intergenic regions.
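
For anyone curious about the approach, here is a minimal standalone sketch of the idea, not the post-processor itself. It assumes plain records instead of InterMine objects, 1-based inclusive coordinates, and a "|" separator in the concatenated identifier; all of those, and every name in the code, are assumptions for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class IntergenicRegionSketch {
    /** Minimal stand-ins for the mine's objects (hypothetical shapes). */
    record Gene(String primaryIdentifier, int start, int end) {}
    record IntergenicRegion(String primaryIdentifier, String supercontig, int start, int end,
                            String assemblyVersion, String annotationVersion) {}

    /** Builds the gaps between genes on one supercontig using a parallel stream. */
    static List<IntergenicRegion> build(String supercontig, List<Gene> genes,
                                        String assemblyVersion, String annotationVersion) {
        List<Gene> sorted = genes.stream()
            .sorted(Comparator.comparingInt(Gene::start))
            .collect(Collectors.toList());
        int n = sorted.size();
        // Sequential prefix scan: furthest[i] is the index of the gene with the
        // largest end among sorted[0..i], so nested genes cannot open a false gap.
        int[] furthest = new int[n];
        for (int i = 1; i < n; i++) {
            furthest[i] = sorted.get(i).end() > sorted.get(furthest[i - 1]).end()
                ? i : furthest[i - 1];
        }
        // Each gap depends only on read-only data, so it is safe to compute in parallel.
        return IntStream.range(0, n - 1)
            .parallel()
            .mapToObj(i -> {
                Gene left = sorted.get(furthest[i]);
                Gene right = sorted.get(i + 1);
                if (right.start() <= left.end() + 1) {
                    return null; // genes overlap or abut: no intergenic region
                }
                // The "|" separator is an assumption; the real one may differ.
                String id = left.primaryIdentifier() + "|" + right.primaryIdentifier();
                return new IntergenicRegion(id, supercontig, left.end() + 1, right.start() - 1,
                                            assemblyVersion, annotationVersion);
            })
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up genes and version strings, for illustration only.
        List<Gene> genes = List.of(
            new Gene("g1", 100, 200), new Gene("g2", 300, 400),
            new Gene("g3", 350, 500), new Gene("g4", 600, 700));
        build("scaffold_1", genes, "gnm1", "ann1").forEach(System.out::println);
    }
}
```

Because the identifier is built from the flanking gene identifiers, running this once per annotation on the same assembly yields a distinct set of regions for each annotation.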

sammyjava commented 8 months ago

Updated main create-gene-flanking-features to use parallel stream when creating the features.