legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

New genome+annotations for Glycine max Wm82_NJAU.gnm1 #190

Open StevenCannon-USDA opened 10 months ago

StevenCannon-USDA commented 10 months ago

Main steps for adding new genome and annotation collections

Genus/species/collection names:

Glycine/max/genomes/Wm82_NJAU.gnm1.N4GV
Glycine/max/annotations/Wm82_NJAU.gnm1.ann1.KM71
[x] Add collection(s) to the Data Store (annex)
[x] Validate the README(s)
[x] Update about_this_collection.yml
[x] Calculate AHRD functional annotations
[x] Calculate gene family assignments (.gfa)
[ ] Add to pan-gene set
[ ] Load relevant mine
[ ] Add BLAST targets
[x] Incorporate into GCV
[ ] Update the jekyll collections listing
[ ] Update browser configs
[x] Run BUSCO
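
For scripting against the collection names listed above, here is a minimal sketch that splits a collection path into its parts. The pattern is only inferred from the two names in this issue (accession, `gnmN`, optional `annN`, then a four-character key); it is an assumption, not the Data Store specification, and `parse_collection` is a hypothetical helper name.

```python
import re

# Assumed pattern, inferred from the two collection names in this issue:
#   <accession>.gnm<N>[.ann<N>].<4-character key>
COLLECTION_RE = re.compile(
    r"^(?P<accession>[^.]+)\.gnm(?P<gnm>\d+)"
    r"(?:\.ann(?P<ann>\d+))?\.(?P<key>[A-Z0-9]{4})$"
)


def parse_collection(path):
    """Split 'Genus/species/type/name' into a dict of its components."""
    genus, species, kind, name = path.split("/")
    match = COLLECTION_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized collection name: {name}")
    fields = match.groupdict()  # 'ann' is None for genome collections
    fields.update(genus=genus, species=species, type=kind)
    return fields


genome = parse_collection("Glycine/max/genomes/Wm82_NJAU.gnm1.N4GV")
annotation = parse_collection("Glycine/max/annotations/Wm82_NJAU.gnm1.ann1.KM71")
```

A name that does not fit the assumed pattern raises `ValueError` rather than being parsed silently.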

adf-ncgr commented 10 months ago

@StevenCannon-USDA this is probably just a little glitch in processing, but FYI the file glyma.Wm82_NJAU.gnm1.ann1.KM71.protein_primary.faa.gz appears not to actually be compressed despite the suffix. Looks like samtools faidx in this case will just treat it as regular fasta and produce a fai but not a gzi file. I'll fix it, just wanted to let you know in case there's some aspect of processing that needs to be revisited (doesn't appear to have happened in the other annotation sets or with any of the other fasta in this one, though, so probably just a quirk of fate).
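
A quick way to catch a `.gz` file that is not actually compressed is to look at its first two bytes, which are `1f 8b` for any gzip stream. The sketch below is an illustration only, not part of the existing processing pipeline; `is_gzip_compressed` is a hypothetical helper name. It checks for gzip in general and does not confirm that the file is BGZF (bgzip) output.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_compressed(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

Running this over every `*.gz` file in a collection before indexing would flag a plain-text file carrying a `.gz` suffix.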

StevenCannon-USDA commented 10 months ago

@adf-ncgr Thanks; will fix this upstream.

maxglycine commented 10 months ago

Steven: I am not sure what you want me to review. Looking at the check boxes, I have the following questions:

How do you get the license tag number for the names? Is there a protocol? Include the path to the protocol.
How do you get to the annex? Specify the path.
How do you "validate" the README? Is there a protocol? Include the path to the protocol.
Where is the specification for "about_this_collection.yml"? Include the path to the protocol.
How do you calculate the gene family assignments? Is there a protocol? Include the path to the protocol.
How do you "add to pan-gene set"? Is there a protocol? Include the path to the protocol.
How do you "load relevant mine"? Is there a protocol? Include the path to the protocol.
How do you "incorporate into GCV"? Is there a protocol? Include the path to the protocol.
Is there a protocol for updating the JBrowse config? Include the path to the protocol.
Is there a protocol for running MultiQC/DSCensor? Include the path to the protocol.

StevenCannon-USDA commented 10 months ago

@maxglycine - Not looking for a review actually. Just wanted you to be aware that this collection was underway. This is the generic template for getting genome+accession collections loaded: to the Data Store, the Mines, to GCV, to SequenceServer, etc. But your questions about protocols are valid. A work in progress. The protocols are being collected here: https://github.com/legumeinfo/datastore-specifications/tree/main/PROTOCOLS

adf-ncgr commented 10 months ago

@StevenCannon-USDA the AHRD/BUSCO/GFA files are now in the annex folders. Shall we move these into v2 and proceed with downstream steps?

StevenCannon-USDA commented 10 months ago

> Shall we move these into v2 and proceed with downstream steps?

Thank you - and yes please!