legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Cercis canadensis genome #159

Open adf-ncgr opened 1 year ago

adf-ncgr commented 1 year ago

I think this is one that @cann0010 already has started to some extent, but I noticed that phytozome recently posted:

Cercis canadensis V3.1	eastern redbud	Mar 8, 2023
Cercis canadensis HAP2 V3.1	eastern redbud	Mar 8, 2023

and I'm sure we don't want to lag too far behind

Main steps for adding new genome and annotation collections

Genus/species/collection names:

What are the collection types and names? Example:

Cercis/canadensis/genomes/ISC453364.gnm3.GWXB
Cercis/canadensis/annotations/ISC453364.gnm3.ann1.3N1M

[x] Add collection(s) to the Data Store
[x] Validate the README(s)
[x] Update about_this_collection.yml
[x] Calculate AHRD functional annotations
[x] Calculate gene family assignments (.gfa)
[x] Load relevant mine
[ ] Add BLAST targets
[x] Incorporate into GCV
[x] Update the jekyll collections listing
[x] Update browser configs
[x] run BUSCO

StevenCannon-USDA commented 1 year ago

I have added the assembly and annotations to the Data Store. Handing off to you, @adf-ncgr, for next steps.

adf-ncgr commented 1 year ago

Thanks @cann0010, I have picked this up and started checking items off. I removed the "add to pan genes" task since I don't think it's relevant for this one, but feel free to add it back if I'm wrong about that.

adf-ncgr commented 1 year ago

@cann0010 there's something that looks a bit off in the derived fastas; instead of full yuck ids, the headers look like this:

>Cecan.H010200.1 HASH UNDEFINED

I can of course add full yuck, but I'm not sure if there's something else that ought to be done about this (like defining a hash, I guess).

StevenCannon-USDA commented 1 year ago

Thanks for reporting the HASH UNDEFINED issue, @adf-ncgr. That indicates a problem with the featid_map relative to the feature names, probably having to do with the suffix that JGI applies (e.g. Cecan.1G000400.2.V3.1.CDS.2). I'll try to fix this sometime this weekend. In the meantime, I have moved the new collections (annotations and genome) back to "private."
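
A mismatch of this kind can be caught before the derived fastas are built by comparing the IDs in the FASTA headers against the ID attributes in the GFF. The following is a minimal sketch, not the Data Store tooling itself: the function names are hypothetical, and it assumes that the first whitespace-delimited token of each FASTA header is the feature ID and that the GFF is well-formed nine-column GFF3.

```python
def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_ids(gff_lines):
    """Collect the ID attribute of every feature line in a GFF3 file."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        for attr in cols[8].split(";"):
            key, sep, value = attr.partition("=")
            if sep and key == "ID":
                ids.add(value)
    return ids


def unmatched_fasta_ids(fasta_lines, gff_lines):
    """Return FASTA IDs that have no matching ID attribute in the GFF."""
    return sorted(fasta_ids(fasta_lines) - gff_ids(gff_lines))
```

Both arguments can be open file handles or lists of lines. A non-empty result suggests that the FASTA headers and the GFF IDs follow different naming schemes, as they did here.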

StevenCannon-USDA commented 1 year ago

@adf-ncgr - I think I have fixed this HASH UNDEFINED problem, which was caused by JGI's application of ".V3.1" in all feature IDs in the GFFs, but not in the fasta files. The simplest fix was to remove that suffix from a version of the upstream GFF.
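
Removing the version tag from the GFF could be sketched as below. This is an illustration under stated assumptions rather than the commands actually used: the function name is hypothetical, the tag is taken to be the literal string ".V3.1", and only the ID and Parent attributes are rewritten (Name is left alone, since it may already lack the tag).

```python
# Assumption: the version tag is the literal ".V3.1" described in the thread.
VERSION_TAG = ".V3.1"
# Only these attributes are rewritten; Name and others are left untouched.
ID_KEYS = ("ID", "Parent")


def strip_version_tag(gff_line, tag=VERSION_TAG):
    """Remove `tag` from the ID and Parent values of one GFF3 line."""
    if gff_line.startswith("#"):
        return gff_line  # directives and comments pass through unchanged
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return gff_line  # not a feature line
    attrs = []
    for attr in cols[8].split(";"):
        key, sep, value = attr.partition("=")
        if sep and key in ID_KEYS:
            value = value.replace(tag, "")
        attrs.append(key + sep + value)
    cols[8] = ";".join(attrs)
    newline = "\n" if gff_line.endswith("\n") else ""
    return "\t".join(cols) + newline
```

With this, an ID such as Cecan.1G000400.2.V3.1.CDS.2 becomes Cecan.1G000400.2.CDS.2. A plain string replacement would also alter a longer tag that merely starts with ".V3.1", so a stricter pattern may be needed for other releases.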

The new collection is here: /usr/local/www/data/private/Cercis/canadensis/annotations/ISC453364.gnm3.ann1.3N1M

I believe your cerca.ISC453364.gnm3.ann1.3N1M.gene_models_main.gff3.gz (with functional info) can be used to replace the new (but function-less) cerca.ISC453364.gnm3.ann1.3N1M.gene_models_main.gff3 file. I have not compressed or indexed the new files. I think the gfa, iprscan, and BUSCOs should be OK, but I'll wait for your verification.

adf-ncgr commented 1 year ago

@cann0010 yes, I think that all sounds right. I had already post-fixed the gfa and iprscan with the missing prefixes. I forgot about BUSCO needing it in the full_table output, but that's now fixed too. I guess the moral of this story is that JGI believes that the fasta headers should reflect the gff Name attribute rather than the ID, which is certainly a defensible reading of the gff standard, but not the way we've been doing things. Anyway, feel free to move back to public when ready.
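
Post-fixing a tab-delimited output with a missing ID prefix might look like the sketch below. The function name and parameters are hypothetical. Which column holds the feature ID differs between the gfa, iprscan, and BUSCO outputs, so the column index is an argument rather than a fixed value, and the prefix shown in the usage note is only inferred from the file names in this thread.

```python
def add_id_prefix(lines, prefix, col=0):
    """Yield tab-delimited lines with `prefix` prepended to column `col`.

    Comment lines, lines too short to have that column, and values that
    already carry the prefix are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        newline = "\n" if line.endswith("\n") else ""
        cols = line.rstrip("\n").split("\t")
        if col < len(cols) and cols[col] and not cols[col].startswith(prefix):
            cols[col] = prefix + cols[col]
        yield "\t".join(cols) + newline
```

For example, `add_id_prefix(handle, "cerca.ISC453364.gnm3.ann1.", col=0)` would prefix the first column of each row, assuming that prefix is the right one for this collection. Skipping values that already have the prefix makes the function safe to run twice on the same file.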