legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Arachis hypogaea chromosome names #157

Open StevenCannon-USDA opened 1 year ago

StevenCannon-USDA commented 1 year ago

The chromosome prefixes from the Arachis genome initiative are funky -- at least relative to the Data Store funk. They have the form "Gensp.[AB]*\d\d". I think this derives from our proto-yuckification (with addition of Aradu, Araip, or Arahy). The problem is that the prefixes (as we have pulled the data into the Data store) introduce an additional dot. That will be trouble for some downstream processes (pandagma; maybe the mines?)

cd /usr/local/www/data/v2/Arachis
head -1 */genomes/*/*genome_main.fna.gz.fai | awk 'NF==5 && $1~/Ara/ {print " ", $1}'
  aradu.V14167.gnm1.Aradu.A01
  arahy.Tifrunner.gnm1.Arahy.01
  arahy.Tifrunner.gnm2.Arahy.01
  araip.K30076.gnm1.Araip.B01

Arguably, we should have replaced the proto-prefix with our current prefix, giving one of the following:

  aradu.V14167.gnm1.chrA01
  arahy.Tifrunner.gnm1.chr01
  arahy.Tifrunner.gnm2.chr01
  araip.K30076.gnm1.chrB01

  aradu.V14167.gnm1.chr01
  arahy.Tifrunner.gnm1.chr01
  arahy.Tifrunner.gnm2.chr01
  araip.K30076.gnm1.chr01

Doing either of these would break stuff -- but I am tempted to undertake the change, in pursuit of site-wide normalcy.
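For concreteness, here is a minimal sketch of the two renaming options as a sed substitution. The function name rename_arachis_seqid and its keep/drop modes are made up for illustration and are not part of any existing Data Store tooling. It assumes one bare sequence ID per line on stdin, so it would need adapting for FASTA headers, GFF columns, and so on.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the legacy Arachis prefix in sequence IDs
# read from stdin (one ID per line, ID at end of line).
#   mode "keep": Aradu.A01 -> chrA01 (subgenome letter retained)
#   mode "drop": Aradu.A01 -> chr01  (subgenome letter discarded)
rename_arachis_seqid() {
  mode=$1
  if [ "$mode" = "keep" ]; then
    sed -E 's/\.Ara(du|hy|ip)\.([AB]?)([0-9][0-9])$/.chr\2\3/'
  else
    sed -E 's/\.Ara(du|hy|ip)\.[AB]?([0-9][0-9])$/.chr\2/'
  fi
}

# Example: apply the first option to two of the IDs listed above.
printf '%s\n' aradu.V14167.gnm1.Aradu.A01 arahy.Tifrunner.gnm1.Arahy.01 |
  rename_arachis_seqid keep
```

The pattern is anchored at the end of the line and is case-sensitive, so the lowercase aradu/arahy/araip at the start of each ID is left alone.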

What do you think, @sdash-github ? @sammyjava ? @adf-ncgr ?

adf-ncgr commented 1 year ago

Extra dots are not forbidden, as long as they don't occur within the full yuck components. That is, I can't call something arahy.Tifrunner.gnm1.2.ann3.14.chr1, but I can call it arahy.Tifrunner.gnm1_2.ann3_14.chr1.1. In fact, chr1.1 shows up in at least one place: the "allele-aware" alfalfa assembly, which uses .1 .2 .3 .4 to discriminate the different haplotypes.

This doesn't mean we can't change the current naming, e.g. to get rid of redundant yuck (even though arahy is so nice you need to say it twice). But I think tools shouldn't insist on not having dots in the post-yuck part of the name.
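As a rough sketch of what dot-tolerant parsing could look like: count prefix components from the left rather than splitting on the last dot. The helper name local_name is hypothetical, and the component counts (3 for a genome-level ID, 4 for an annotation-level ID) are assumptions based on the examples in this thread.

```shell
#!/bin/sh
# Hypothetical helper: return the post-yuck part of an ID.
#   $1 = full ID
#   $2 = number of dot-separated components in the yuck prefix
#        (assumed: 3 for gensp.accession.gnmN, 4 when .annN is included)
local_name() {
  printf '%s\n' "$1" | cut -d. -f"$(( $2 + 1 ))"-
}

local_name arahy.Tifrunner.gnm1.Arahy.01 3
local_name arahy.Tifrunner.gnm1_2.ann3_14.chr1.1 4

# By contrast, splitting on the last dot loses part of the name:
id=arahy.Tifrunner.gnm1_2.ann3_14.chr1.1
printf '%s\n' "${id##*.}"
```

The first two calls keep everything after the prefix, dots included; the last line shows what a tool would get if it assumed the name has no dots of its own.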

sammyjava commented 1 year ago

Not a problem for the mines as per @adf-ncgr above but I still don't like it. :)

adf-ncgr commented 1 year ago

nobody objects to .1 suffixes for isoforms- same principle applies here IMHO (even though I still don't like isoforms!)

sdash-github commented 1 year ago

Just a comment about the benefit: it will be very useful to have the chr# mentioned in the gene model. Paul and I felt the lack of it when trying to look at them in the family tree. For Arachis there wasn't a clue as to which chromosome they belong to, which would have eliminated further checking, until we visited GCV (our next check), while other legumes had the chr clue.

We had a similar difficulty while writing instructions for identifying the homoeolog of a gene in arahy.

adf-ncgr commented 1 year ago

FWIW, I'm planning to use names like "Ah18g066100" in the new annotations, but I don't think we ought to retrofit the old ones, too many things outside LIS already using the given names. But you still won't be able to assume Ah08g066100 is the homoeolog of Ah18g066100 ...
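To illustrate how such names could be read by a script, here is a small sketch that pulls the chromosome number out of a gene name. The function name gene_chr is hypothetical, and the pattern (Ah, a two-digit chromosome, g, then digits) is inferred only from the single example above, so treat it as an assumption rather than a spec.

```shell
#!/bin/sh
# Hypothetical helper: print the two-digit chromosome embedded in a gene
# name of the assumed form Ah<NN>g<digits>; print nothing if the name
# does not match that form.
gene_chr() {
  printf '%s\n' "$1" | sed -nE 's/^Ah([0-9][0-9])g[0-9]+$/\1/p'
}

gene_chr Ah18g066100
gene_chr Ah08g066100
```

This only recovers the chromosome; as noted, a matching suffix on two chromosomes says nothing reliable about homoeology.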

sdash-github commented 1 year ago

Ah18g and Ah08g, and not the 06610 (random string in the current annotation) part, were our clues, and only if they were in the same family tree. We also checked whether there were orthologs in aradu and araip chr08 as closer members in the tree to support the clue. Basically it would help us eliminate those genes in the tree located on chr09 or chr05, which is not possible now. Our last check was going to GCV and checking the relative location (coordinates on the chr). As long as the chr# shows up in three gene-ids it helps avoid unnecessary further checking.

Currently, Cajca, Cicar, Arahy/du/ip do not have the chr# info in the trees. A not so exact example of which spp do not have chr# in gene models: https://legacy.legumeinfo.org/chado_phylotree/legfed_v1_0.L_0YC55C

adf-ncgr commented 1 year ago

I see, I misunderstood the last part of your message about writing instructions for identifying homoeologs. In any case, we agree that the benefits of semantically semi-transparent ids in this case outweigh their risks (e.g. that they become obsolete as assemblies/annotations change).

StevenCannon-USDA commented 1 year ago

Not hearing any clear "Nays", I'll plan to tackle the renaming -- probably later in the month, after finishing some other projects. It will be a moderately big job, since it will affect chromosome prefixes in the assembly and annotation files, and for other files with features mapped onto the assemblies. (Actually, I will do some preparatory work now, as I work on the hypogaea annotations and the Arachis pan-genes; but that will go initially into "private").

adf-ncgr commented 1 year ago

speaking of work on the hypogaea/annotations and pangenes, I have a GCV that includes the new and old Tifrunner as well as the NCBI version of Tifrunner.gnm1 annotations and BaileyII. Will probably get the old version of the diploids in as well. Any other genomes you are planning to include in the pangenes?

StevenCannon-USDA commented 1 year ago

@adf-ncgr I am working on the NCBI version of Tifrunner.gnm1 annotations and BaileyII as we speak. They should be souschef-ified by EOD. Will then run them through pandagma, along with your Tifrunner.gnm2.ann2 candidate. I propose to demote singleton gnm2.ann2 models to "low-confidence." Separate thread though :-)

adf-ncgr commented 1 year ago

PS. this nascent GCV is based on assignment of the protein-coding genes to the legfed gene families. There are some interesting differences among the various annotation sets, concerning which I have a long email in progress (but the hijacker in me couldn't resist the opportunity to entangle this thread a bit)

StevenCannon-USDA commented 1 year ago

The conversion is started. The first assembly to get updated chromosome prefixes (Arahy.01 => chr01) is Arachis/hypogaea/genomes/Tifrunner.gnm2.J5K5/ ... for correspondence with the new annotations at Tifrunner.gnm2.ann2.PVFB/

Some relationships will be broken until the prefixes in all of the assemblies and annotations are updated and consistent. Heads-up to @sdash-github , @adf-ncgr , @sammyjava

adf-ncgr commented 1 year ago

Tifrunner.gnm2.ann1.4K0L is among the broken.