Open StevenCannon-USDA opened 1 year ago
Extra dots are not forbidden, as long as they don't occur within the full yuck components. That is, I can't call something arahy.Tifrunner.gnm1.2.ann3.14.chr1, but I can call it arahy.Tifrunner.gnm1_2.ann3_14.chr1.1. In fact, chr1.1 shows up in at least one place: the "allele-aware" alfalfa assembly, which uses .1 .2 .3 .4 to discriminate the different haplotypes.
This doesn't mean we can't change the current naming, e.g. to get rid of redundant yuck (even though arahy is so nice you need to say it twice). But I think tools shouldn't insist on not having dots in the post-yuck part of the name.
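To make the "dots are fine after the yuck" point concrete, here is a minimal sketch of a parser that reads only the leading components and leaves the remainder untouched. The layout `gensp.Accession.gnmN[.annN].<rest>` and the names `PREFIX_RE` and `split_name` are assumptions for illustration, not the Data Store's official grammar.

```python
import re

# Assumed prefix layout: gensp.Accession.gnmN[.annN].<rest>
# Only the leading components are dot-free; <rest> may contain dots.
PREFIX_RE = re.compile(
    r"^(?P<gensp>[A-Za-z]+)\."
    r"(?P<accession>[^.]+)\."
    r"(?P<gnm>gnm[^.]+)"
    r"(?:\.(?P<ann>ann[^.]+))?"
    r"\.(?P<rest>.+)$"
)

def split_name(name):
    """Split a prefixed name into its leading components and the remainder."""
    m = PREFIX_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized name: {name}")
    return m.groupdict()

parts = split_name("arahy.Tifrunner.gnm1_2.ann3_14.chr1.1")
print(parts["ann"], parts["rest"])
```

A parser like this cannot detect a dot inside a component: for arahy.Tifrunner.gnm1.2.ann3.14.chr1 it would quietly read the genome version as gnm1 and push everything else into the remainder, which is why dots stay off-limits in that part of the name.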
Not a problem for the mines as per @adf-ncgr above but I still don't like it. :)
Nobody objects to .1 suffixes for isoforms; the same principle applies here IMHO (even though I still don't like isoforms!)
Just a comment about the benefit: it will be very useful to have the chr# in the gene model name. Paul and I felt the lack of it when looking at gene models in the family tree. For Arachis there was no clue as to which chromosome they belong to, which would have let us skip further checking, until we visited GCV (our next check), while other legumes had the chr clue.
We had a similar difficulty while writing instructions for identifying the homoeolog of a gene in arahy.
On 3/2/23 3:05 PM, Steven Cannon wrote:
The chromosome prefixes from the Arachis genome initiative are funky -- at least relative to the Data Store funk. They have the form "Gensp.[AB]*\d\d". I think this derives from our proto-yuckification (with addition of Aradu, Araip, or Arahy). The problem is that the prefixes (as we have pulled the data into the Data store) introduce an additional dot. That is problematic for some downstream processes (pandagma; maybe the mines?)
    cd /usr/local/www/data/v2/Arachis
    head -1 /genomes//*genome_main.fna.gz.fai | awk 'NF==5 && $1~/Ara/ {print " ", $1}'
      aradu.V14167.gnm1.Aradu.A01
      arahy.Tifrunner.gnm1.Arahy.01
      arahy.Tifrunner.gnm2.Arahy.01
      araip.K30076.gnm1.Araip.B01
Arguably, we should have replaced the proto-prefix with our current prefix, giving one of the following:
    aradu.V14167.gnm1.chrA01
    arahy.Tifrunner.gnm1.chr01
    arahy.Tifrunner.gnm2.chr01
    araip.K30076.gnm1.chrB01

or

    aradu.V14167.gnm1.chr01
    arahy.Tifrunner.gnm1.chr01
    arahy.Tifrunner.gnm2.chr01
    araip.K30076.gnm1.chr01
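The two renaming options can be sketched as a single substitution on the sequence ID. This is only an illustration of the mapping, not the script used for the actual conversion; the names `PROTO_RE` and `rename_chromosome` and the `keep_subgenome` switch are hypothetical.

```python
import re

# Matches the proto-prefix form described above, "Gensp.[AB]*\d\d",
# when it appears as the last part of a sequence ID,
# e.g. ".Aradu.A01" or ".Arahy.01".
PROTO_RE = re.compile(r"\.(?:Aradu|Araip|Arahy)\.([AB]*)(\d\d)$")

def rename_chromosome(seq_id, keep_subgenome=True):
    """Replace the proto-prefix with 'chr', optionally keeping the A/B letter."""
    def repl(m):
        letter = m.group(1) if keep_subgenome else ""
        return f".chr{letter}{m.group(2)}"
    return PROTO_RE.sub(repl, seq_id)

for old in ("aradu.V14167.gnm1.Aradu.A01", "arahy.Tifrunner.gnm1.Arahy.01"):
    print(old, "=>", rename_chromosome(old), "or", rename_chromosome(old, keep_subgenome=False))
```

IDs that do not end in the proto-prefix form are returned unchanged, so the function could be applied to every sequence ID in a file without special-casing the rest.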
Doing either of these would break stuff -- but I am tempted to undertake the change, in pursuit of site-wide normalcy.
What do you think, @sdash-github ? @sammyjava ? @adf-ncgr ?
FWIW, I'm planning to use names like "Ah18g066100" in the new annotations, but I don't think we ought to retrofit the old ones, too many things outside LIS already using the given names. But you still won't be able to assume Ah08g066100 is the homoeolog of Ah18g066100 ...
Ah18g and Ah08g, not the 06610 part (a random string in the current annotation), were our clues, and only if they were in the same family tree. We also checked whether there were orthologs on aradu and araip chr08 as closer members in the tree to support the clue. Basically, it would help us eliminate those genes in the tree located on chr09 or chr05, which is not possible now. Our last check was going to GCV and checking the relative location and coordinates on the chr. As long as the chr# shows up in the gene IDs, it helps avoid unnecessary further checking.
Currently, Cajca, Cicar, Arahy/du/ip do not have the chr# info in the trees. A not so exact example of which spp do not have chr# in gene models: https://legacy.legumeinfo.org/chado_phylotree/legfed_v1_0.L_0YC55C
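The chromosome clue described above could be sketched as a filter over family-tree members. This assumes IDs of the form "Ah18g066100" (two chromosome digits after "Ah") and assumes that homoeologous chromosomes differ by 10, as the Ah08g/Ah18g pairing suggests. The function names and all example IDs other than Ah18g066100 are hypothetical, and the result is only a shortlist of candidates to check in GCV, not a homoeolog call.

```python
import re

# Hypothetical ID form: "Ah" + two-digit chromosome + "g" + serial number.
GENE_RE = re.compile(r"^Ah(\d\d)g(\d+)$")

def chromosome_of(gene_id):
    """Return the chromosome number encoded in the ID, or None if absent."""
    m = GENE_RE.match(gene_id)
    return int(m.group(1)) if m else None

def candidate_homoeologs(query, family_members, offset=10):
    """Keep family members on the chromosome `offset` away from the query's."""
    q = chromosome_of(query)
    if q is None:
        return []
    wanted = {q + offset, q - offset}
    return [g for g in family_members
            if g != query and chromosome_of(g) in wanted]

family = ["Ah08g012300", "Ah09g045600", "Ah05g000100", "Ah18g066100"]
print(candidate_homoeologs("Ah18g066100", family))
```

IDs without an encoded chromosome return None from `chromosome_of` and are never shortlisted, which mirrors the current situation for Arachis: without the chr# in the name, every family member needs a separate lookup.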
On 3/2/23 5:42 PM, adf-ncgr wrote:
FWIW, I'm planning to use names like "Ah18g066100" in the new annotations, but I don't think we ought to retrofit the old ones, too many things outside LIS already using the given names. But you still won't be able to assume Ah08g066100 is the homoeolog of Ah18g066100 ...
I see, I misunderstood the last part of your message about writing instructions for identifying homoeologs. In any case, we agree that the benefits of semantically semi-transparent ids in this case outweigh their risks (e.g. that they become obsolete as assemblies/annotations change).
Not hearing any clear "Nays", I'll plan to tackle the renaming -- probably later in the month, after finishing some other projects. It will be a moderately big job, since it will affect chromosome prefixes in the assembly and annotation files, and for other files with features mapped onto the assemblies. (Actually, I will do some preparatory work now, as I work on the hypogaea annotations and the Arachis pan-genes; but that will go initially into "private").
speaking of work on the hypogaea/annotations and pangenes, I have a GCV that includes the new and old Tifrunner as well as the NCBI version of Tifrunner.gnm1 annotations and BaileyII. Will probably get the old version of the diploids in as well. Any other genomes you are planning to include in the pangenes?
@adf-ncgr I am working on the NCBI version of Tifrunner.gnm1 annotations and BaileyII as we speak. They should be souschef-ified by EOD. Will then run them through pandagma, along with your Tifrunner.gnm2.ann2 candidate. I propose to demote singleton gnm2.ann2 models to "low-confidence." Separate thread though :-)
PS. this nascent GCV is based on assignment of the protein-coding genes to the legfed gene families. There are some interesting differences among the various annotation sets, concerning which I have a long email in progress (but the hijacker in me couldn't resist the opportunity to entangle this thread a bit)
The conversion is started. The first assembly to get updated chromosome prefixes (Arahy.01 => chr01) is Arachis/hypogaea/genomes/Tifrunner.gnm2.J5K5/ ... for correspondence with the new annotations at Tifrunner.gnm2.ann2.PVFB/
Some relationships will be broken until the prefixes in all of the assemblies and annotations are updated and consistent. Heads-up to @sdash-github , @adf-ncgr , @sammyjava
Tifrunner.gnm2.ann1.4K0L is among the broken.
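While the conversion is in progress, one way to spot broken relationships like this is to compare the sequence IDs used in an annotation with those in the assembly index. The sketch below assumes tab-separated .fai and GFF3 lines with the sequence ID in the first column; the function names and the example lines are hypothetical.

```python
def fai_ids(fai_lines):
    """Sequence IDs from .fai lines (first tab-separated column)."""
    return {line.split("\t", 1)[0] for line in fai_lines if line.strip()}

def gff_ids(gff_lines):
    """Sequence IDs from GFF3 feature lines, skipping comments and directives."""
    return {line.split("\t", 1)[0] for line in gff_lines
            if line.strip() and not line.startswith("#")}

def unmatched_seqids(fai_lines, gff_lines):
    """Annotation sequence IDs that are missing from the assembly index."""
    return sorted(gff_ids(gff_lines) - fai_ids(fai_lines))

# Made-up example: assembly already renamed, annotation still on the old prefix.
fai = ["arahy.Tifrunner.gnm2.chr01\t1000\t30\t60\t61\n"]
gff = ["##gff-version 3\n",
       "arahy.Tifrunner.gnm2.Arahy.01\t.\tgene\t1\t100\t.\t+\t.\tID=x\n"]
print(unmatched_seqids(fai, gff))
```

An empty result would mean every annotated sequence ID is present in the assembly index; anything listed still carries a prefix the assembly no longer uses.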