legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

26 new Glycine genomes & annotations from Liu et al. 2020 #146

Closed StevenCannon-USDA closed 1 year ago

StevenCannon-USDA commented 1 year ago

Main steps for adding new genome and annotation collections

Genus/species/collection names:

These have been added under genomes/ and annotations/ for both Glycine max (23) and Glycine soja (3).

adf-ncgr commented 1 year ago

Hi Steven- regarding GCV incorporation, I think this is done already, though the 26 are living in their own special datasource (accessible through gcv.soybase.org):

[image]

It might well be worth redoing in a more formalized way (e.g. using Nathan's scripted approach for a docker build from the datastore); but before we go there it's probably worth asking if we want to continue to keep them somewhat separate as pictured above rather than folding everything into the one "SoyBase" source. I think @maxglycine has formerly had some desire to keep "SoyBase" limited to USDA materials, or maybe some even more restrictive criterion- though we may have already blown that by including the "IGA" set.

I myself am by no means clear on what end-users would find most convenient, although in the current GCV implementation it is somewhat helpful to divide things up a la federation, if for no other reason than to allow people not to request more computation than they really want (otherwise you can only post-filter unwanted genomes after the requests and calculations have been made to a monolithic datasource; in principle, we could add up-front query params, but so far we haven't done it).

I suppose similar "lump or split" questions could be asked about other apps, but it's probably most relevant for GCV given that it was built to unify splitters (much like the Judean People's Front).

sammyjava commented 1 year ago

These collections all lack mRNA FASTAs, which are required for mine loading. Therefore I'll label this issue with "missing data".

StevenCannon-USDA commented 1 year ago

On it. Not difficult conceptually, but will require some careful bash work to deal with the 26 collections.

sammyjava commented 1 year ago

You are the King of Careful BASH Work.

StevenCannon-USDA commented 1 year ago

Ha. King of stubbing my toe on sharp BASH features.

StevenCannon-USDA commented 1 year ago

This is done (though of course let me know if there are further problems). My notes, fwiw, are at /usr/local/www/data/private/Glycine/liu_et_al_2020_pangenome/notes/derive_mrna_v01.sh
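
The notes file above is not reproduced in this thread, so the following is only a rough sketch of what deriving mRNA sequences from a genome plus GFF3 can look like, not the procedure that was actually used. It assumes the GFF3 has tab-separated `exon` rows whose `Parent` is the mRNA ID, and that the genome is already loaded into a dict; the function names are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_transcripts(genome, gff_lines):
    """Return {mRNA ID: spliced exon sequence}.

    genome    -- dict mapping seqid to its full sequence (assumed input shape)
    gff_lines -- iterable of tab-separated GFF3 lines
    """
    exons = defaultdict(list)  # mRNA ID -> [(seqid, start, end, strand)]
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))

    transcripts = {}
    for mrna_id, parts in exons.items():
        parts.sort(key=lambda p: p[1])  # genomic order by start coordinate
        seqid, _, _, strand = parts[0]
        # GFF3 coordinates are 1-based and inclusive.
        seq = "".join(genome[seqid][start - 1:end] for _, start, end, _ in parts)
        transcripts[mrna_id] = reverse_complement(seq) if strand == "-" else seq
    return transcripts
```

For 26 collections, the same function would be called once per genome/annotation pair; reading the FASTA and writing the output are left out to keep the sketch short.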

StevenCannon-USDA commented 1 year ago

Oops - reopening the multi-task (though the mRNA sub-task is done.)

sammyjava commented 1 year ago

The first collection I checked shows scads of duplicate CDS IDs in the GFF (now that I'm checking for dupes). Which is a different thing from the lack of the mRNA FASTA, but here's an example. I think typically we suffix .1, .2, etc.

glyma.58-161.gnm1.Chr06 maker   CDS 6544899 6545129 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545391 6545531 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545612 6545717 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545974 6546019 .   +   1   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546131 6546231 .   +   2   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546305 6546356 .   +   1   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546609 6546693 .   +   2   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
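
A minimal sketch of how such duplicates could be counted, assuming a tab-separated GFF3 file (the columns above appear space-separated only because of how the page was rendered). This is not the validator used for the mines; the function names are hypothetical.

```python
import gzip
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def duplicate_ids(lines, feature_type="CDS"):
    """Return {ID: count} for IDs used more than once on rows of feature_type."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id:
            counts[feature_id] += 1
    return {fid: n for fid, n in counts.items() if n > 1}


def duplicate_ids_in_file(path, feature_type="CDS"):
    """Same check, reading a plain or gzipped GFF3 file from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return duplicate_ids(handle, feature_type)
```

Run on the seven rows above, `duplicate_ids` would report the single shared ID with a count of 7.
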
sammyjava commented 1 year ago

Be it known that most of these have duplicate CDS IDs and therefore do not pass validation and cannot be loaded into the mines.

INVALID: glyma.58-161.gnm1.ann1.HJ1K.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Amsoy.gnm1.ann1.6S5P.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.DongNongNo_50.gnm1.ann1.QSDB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.FengDiHuang.gnm1.ann1.P6HL.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HanDouNo_5.gnm1.ann1.ZS7M.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HeiHeNo_43.gnm1.ann1.PDXG.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JiDouNo_17.gnm1.ann1.X5PX.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JinDouNo_23.gnm1.ann1.SGJW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JuXuanNo_23.gnm1.ann1.H8PW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.KeShanNo_1.gnm1.ann1.2YX4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_398296.gnm1.ann1.B0XR.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_548362.gnm1.ann1.LL84.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.QiHuangNo_34.gnm1.ann1.WHRV.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ShiShengChangYe.gnm1.ann1.VLGS.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieFengNo_18.gnm1.ann1.7GR4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TongShanTianEDan.gnm1.ann1.56XW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.WanDouNo_28.gnm1.ann1.NLYP.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.XuDouNo_1.gnm1.ann1.G2T7.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.YuDouNo_22.gnm1.ann1.HCQ1.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Zhutwinning2.gnm1.ann1.ZTTQ.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.gene_models_main.gff3.gz record ID is duplicate of one already read:

In addition, we've got an invalid parent attribute which breaks loading this collection:

INVALID: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: glyma.Lee.gnm1.Gm01    phytozomev13    five_prime_UTR  37774   37783   0.0 .   ID=glyma.Lee.gnm1.ann1.GlymaLee.01G000100.1.five_prime_UTR.1;Parent=glyma.Lee.gnm1.ann1.GlymaLee.0
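
The "does the file need sorting?" hint suggests the validator expects each parent to appear before its children. That reading is a guess from the message, not from the validator's source. Under that assumption, a quick ordering check on a tab-separated GFF3 might look like this (hypothetical function name):

```python
def orphaned_records(lines):
    """Return (line_number, ID, missing_parent) for every record whose
    Parent ID has not appeared as an ID on an earlier line."""
    seen_ids = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # A GFF3 Parent attribute may list several comma-separated IDs.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                problems.append((number, attrs.get("ID"), parent))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return problems
```

An empty result means every child follows its parent; a non-empty one lists the records that would need the file re-sorted (or the parent added).
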
adf-ncgr commented 1 year ago

FYI- just started tackling this, and should be able to do most of what's needed but note that the relevant CHANGES files appear to have the wrong group ownership so I can't actually add notes to indicate what I've done. Probably not a big deal and shouldn't be a blocker for getting @sammyjava what he needs but @cann0010 maybe you can take care of those group issues at your convenience:

find /usr/local/www/data/v2/Glycine -group scannon -ls
263882 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_562565.gnm1.ann1.1JD2/CHANGES.PI_562565.gnm1.ann1.1JD2.txt
264739 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_578357.gnm1.ann1.0ZKP/CHANGES.PI_578357.gnm1.ann1.0ZKP.txt
...

(all 26 of the Liu et al annotation folders seem to be the ones affected)

adf-ncgr commented 1 year ago

I think these (+ 3 from Glycine soja not listed here but part of the Liu et al 2020 set and having the same issue) should all be handled. They were just using the convention where CDS features are grouped by ID, but they should now have unique suffixes. Lee is not part of that set and I don't know where the sorting on that came from, but I fixed it. Let me know if further issues manifest.
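
For anyone repeating this on other collections, here is a sketch of the suffixing idea (.1, .2, ... per shared CDS ID). It is not the script that was actually run. It assumes tab-separated GFF3 and that no other feature refers to a CDS ID through `Parent`; the function name is hypothetical.

```python
from collections import defaultdict


def uniquify_cds_ids(lines):
    """Yield GFF3 lines with each CDS ID given a numeric suffix in order of
    appearance (first row .1, second .2, ...). Non-CDS lines pass through."""
    seen = defaultdict(int)  # original CDS ID -> rows seen so far
    for line in lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        if stripped.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield stripped
            continue
        fields = cols[8].split(";")
        for i, field in enumerate(fields):
            if field.startswith("ID="):
                base = field[3:]
                seen[base] += 1
                fields[i] = f"ID={base}.{seen[base]}"
                break
        cols[8] = ";".join(fields)
        yield "\t".join(cols)
```

Numbering follows file order, so on minus-strand genes the suffixes run in genomic order rather than 5' to 3' unless the rows are already sorted that way.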

sammyjava commented 1 year ago

Running annotation validation on Glycine soja, I probably just didn't catch this earlier:

## INVALID: glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
## INVALID: glyso.F_IGA1003.gnm1.chr11  maker   mRNA    6940    10361   0.0 .   ID=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200.1;Parent=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200;Name=SoyGsojaF_11G000200.1
sammyjava commented 1 year ago

Running annotation validation on Glycine max, in pretty good shape with just this straggler:

## INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
## INVALID: glyma.TieJiaSiLiHuang.gnm1.Chr01    maker   CDS -102970 -102904 0.0 2   ID=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.m1.cds;Parent=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.
adf-ncgr commented 1 year ago

Whoops, I seem to have just overlooked that one. That should now be fixed, as well as the glyso.F_IGA1003.gnm1.ann1 sorting issue.

sammyjava commented 1 year ago

Looks good! I'll load the remaining annotations into Glycine/LegumeMine.

@cann0010 I'm all set with these new genomes/annotations, close this issue if we're all done with them.

StevenCannon-USDA commented 1 year ago

Awesome. Thanks for tackling this, @adf-ncgr and @sammyjava !