Closed StevenCannon-USDA closed 1 year ago
Hi Steven, regarding GCV incorporation, I think this is done already, though the 26 are living in their own special datasource (accessible through gcv.soybase.org).
It might well be worth redoing in a more formalized way (e.g. using Nathan's scripted approach for a docker build from the datastore), but before we go there it's probably worth asking if we want to continue to keep them somewhat separate, as they are now, rather than folding everything into the one "SoyBase" source. I think @maxglycine has formerly had some desire to keep "SoyBase" limited to USDA materials, or maybe some even more restrictive criterion, though we may have already blown that by including the "IGA" set.
I myself am by no means clear on what end-users would find most convenient, although in the current GCV implementation it is somewhat helpful to divide things up a la federation, if for no other reason than to allow people not to request more computation than they really want (otherwise you can only post-filter unwanted genomes after the requests and calculations have been made to a monolithic datasource; in principle, we could add up-front query params, but so far we haven't done it).
I suppose similar "lump or split" questions could be asked about other apps, but it's probably most relevant for GCV given that it was built to unify splitters (much like the Judean People's Front).
These collections all lack mRNA FASTAs, which are required for mine loading. Therefore I'll label this issue with "missing data".
On it. Not difficult conceptually, but will require some careful bash work to deal with the 26 collections.
You are the King of Careful BASH Work.
Ha. King of stubbing my toe on sharp BASH features.
This is done (though of course let me know if there are further problems). My notes, fwiw, are at /usr/local/www/data/private/Glycine/liu_et_al_2020_pangenome/notes/derive_mrna_v01.sh
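For anyone following along without access to that notes file, here is a rough sketch of the general idea of deriving mRNA sequences from a genome plus GFF. It is not the script referenced above. It assumes the GFF has tab-delimited `exon` records whose `Parent` is the mRNA ID, and that the genome is already in memory as a dict of sequence ID to string. The function names are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def derive_mrna(genome, gff_lines):
    """Splice exon intervals into one sequence per parent transcript.

    genome: dict mapping sequence ID -> sequence string (assumed in memory).
    gff_lines: iterable of tab-delimited GFF3 lines.
    """
    exons = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(a.split("=", 1) for a in cols[8].split(";") if "=" in a)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))

    mrnas = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda p: p[1])  # ascending genomic start
        # GFF coordinates are 1-based and inclusive.
        seq = "".join(genome[seqid][start - 1:end] for seqid, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = reverse_complement(seq)
        mrnas[transcript_id] = seq
    return mrnas
```

Writing the resulting dict out as FASTA, and reading the genome from a compressed FASTA, are left out to keep the sketch short.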
Oops - reopening the multi-task (though the mRNA sub-task is done.)
The first collection I checked shows scads of duplicate CDS IDs in the GFF (now that I'm checking for dupes). Which is a different thing from the lack of the mRNA FASTA, but here's an example. I think typically we suffix .1, .2, etc.
glyma.58-161.gnm1.Chr06 maker CDS 6544899 6545129 . + 0 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6545391 6545531 . + 0 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6545612 6545717 . + 0 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6545974 6546019 . + 1 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6546131 6546231 . + 2 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6546305 6546356 . + 1 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker CDS 6546609 6546693 . + 2 ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
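A minimal sketch of the suffixing convention mentioned above, in case it helps: every CDS record gets `.1`, `.2`, etc. appended to its ID, counted per original ID. This assumes tab-delimited GFF3 lines with `ID=` somewhere in column 9. The function name is hypothetical, and this is not necessarily how the fix will actually be done.

```python
from collections import defaultdict


def suffix_duplicate_cds_ids(gff_lines):
    """Return GFF3 lines with .1, .2, ... appended to each CDS record's ID."""
    counts = defaultdict(int)
    out = []
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        # Pass through comments, malformed lines, and non-CDS features unchanged.
        if text.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(text)
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("ID="):
                counts[attr] += 1
                attrs[i] = f"{attr}.{counts[attr]}"
                break
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out
```

Note that this suffixes every CDS, including single-segment ones that were already unique, so the IDs stay uniform across a collection. Only the ID is touched; `Parent` and the other attributes are left as they are.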
Be it known that most of these have duplicate CDS IDs and therefore do not pass validation and cannot be loaded into the mines.
INVALID: glyma.58-161.gnm1.ann1.HJ1K.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Amsoy.gnm1.ann1.6S5P.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.DongNongNo_50.gnm1.ann1.QSDB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.FengDiHuang.gnm1.ann1.P6HL.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HanDouNo_5.gnm1.ann1.ZS7M.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HeiHeNo_43.gnm1.ann1.PDXG.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JiDouNo_17.gnm1.ann1.X5PX.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JinDouNo_23.gnm1.ann1.SGJW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JuXuanNo_23.gnm1.ann1.H8PW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.KeShanNo_1.gnm1.ann1.2YX4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_398296.gnm1.ann1.B0XR.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_548362.gnm1.ann1.LL84.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.QiHuangNo_34.gnm1.ann1.WHRV.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ShiShengChangYe.gnm1.ann1.VLGS.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieFengNo_18.gnm1.ann1.7GR4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TongShanTianEDan.gnm1.ann1.56XW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.WanDouNo_28.gnm1.ann1.NLYP.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.XuDouNo_1.gnm1.ann1.G2T7.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.YuDouNo_22.gnm1.ann1.HCQ1.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Zhutwinning2.gnm1.ann1.ZTTQ.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.gene_models_main.gff3.gz record ID is duplicate of one already read:
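For a quick pre-check before running the real validator, something like the following can list repeated IDs in a gzipped GFF. It is only a sketch under the assumption of tab-delimited GFF3 with `ID=` in column 9. The function names are hypothetical, and it does not reproduce the validator's other checks.

```python
import gzip


def gff_ids(lines):
    """Yield the ID attribute of each feature line that has one."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for attr in cols[8].split(";"):
            if attr.startswith("ID="):
                yield attr[3:]
                break


def find_duplicate_ids(lines):
    """Return IDs that occur more than once, in order of first repeat."""
    seen = set()
    reported = set()
    duplicates = []
    for feature_id in gff_ids(lines):
        if feature_id in seen and feature_id not in reported:
            reported.add(feature_id)
            duplicates.append(feature_id)
        seen.add(feature_id)
    return duplicates


def find_duplicate_ids_in_file(path):
    """Run find_duplicate_ids over a gzipped GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return find_duplicate_ids(handle)
```

An empty list from `find_duplicate_ids_in_file` would mean the file is at least free of this particular problem.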
In addition, we've got an invalid parent attribute which breaks loading this collection:
INVALID: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: glyma.Lee.gnm1.Gm01 phytozomev13 five_prime_UTR 37774 37783 0.0 . ID=glyma.Lee.gnm1.ann1.GlymaLee.01G000100.1.five_prime_UTR.1;Parent=glyma.Lee.gnm1.ann1.GlymaLee.0
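The "does the file need sorting?" hint suggests the validator expects a parent record to appear before any child that refers to it. That is my reading of the message, not something confirmed from the validator source. Under that assumption, a rough check for out-of-order or missing parents could look like this (hypothetical function name, tab-delimited GFF3 assumed):

```python
def find_unresolved_parents(gff_lines):
    """Return (line_number, feature_id, parent_id) for each Parent reference
    that has not been defined by an earlier record's ID."""
    seen = set()
    problems = []
    for line_number, line in enumerate(gff_lines, 1):
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(a.split("=", 1) for a in cols[8].split(";") if "=" in a)
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen:
                problems.append((line_number, attrs.get("ID"), parent))
        if "ID" in attrs:
            seen.add(attrs["ID"])
    return problems
```

If every reported parent does show up later in the file, re-sorting so that genes precede mRNAs and mRNAs precede their sub-features should clear the error. If a parent never appears at all, sorting will not help.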
FYI, just started tackling this, and should be able to do most of what's needed, but note that the relevant CHANGES files appear to have the wrong group ownership, so I can't actually add notes to indicate what I've done. Probably not a big deal and shouldn't be a blocker for getting @sammyjava what he needs, but @cann0010 maybe you can take care of those group issues at your convenience:
find /usr/local/www/data/v2/Glycine -group scannon -ls
263882 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_562565.gnm1.ann1.1JD2/CHANGES.PI_562565.gnm1.ann1.1JD2.txt
264739 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_578357.gnm1.ann1.0ZKP/CHANGES.PI_578357.gnm1.ann1.0ZKP.txt
...
(all 26 of the Liu et al annotation folders seem to be the ones affected)
I think these (+ 3 from Glycine soja not listed here but part of the Liu et al 2020 set and having the same issue) should all be handled. They were just using the convention where CDS features are grouped by ID, but they should now have unique suffixes. Lee is not part of that set and I don't know where the sorting on that came from, but I fixed it. Let me know if further issues manifest.
Running annotation validation on Glycine soja, I probably just didn't catch this earlier:
## INVALID: glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
## INVALID: glyso.F_IGA1003.gnm1.chr11 maker mRNA 6940 10361 0.0 . ID=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200.1;Parent=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200;Name=SoyGsojaF_11G000200.1
Running annotation validation on Glycine max, in pretty good shape with just this straggler:
## INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
## INVALID: glyma.TieJiaSiLiHuang.gnm1.Chr01 maker CDS -102970 -102904 0.0 2 ID=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.m1.cds;Parent=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.
Whoops, I seem to have just overlooked that one. It should now be fixed, as should the glyso.F_IGA1003.gnm1.ann1 sorting issue.
Looks good! I'll load the remaining annotations into Glycine/LegumeMine.
@cann0010 I'm all set with these new genomes/annotations, close this issue if we're all done with them.
Awesome. Thanks for tackling this, @adf-ncgr and @sammyjava !
Main steps for adding new genome and annotation collections
Genus/species/collection names:
These have been added under genomes/ and annotations/ for both Glycine max (23) and Glycine soja (3).