Genus/species/collection names:

These have been added under genomes/ and annotations/ for both Glycine max (23) and Glycine soja (3).

[X] Add collection(s) to the Data Store
[X] Validate the README(s)
[X] Update about_this_collection.yml
[X] Calculate AHRD functional annotations
[X] Calculate gene family assignments (.gfa)
[X] Add to pan-gene set
[ ] Load relevant mine
[ ] Add BLAST targets
[ ] Incorporate into GCV
[ ] Update the jekyll collections listing
[ ] Update browser configs
[X] Run BUSCO

adf-ncgr commented 1 year ago

Hi Steven- regarding GCV incorporation, I think this is done already, though the 26 are living in their own special datasource (accessible through gcv.soybase.org):

[image]

It might well be worth redoing in a more formalized way (e.g. using Nathan's scripted approach for a docker build from the datastore); but before we go there it's probably worth asking if we want to continue to keep them somewhat separate as pictured above rather than folding everything into the one "SoyBase" source. I think @maxglycine has formerly had some desire to keep "SoyBase" limited to USDA materials, or maybe some even more restrictive criterion- though we may have already blown that by including the "IGA" set.

I myself am by no means clear on what end-users would find most convenient, although in the current GCV implementation it is somewhat helpful to divide things up a la federation, if for no other reason than to allow people not to request more computation than they really want (otherwise you can only post-filter unwanted genomes after the requests and calculations have been made to a monolithic datasource; in principle, we could add up-front query params, but so far we haven't done it).

I suppose similar "lump or split" questions could be asked about other apps, but it's probably most relevant for GCV given that it was built to unify splitters (much like the Judean People's Front).

sammyjava commented 1 year ago

These collections all lack mRNA FASTAs, which are required for mine loading. Therefore I'll label this issue with "missing data".

INVALID: Required file glyma.58-161.gnm1.ann1.HJ1K.mrna.fna.gz is not present in 58-161.gnm1.ann1.HJ1K
INVALID: Required file glyma.Amsoy.gnm1.ann1.6S5P.mrna.fna.gz is not present in Amsoy.gnm1.ann1.6S5P
INVALID: Required file glyma.DongNongNo_50.gnm1.ann1.QSDB.mrna.fna.gz is not present in DongNongNo_50.gnm1.ann1.QSDB
INVALID: Required file glyma.FengDiHuang.gnm1.ann1.P6HL.mrna.fna.gz is not present in FengDiHuang.gnm1.ann1.P6HL
INVALID: Required file glyma.HanDouNo_5.gnm1.ann1.ZS7M.mrna.fna.gz is not present in HanDouNo_5.gnm1.ann1.ZS7M
INVALID: Required file glyma.HeiHeNo_43.gnm1.ann1.PDXG.mrna.fna.gz is not present in HeiHeNo_43.gnm1.ann1.PDXG
INVALID: Required file glyma.JiDouNo_17.gnm1.ann1.X5PX.mrna.fna.gz is not present in JiDouNo_17.gnm1.ann1.X5PX
INVALID: Required file glyma.JinDouNo_23.gnm1.ann1.SGJW.mrna.fna.gz is not present in JinDouNo_23.gnm1.ann1.SGJW
INVALID: Required file glyma.JuXuanNo_23.gnm1.ann1.H8PW.mrna.fna.gz is not present in JuXuanNo_23.gnm1.ann1.H8PW
INVALID: Required file glyma.KeShanNo_1.gnm1.ann1.2YX4.mrna.fna.gz is not present in KeShanNo_1.gnm1.ann1.2YX4
INVALID: Required file glyma.PI_398296.gnm1.ann1.B0XR.mrna.fna.gz is not present in PI_398296.gnm1.ann1.B0XR
INVALID: Required file glyma.PI_548362.gnm1.ann1.LL84.mrna.fna.gz is not present in PI_548362.gnm1.ann1.LL84
INVALID: Required file glyma.QiHuangNo_34.gnm1.ann1.WHRV.mrna.fna.gz is not present in QiHuangNo_34.gnm1.ann1.WHRV
INVALID: Required file glyma.ShiShengChangYe.gnm1.ann1.VLGS.mrna.fna.gz is not present in ShiShengChangYe.gnm1.ann1.VLGS
INVALID: Required file glyma.TieFengNo_18.gnm1.ann1.7GR4.mrna.fna.gz is not present in TieFengNo_18.gnm1.ann1.7GR4
INVALID: Required file glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.mrna.fna.gz is not present in TieJiaSiLiHuang.gnm1.ann1.W70Z
INVALID: Required file glyma.TongShanTianEDan.gnm1.ann1.56XW.mrna.fna.gz is not present in TongShanTianEDan.gnm1.ann1.56XW
INVALID: Required file glyma.WanDouNo_28.gnm1.ann1.NLYP.mrna.fna.gz is not present in WanDouNo_28.gnm1.ann1.NLYP
INVALID: Required file glyma.XuDouNo_1.gnm1.ann1.G2T7.mrna.fna.gz is not present in XuDouNo_1.gnm1.ann1.G2T7
INVALID: Required file glyma.YuDouNo_22.gnm1.ann1.HCQ1.mrna.fna.gz is not present in YuDouNo_22.gnm1.ann1.HCQ1
INVALID: Required file glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.mrna.fna.gz is not present in ZhangChunManCangJin.gnm1.ann1.7HPB
INVALID: Required file glyma.Zhutwinning2.gnm1.ann1.ZTTQ.mrna.fna.gz is not present in Zhutwinning2.gnm1.ann1.ZTTQ
INVALID: Required file glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.mrna.fna.gz is not present in ZiHuaNo_4.gnm1.ann1.FCFQ
INVALID: Required file glyso.PI_562565.gnm1.ann1.1JD2.mrna.fna.gz is not present in PI_562565.gnm1.ann1.1JD2
INVALID: Required file glyso.PI_578357.gnm1.ann1.0ZKP.mrna.fna.gz is not present in PI_578357.gnm1.ann1.0ZKP
INVALID: Required file glyso.PI_549046.gnm1.ann1.65KD.mrna.fna.gz is not present in PI_549046.gnm1.ann1.65KD
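For anyone wanting to reproduce this kind of check locally, here is a minimal sketch. It assumes the naming pattern visible in the messages above (a species prefix such as glyma or glyso, then the collection directory name, then .mrna.fna.gz). The function names are hypothetical and this is not the validator's actual code.

```python
from pathlib import Path


def expected_mrna_name(gensp: str, collection: str) -> str:
    """Build the mRNA FASTA name implied by the messages above (assumed pattern)."""
    return f"{gensp}.{collection}.mrna.fna.gz"


def missing_mrna(annotations_dir: Path, gensp: str) -> list[str]:
    """Return the collection directories under annotations_dir lacking an mRNA FASTA."""
    missing = []
    for coll in sorted(p for p in annotations_dir.iterdir() if p.is_dir()):
        if not (coll / expected_mrna_name(gensp, coll.name)).is_file():
            missing.append(coll.name)
    return missing
```

Calling missing_mrna(Path("annotations"), "glyma") on a local copy of an annotations directory would list the collections that still need the file; the directory path here is only a placeholder.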

StevenCannon-USDA commented 1 year ago

On it. Not difficult conceptually, but will require some careful bash work to deal with the 26 collections.

sammyjava commented 1 year ago

You are the King of Careful BASH Work.

StevenCannon-USDA commented 1 year ago

Ha. King of stubbing my toe on sharp BASH features.

StevenCannon-USDA commented 1 year ago

This is done (though of course let me know if there are further problems). My notes, fwiw, are at /usr/local/www/data/private/Glycine/liu_et_al_2020_pangenome/notes/derive_mrna_v01.sh

StevenCannon-USDA commented 1 year ago

Oops - reopening the multi-task (though the mRNA sub-task is done.)

sammyjava commented 1 year ago

The first collection I checked shows scads of duplicate CDS IDs in the GFF (now that I'm checking for dupes). Which is a different thing from the lack of the mRNA FASTA, but here's an example. I think typically we suffix .1, .2, etc.

glyma.58-161.gnm1.Chr06 maker   CDS 6544899 6545129 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545391 6545531 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545612 6545717 .   +   0   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6545974 6546019 .   +   1   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546131 6546231 .   +   2   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546305 6546356 .   +   1   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
glyma.58-161.gnm1.Chr06 maker   CDS 6546609 6546693 .   +   2   ID=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1.cds;Parent=glyma.58-161.gnm1.ann1.SoyL04_06G079400.m1;Parent_Accession=GWHTACEE021273;Protein_Accession=GWHPACEE021273
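A rough sketch of the kind of duplicate-ID check described here, assuming a standard tab-delimited GFF3 with attributes in column 9. The function names are made up for illustration, and the real validator may well do this differently.

```python
from collections import Counter


def gff3_ids(lines):
    """Yield the ID attribute of each feature line that has one."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        for attr in cols[8].split(";"):
            if attr.startswith("ID="):
                yield attr[3:]
                break


def duplicate_ids(lines):
    """Map each ID that occurs more than once to its number of occurrences."""
    counts = Counter(gff3_ids(lines))
    return {feature_id: n for feature_id, n in counts.items() if n > 1}
```

For a gzipped file, the lines can come from gzip.open(path, "rt"). On the example above, the seven CDS rows sharing one ID would be reported with a count of 7.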

sammyjava commented 1 year ago

Be it known that most of these have duplicate CDS IDs and therefore do not pass validation and cannot be loaded into the mines.

INVALID: glyma.58-161.gnm1.ann1.HJ1K.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Amsoy.gnm1.ann1.6S5P.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.DongNongNo_50.gnm1.ann1.QSDB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.FengDiHuang.gnm1.ann1.P6HL.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HanDouNo_5.gnm1.ann1.ZS7M.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.HeiHeNo_43.gnm1.ann1.PDXG.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JiDouNo_17.gnm1.ann1.X5PX.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JinDouNo_23.gnm1.ann1.SGJW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.JuXuanNo_23.gnm1.ann1.H8PW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.KeShanNo_1.gnm1.ann1.2YX4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_398296.gnm1.ann1.B0XR.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.PI_548362.gnm1.ann1.LL84.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.QiHuangNo_34.gnm1.ann1.WHRV.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ShiShengChangYe.gnm1.ann1.VLGS.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieFengNo_18.gnm1.ann1.7GR4.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.TongShanTianEDan.gnm1.ann1.56XW.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.WanDouNo_28.gnm1.ann1.NLYP.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.XuDouNo_1.gnm1.ann1.G2T7.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.YuDouNo_22.gnm1.ann1.HCQ1.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.Zhutwinning2.gnm1.ann1.ZTTQ.gene_models_main.gff3.gz record ID is duplicate of one already read:
INVALID: glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.gene_models_main.gff3.gz record ID is duplicate of one already read:

In addition, we've got an invalid parent attribute which breaks loading this collection:

INVALID: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: glyma.Lee.gnm1.Gm01    phytozomev13    five_prime_UTR  37774   37783   0.0 .   ID=glyma.Lee.gnm1.ann1.GlymaLee.01G000100.1.five_prime_UTR.1;Parent=glyma.Lee.gnm1.ann1.GlymaLee.0

adf-ncgr commented 1 year ago

FYI- just started tackling this, and should be able to do most of what's needed but note that the relevant CHANGES files appear to have the wrong group ownership so I can't actually add notes to indicate what I've done. Probably not a big deal and shouldn't be a blocker for getting @sammyjava what he needs but @cann0010 maybe you can take care of those group issues at your convenience:

find /usr/local/www/data/v2/Glycine -group scannon -ls
263882 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_562565.gnm1.ann1.1JD2/CHANGES.PI_562565.gnm1.ann1.1JD2.txt
264739 9 -rw-rw-r-- 1 scannon scannon 253 Jan 25 12:44 /usr/local/www/data/v2/Glycine/soja/annotations/PI_578357.gnm1.ann1.0ZKP/CHANGES.PI_578357.gnm1.ann1.0ZKP.txt
...

(all 26 of the Liu et al annotation folders seem to be the ones affected)

adf-ncgr commented 1 year ago

I think these (+ 3 from Glycine soja not listed here but part of the Liu et al 2020 set and having the same issue) should all be handled. They were just using the convention where CDS features are grouped by ID, but they should now have unique suffixes. Lee is not part of that set and I don't know where the sorting on that came from, but I fixed it. Let me know if further issues manifest.
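For the record, a sketch of what "unique suffixes" could look like in practice, assuming tab-delimited GFF3 and assuming that every CDS row gets a running .1, .2, ... appended to its ID in file order. This is an illustration with a made-up function name, not the script that was actually run on these collections.

```python
from collections import defaultdict


def suffix_cds_ids(lines):
    """Yield GFF3 lines with a running .1, .2, ... appended to each CDS ID."""
    seen = defaultdict(int)  # original "ID=..." attribute -> occurrences so far
    for line in lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if text.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield text  # pass through anything that is not a CDS feature
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("ID="):
                seen[attr] += 1
                attrs[i] = f"{attr}.{seen[attr]}"
                break
        cols[8] = ";".join(attrs)
        yield "\t".join(cols)
```

Since nothing in these files appears to use a CDS as a Parent, renaming the CDS IDs should not require touching any other records; that is worth verifying before running anything like this on real data.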

sammyjava commented 1 year ago

Running annotation validation on Glycine soja, I probably just didn't catch this earlier:

## INVALID: glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
## INVALID: glyso.F_IGA1003.gnm1.chr11  maker   mRNA    6940    10361   0.0 .   ID=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200.1;Parent=glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_11G000200;Name=SoyGsojaF_11G000200.1

sammyjava commented 1 year ago

Running annotation validation on Glycine max, in pretty good shape with just this straggler:

## INVALID: glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main.gff3.gz record ID is duplicate of one already read:
## INVALID: glyma.TieJiaSiLiHuang.gnm1.Chr01    maker   CDS -102970 -102904 0.0 2   ID=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.m1.cds;Parent=glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G000100.

adf-ncgr commented 1 year ago

Whoops, I seem to have just overlooked that one. That should now be fixed, as well as the glyso.F_IGA1003.gnm1.ann1 sorting issue.

sammyjava commented 1 year ago

Looks good! I'll load the remaining annotations into Glycine/LegumeMine.

@cann0010 I'm all set with these new genomes/annotations, close this issue if we're all done with them.

StevenCannon-USDA commented 1 year ago

Awesome. Thanks for tackling this, @adf-ncgr and @sammyjava !

legumeinfo / datastore-issues

26 new Glycine genomes & annotations from Liu et al. 2020 #146

Main steps for adding new genome and annotation collections
