legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz has bad (non-LIS) IDs #171

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago
[convertFile] ## Validating glyma collection Wm82_ISU01.gnm2.ann1.FGFB
[convertFile]  - glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
[convertFile] ## INVALID: glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz record ID attribute is missing or invalid:
[convertFile] ## INVALID: glyma.Wm82_ISU01.gnm2.Gm01 phytozomev13    CDS     311075  311213  0.0     0       ID=GmISU01.01G001200.1.CDS.18;Parent=GmISU01.01G001200.1;pacid=53729792
[convertFile] ## INVALID: {Parent=GmISU01.01G001200.1, pacid=53729792, ID=GmISU01.01G001200.1.CDS.18}
[convertFile] ## INVALID: glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
[convertFile] ## INVALID: glyma.Wm82_ISU01.gnm2.Gm01 phytozomev13    CDS     311075  311213  0.0     0       ID=GmISU01.01G001200.1.CDS.18;Parent=GmISU01.01G001200.1;pacid=53729792
[convertFile] ## INVALID: {Parent=GmISU01.01G001200.1, pacid=53729792, ID=GmISU01.01G001200.1.CDS.18}
sammyjava commented 1 year ago

I'm not sure who to blame for this, so I assigned it to both of you to pass the blame along. This blocks completion of the current GlycineMine build. The file was modified by Andrew and committed by Steven.

adf-ncgr commented 1 year ago

I will probably have to blame @StevenCannon-USDA for this one, on the assumption that he did the initial sous-chef-ing, though it's not clear to me what could have caused the observed behavior. There are 5705 records that don't have full yuck prefixing on either their own IDs or the Parent attributes they use to refer to other entities:

zgrep 'ID=GmISU0' glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz | wc -l
    5705
zgrep 'Parent=GmISU0' glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz | wc -l
    5705

There is no obvious (to me) reason why this subset would not have gotten the full yuck prefixing when the others did, but I'm probably not familiar enough with the workings of the sous-chef script to spot the hidden quirk. I think I could fix it in an ad hoc way, but @StevenCannon-USDA should probably get a chance to have a look first, in case something about the script needs addressing to future-proof against such follies.
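For anyone who wants to repeat the counts above by parsing the attribute column rather than grepping, here is a minimal Python sketch. It assumes the expected prefix is glyma.Wm82_ISU01.gnm2.ann1. (inferred from the prefixed IDs elsewhere in this collection, not taken from the loader's actual rules) and that the columns are tab-separated. The function names are made up for illustration.

```python
import gzip

# Assumption: the prefix every ID and Parent value should carry in this collection.
PREFIX = "glyma.Wm82_ISU01.gnm2.ann1."


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def count_unprefixed(lines, prefix=PREFIX):
    """Return (records with unprefixed ID, records with an unprefixed Parent)."""
    bad_ids = 0
    bad_parents = 0
    for line in lines:
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        if "ID" in attrs and not attrs["ID"].startswith(prefix):
            bad_ids += 1
        # A Parent attribute may list several parents separated by commas.
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        if any(not parent.startswith(prefix) for parent in parents):
            bad_parents += 1
    return bad_ids, bad_parents


def count_unprefixed_in_file(path, prefix=PREFIX):
    """Convenience wrapper for a gzipped GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return count_unprefixed(handle, prefix)
```

Calling count_unprefixed_in_file on the gene_models_main.gff3.gz file should give numbers comparable to the two zgrep counts, though the two approaches are not guaranteed to agree on every edge case.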

adf-ncgr commented 1 year ago

I should also note that the mRNA records affected by this in the GFF file don't seem to lack prefixing in the derived FASTA files, but I'm not sure whether those were actually derived from the GFF using gffread in this case or were just processed independently for prefix addition. An example:

zgrep GmISU01.16G085200.1 glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    mRNA    21011687    21025975    .   +   .   ID=GmISU01.16G085200.1;Name=GmISU01.16G085200.1;pacid=53740494;longest=1;ancestorIdentifier=GmISU01.16G085200.1.v1.1;Parent=GmISU01.16G085200
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21011687    21012031    .   +   0   ID=GmISU01.16G085200.1.CDS.1;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21012447    21012785    .   +   0   ID=GmISU01.16G085200.1.CDS.2;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21012911    21013264    .   +   0   ID=GmISU01.16G085200.1.CDS.3;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21013347    21013470    .   +   0   ID=GmISU01.16G085200.1.CDS.4;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21014664    21014743    .   +   2   ID=GmISU01.16G085200.1.CDS.5;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16  phytozomev13    CDS 21025961    21025975    .   +   0   ID=GmISU01.16G085200.1.CDS.6;Parent=GmISU01.16G085200.1;pacid=53740494

zgrep GmISU01.16G085200.1 *faa.gz *fna.gz
glyma.Wm82_ISU01.gnm2.ann1.FGFB.protein_primary.faa.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.protein.faa.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.cds_primary.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.cds.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.mrna_primary.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.mrna.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
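
The spot check above could be extended to every mRNA record with something like the following Python sketch. It only compares identifiers as literal strings and makes no claim about how the FASTA files were actually produced. The function names are hypothetical, and it assumes tab-separated GFF3 and FASTA headers whose first whitespace-delimited token is the sequence ID.

```python
def fasta_ids(fasta_lines):
    """Collect the first token of every FASTA header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def mrna_ids(gff_lines):
    """Collect the ID attribute of every mRNA feature in a GFF3 stream."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        for field in columns[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[len("ID="):])
    return ids


def missing_from_fasta(gff_lines, fasta_lines):
    """Return mRNA IDs from the GFF that match no FASTA header verbatim."""
    return sorted(mrna_ids(gff_lines) - fasta_ids(fasta_lines))
```

With the example record above, GmISU01.16G085200.1 would be reported as missing, because the FASTA headers carry the full glyma.Wm82_ISU01.gnm2.ann1. prefix and the GFF ID does not.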
StevenCannon-USDA commented 1 year ago

I'll see what I can do. I've been hit with another batch of urgent administrative stuff, and then travel on Thursday. But I'll be able to work while traveling.

adf-ncgr commented 1 year ago

No worries. Maybe I should just patch it up and leave an unpatched copy in private so you can look into what might have caused it, at your convenience. I'll assume that's the plan unless I hear otherwise from you.

adf-ncgr commented 1 year ago

OK, I think this should be patched now. @StevenCannon-USDA, if you want to look into what may have caused the issue, the pre-patch version of the file is available under /usr/local/www/data/private/Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB. Maybe not worth the forensic effort, though! @sammyjava, let me know if anything else comes up when you get to testing it via the loader.
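
For the record, an ad hoc patch of this kind could look roughly like the Python sketch below. This is not necessarily what was run here. It assumes the intended prefix is glyma.Wm82_ISU01.gnm2.ann1. (as seen in the FASTA headers) and that only the ID and Parent attributes need it; the function names are hypothetical.

```python
# Assumption: the prefix that ID and Parent values should carry.
PREFIX = "glyma.Wm82_ISU01.gnm2.ann1."


def add_prefix(value, prefix=PREFIX):
    """Prepend the prefix unless the value already starts with it."""
    return value if value.startswith(prefix) else prefix + value


def patch_line(line, prefix=PREFIX):
    """Return a GFF3 line with ID and Parent values prefixed where needed."""
    # Leave directives, comments, and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line
    fields = []
    for field in columns[8].split(";"):
        key, sep, value = field.partition("=")
        if sep and key in ("ID", "Parent"):
            # Parent may hold several comma-separated identifiers.
            value = ",".join(add_prefix(v, prefix) for v in value.split(","))
        fields.append(key + sep + value)
    columns[8] = ";".join(fields)
    return "\t".join(columns) + newline
```

Because add_prefix skips values that already carry the prefix, running patch_line over the whole file leaves the correctly prefixed records unchanged, so it can be applied to every line rather than only to the affected subset.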

StevenCannon-USDA commented 1 year ago

Thank you, @adf-ncgr! I have indeed ended up mired in administrativia all day today. I will do the forensics soon, since we don't want this issue cropping up with other data sets.

adf-ncgr commented 1 year ago

Sounds good, thanks for putting it on your dessert plate (for after you finish the administrative vegetables).

sammyjava commented 1 year ago

Looks good, thanks!