Closed: sammyjava closed 1 year ago
I'm not sure who to blame for this so I assigned it to both of you to pass the blame on. This blocks completing the current GlycineMine build. Modified by Andrew, committed by Steven.
I will probably have to blame @StevenCannon-USDA for this one, on the assumption that he did the initial sous-chef-ing, though it's not clear to me what could have caused the observed behavior: 5705 records lack the full yuck prefix on their own IDs, and 5705 lack it on the Parent attributes they use to refer to other entities:
zgrep 'ID=GmISU0' glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz | wc -l
5705
zgrep 'Parent=GmISU0' glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz | wc -l
5705
There is no obvious (to me) reason why this subset did not get the full yuck prefixing when the others did, but I'm probably not familiar enough with the workings of the sous-chef script to spot the hidden quirk. I think I could fix it in an ad hoc way, but @StevenCannon-USDA should probably get a chance to have a look first, in case something about the script needs addressing to future-proof against such follies.
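For reference, the kind of ad hoc fix I have in mind could look roughly like the sketch below. It is only a sketch under assumptions: the function name `add_prefix` is made up, the prefix `glyma.Wm82_ISU01.gnm2.ann1` is my guess at what full prefixing should look like for this collection, and I am assuming Name values are meant to stay bare. It is not necessarily how the file will actually get patched.

```shell
# Hypothetical helper (not part of the sous-chef script): read GFF3 on stdin and
# add the given prefix to ID= and Parent= values that start with "GmISU0".
# Assumptions: the prefix contains no "/" or "&", and Name= is left untouched.
add_prefix() {
  # $1 is the prefix without a trailing dot, e.g. glyma.Wm82_ISU01.gnm2.ann1
  # The leading group anchors the key at the start of the line, after
  # whitespace, or after ";" so that keys such as ancestorIdentifier= are skipped.
  sed -E "s/(^|[;[:space:]])(ID|Parent)=(GmISU0)/\1\2=$1.\3/g"
}

# Possible usage (file names are placeholders):
# gzip -dc unpatched.gff3.gz | add_prefix glyma.Wm82_ISU01.gnm2.ann1 > patched.gff3
```

Already-prefixed values do not match the pattern, so running it twice should not double the prefix.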
I should also note that the mRNA records affected by this in the GFF file don't show the missing prefix in the derived FASTA files, but I'm not sure whether those were actually derived from the GFF using gffread in this case or just processed independently for prefix addition. An example:
zgrep GmISU01.16G085200.1 glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 mRNA 21011687 21025975 . + . ID=GmISU01.16G085200.1;Name=GmISU01.16G085200.1;pacid=53740494;longest=1;ancestorIdentifier=GmISU01.16G085200.1.v1.1;Parent=GmISU01.16G085200
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21011687 21012031 . + 0 ID=GmISU01.16G085200.1.CDS.1;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21012447 21012785 . + 0 ID=GmISU01.16G085200.1.CDS.2;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21012911 21013264 . + 0 ID=GmISU01.16G085200.1.CDS.3;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21013347 21013470 . + 0 ID=GmISU01.16G085200.1.CDS.4;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21014664 21014743 . + 2 ID=GmISU01.16G085200.1.CDS.5;Parent=GmISU01.16G085200.1;pacid=53740494
glyma.Wm82_ISU01.gnm2.Gm16 phytozomev13 CDS 21025961 21025975 . + 0 ID=GmISU01.16G085200.1.CDS.6;Parent=GmISU01.16G085200.1;pacid=53740494
zgrep GmISU01.16G085200.1 *faa.gz *fna.gz
glyma.Wm82_ISU01.gnm2.ann1.FGFB.protein_primary.faa.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.protein.faa.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.cds_primary.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.cds.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.mrna_primary.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
glyma.Wm82_ISU01.gnm2.ann1.FGFB.mrna.fna.gz:>glyma.Wm82_ISU01.gnm2.ann1.GmISU01.16G085200.1
I'll see what I can do. I've been hit with another batch of urgent administrative stuff, and then travel on Thursday. But I'll be able to work while traveling.
No worries, maybe I should just patch it up and leave an unpatched copy in private for you to have a look at what might have been the cause, at your convenience. I'll just assume that's the plan unless I hear otherwise from you.
OK, I think this should be patched now. If you, @StevenCannon-USDA, want to look into what may have caused the issue, the pre-patch version of the file is available under /usr/local/www/data/private/Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB (maybe not worth the forensic effort, though!). @sammyjava, let me know if anything else comes up when you get to testing it via the loader.
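In case it is useful before running the loader, here is a rough sanity check one could run on the file: it lists Parent values that do not match any ID in the same GFF3. This is a sketch, not something the loader or sous-chef provides. The function name `dangling_parents` is hypothetical, and it assumes tab-separated columns with the attributes in column 9.

```shell
# Hypothetical helper: read GFF3 on stdin and print every Parent value that
# has no matching ID anywhere in the stream (sorted, one per line).
dangling_parents() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }                  # skip comments and malformed lines
    {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/) {
          ids[substr(attrs[i], 4)] = 1       # text after "ID="
        } else if (attrs[i] ~ /^Parent=/) {
          # Parent may hold a comma-separated list of values
          m = split(substr(attrs[i], 8), ps, ",")
          for (j = 1; j <= m; j++) { parents[ps[j]] = 1 }
        }
      }
    }
    END { for (p in parents) { if (!(p in ids)) { print p } } }
  ' | sort
}

# Possible usage (file name is a placeholder):
# gzip -dc patched.gff3.gz | dangling_parents | wc -l
```

An empty result would suggest that every Parent reference resolves within the file. It says nothing about whether the prefixes themselves are the right ones.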
Thank you, @adf-ncgr! I have indeed ended up mired in administrativia all day today. I will do the forensics soon, since we don't want this issue cropping up with other data sets.
Sounds good, thanks for putting it on your dessert plate (for after you finish the administrative vegetables).
Looks good, thanks!