Closed sammyjava closed 2 years ago
Another check added after seeing that this file fouled up LegumeMine, which is just going to have to be the way it is because there is 0% chance of loading a LegumeMine without DS-sourced errors, because there are as many forms of errant data in the DS as there are galaxies in the new images from the James Webb. Whack-a-Mole is a trivial exercise in comparison.
Validating medtr collection A17.gnm5.ann1_6.L2RX
- [x] medtr.A17.gnm5.ann1_6.L2RX.phytozome_10_2.HFNR.gfa.tsv.gz
INVALID: Gene family identifier 59253386 in medtr.A17.gnm5.ann1_6.L2RX.phytozome_10_2.HFNR.gfa.tsv.gz is not valid:
medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0042611 59253386 medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0042611 1.6e-64 220.2 220.1
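For illustration, a check of this kind can be sketched in a few lines of Python. This is not the validator's actual code. It assumes that a valid gene family identifier in the second column must start with the family-set name taken from the file name (here `phytozome_10_2`) followed by a dot; that rule, and the helper names `family_prefix_from_filename` and `invalid_family_ids`, are hypothetical.

```python
def family_prefix_from_filename(filename):
    """Hypothetical helper: pull the family-set name out of a gfa file name.

    Assumes the name ends in <family_set>.<KEY4>.gfa.tsv.gz, so the family
    set is the field two positions before "gfa".
    """
    parts = filename.split(".")
    return parts[parts.index("gfa") - 2]


def invalid_family_ids(lines, prefix):
    """Return (line_number, family_id) for rows whose second column
    does not start with '<prefix>.' (an assumed validity rule)."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        family_id = fields[1] if len(fields) > 1 else ""
        if not family_id.startswith(prefix + "."):
            problems.append((number, family_id))
    return problems
```

To run it on a real file, the lines could come from `gzip.open(path, "rt")`. Under the assumed rule, the row reported above would be flagged because `59253386` carries no prefix.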
Sorry about that; this file must have been generated before I baked the prefixes into the HMM db. It should be fixed now, for whenever LegumeMine gets built again in future. BTW, I learned recently that "whack-a-mole" could be construed as "guacamole" in French; see https://justinehsmith.substack.com/p/notes-on-the-vibe-shift for more details if interested.
With respect to:
INVALID: README.Col.gnm9.ann10.WZQD.yml is not present in collection Col.gnm9.ann10.WZQD
and several others of this ilk: these are from non-legume outgroup species included in the gene families. I'm not sure I remember why they got split out into Arabidopsis/thaliana/annotations-style collections, but they seem to consist only of primary proteins and gfa files. It's possible I constructed the latter by assigning genes to families after the gene family build, in order to deal with the fact that (as I understand it) at most one representative gene from each outgroup species was included when the families were built, meaning that Arabidopsis genes will often seem to be absent. For example, if I look in the family fastas in the gene families I find only 12757 Arabidopsis records, whereas the gfa file contains 24616 assignments. I suppose it would be worth loading these assignments into the mines (to enable queries based on non-legume genes with known functions). Presumably the explanation I've just reminded myself of can serve as the meat (or plant-based protein) of their READMEs, when I get around to it.
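A comparison like the one described (records in the family fastas versus assignments in the gfa file) could be reproduced with a short script. The following is only a sketch: it assumes that the Arabidopsis identifiers share a common prefix (`arath.` is used as a placeholder), that the gene is in the first column of the gfa file, and that counting distinct genes is what is wanted. The function names are hypothetical.

```python
def count_fasta_records(lines, prefix):
    """Count FASTA headers whose identifier starts with the given prefix."""
    return sum(1 for line in lines if line.startswith(">" + prefix))


def count_gfa_assignments(lines, prefix):
    """Count distinct genes (first column) with the given prefix that have
    a family assignment (a non-empty second column)."""
    genes = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith(prefix):
            genes.add(fields[0])
    return len(genes)
```

Both functions take any iterable of text lines, so they work equally on `gzip.open(path, "rt")` handles or on lists of strings.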
I'm closing this. I hit invalid collections as I load mines, and I have no plan to deal with this stuff if it's not breaking mine loads.
INVALID: README.Col.gnm9.ann10.WZQD.yml is not present in collection Col.gnm9.ann10.WZQD
INVALID: Required file type gene_models_main.gff3.gz is not present in ISC453364.gnm1.ann1.HZJM
INVALID: chafa.MN87.gnm1.ann1.LWFM.gene_models_main.gff3.gz has an invalid ID attribute:
INVALID: chafa.MN87.gnm1.ann1.LWFM.protein.faa.gz has an invalid sequence identifier in header:
Chafa95793S30485
INVALID: README.Gy14.gnm1.ann1.Y8XX.yml is not present in collection Gy14.gnm1.ann1.Y8XX
INVALID: faial.WAFC.gnm1.ann1.RTP9.gene_models_main.gff3.gz has an invalid ID attribute:
INVALID: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: lotja.MG20.gnm3.ann1.WF9B.gene_models_main.gff3.gz record ID attribute is missing or invalid:
INVALID: lotja.MG20.gnm3.ann1.WF9B.protein.faa.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: lotja.MG20.gnm3.ann1.WF9B.protein_primary.faa.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: lotja.MG20.gnm3.ann1.WF9B.cds.fna.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: lotja.MG20.gnm3.ann1.WF9B.cds_primary.fna.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: lotja.MG20.gnm3.ann1.WF9B.mrna.fna.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: lotja.MG20.gnm3.ann1.WF9B.mrna_primary.fna.gz has an invalid sequence identifier in header:
AB433810.1
INVALID: medtr.R108_HM340.gnm1.ann1.85YW.iprscan.gff3.gz has an invalid seqname:
INVALID: README.Lovell.gnm2.ann1.S2ZZ.yml is not present in collection Lovell.gnm2.ann1.S2ZZ
INVALID: README.Heinz1706.gnm2_5.ann2_4.Q2DC.yml is not present in collection Heinz1706.gnm2_5.ann2_4.Q2DC
INVALID: tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz record parent attribute is invalid; does the file need sorting?
INVALID: README.PN40024.gnm12X.ann1.V31M.yml is not present in collection PN40024.gnm12X.ann1.V31M
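For the "record parent attribute is invalid; does the file need sorting?" messages, one plausible reading is that a child record appears before the record named in its `Parent` attribute. The sketch below checks for exactly that and nothing more. It is an assumption about what the validator tests, not its actual logic, and the function names are hypothetical.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column into a dict (key=value;key=value)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def parents_out_of_order(lines):
    """Return (line_number, parent_id) for records whose Parent has not
    been defined by an earlier record's ID attribute."""
    seen_ids = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a feature record
        attrs = parse_attributes(fields[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                problems.append((number, parent))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return problems
```

If this returns a non-empty list for a file, sorting it so that genes precede their mRNAs and mRNAs precede their exons and CDS would be the first thing to try.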