legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Annotation collections failing validation #112

Closed sammyjava closed 2 years ago

sammyjava commented 2 years ago

Chafa95793S30485

sammyjava commented 2 years ago

Another check added after seeing that this file fouled up LegumeMine, which is just going to have to be the way it is because there is 0% chance of loading a LegumeMine without DS-sourced errors, because there are as many forms of errant data in the DS as there are galaxies in the new images from the James Webb. Whack-a-Mole is a trivial exercise in comparison.


Validating medtr collection A17.gnm5.ann1_6.L2RX

  • [x] medtr.A17.gnm5.ann1_6.L2RX.phytozome_10_2.HFNR.gfa.tsv.gz

INVALID: Gene family identifier 59253386 in medtr.A17.gnm5.ann1_6.L2RX.phytozome_10_2.HFNR.gfa.tsv.gz is not valid:

medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0042611 59253386        medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0042611      1.6e-64 220.2   220.1

adf-ncgr commented 2 years ago

sorry about that, must have been generated before I baked the prefixes into the hmm db; should be fixed now, for whenever legumemine gets built again in future. BTW, I learned recently that "whack-a-mole" could be construed as "guacamole" in French; see https://justinehsmith.substack.com/p/notes-on-the-vibe-shift for more details if interested.
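
For anyone wanting to catch this kind of row before a mine load, here is a rough sketch of a check for family identifiers that are missing their prefix. It is not the datastore validator's actual logic. It assumes the family identifier is the second tab-separated column (as in the row above) and that an identifier made only of digits is one that never got a prefix. The function names and the `hypothetical_prefix` value are made up for illustration.

```python
import gzip
import re

# Assumption: a family identifier consisting only of digits is missing its prefix.
BARE_NUMERIC = re.compile(r"^\d+$")


def find_unprefixed_families(lines):
    """Yield (line_number, gene_id, family_id) for rows whose second
    tab-separated column is purely numeric."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        gene_id, family_id = fields[0], fields[1].strip()
        if BARE_NUMERIC.match(family_id):
            yield number, gene_id, family_id


def check_gfa_file(path):
    """Run the check on a gzipped gfa.tsv file and return the offending rows."""
    with gzip.open(path, "rt") as handle:
        return list(find_unprefixed_families(handle))


if __name__ == "__main__":
    gene = "medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0042611"
    rows = [
        # The row reported in this issue: bare numeric family identifier.
        f"{gene}\t59253386\t{gene}\t1.6e-64\t220.2\t220.1\n",
        # Hypothetical prefixed identifier, for contrast only.
        f"{gene}\thypothetical_prefix.59253386\t{gene}\t1.6e-64\t220.2\t220.1\n",
    ]
    for number, gene_id, family_id in find_unprefixed_families(rows):
        print(f"line {number}: {gene_id} has unprefixed family {family_id}")
```

Calling `check_gfa_file` on a `.gfa.tsv.gz` path returns an empty list when every row carries a non-numeric identifier. A real check would compare against the actual family names rather than just looking for digits.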

adf-ncgr commented 2 years ago

with respect to:

INVALID: README.Col.gnm9.ann10.WZQD.yml is not present in collection Col.gnm9.ann10.WZQD

and several others of this ilk, these are from non-legume outgroup species included in the gene families. I'm not sure I remember why they got split out into Arabidopsis/thaliana/annotations style collections, but they seem to consist only of primary proteins and gfa files. It's possible I constructed the latter via post-gene family build assignment in order to deal with the fact that (as I understand it) when the families were built at most one representative gene from each outgroup species was included, meaning that oftentimes arabidopsis genes will seem to not be present; e.g. if I look in the family fastas in the gene families I find only 12757 arabidopsis records whereas the gfa file contains 24616 assignments. I suppose it would be worth loading these assignments into the mines (to enable queries based on non-legume genes with known functions). Presumably the explanation I've just reminded myself of can serve as the meat (or plant-based protein) of their READMEs, when I get around to it.
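
To list which collections still need a README, something like the following sketch could be used. It assumes the expected file is named `README.<collection directory name>.yml`, which is inferred only from the error message above, and it is not the validator's real code. The function names are hypothetical.

```python
from pathlib import Path


def missing_readme(collection_dir):
    """Return the expected README filename if it is absent from the
    collection directory, or None if it is present.

    Assumption: the README is named README.<collection directory name>.yml.
    """
    collection = Path(collection_dir)
    expected = f"README.{collection.name}.yml"
    if (collection / expected).is_file():
        return None
    return expected


def report_missing_readmes(parent_dir):
    """Check every subdirectory of parent_dir and return (collection, readme)
    pairs for those lacking their README."""
    missing = []
    for child in sorted(Path(parent_dir).iterdir()):
        if child.is_dir():
            readme = missing_readme(child)
            if readme is not None:
                missing.append((child.name, readme))
    return missing
```

Pointing `report_missing_readmes` at an annotations directory would give the list of outgroup collections still waiting on the README text described above.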

sammyjava commented 2 years ago

I'm closing this. I hit invalid collections as I load mines, and I have no plan to deal with this stuff if it's not breaking mine loads.