Closed pombase-admin closed 9 years ago
I'll do the GAF export first as it's easier and probably more immediately useful. It's also easy to test because we can compare the output to the current GeneDB GAF files.
Original comment by: kimrutherford
extending this item to cover phenotype annotations
Sent this to Chris M when done
Original comment by: ValWood
GPAD/GPI documentation: http://www.geneontology.org/GO.format.gpi.shtml and http://www.geneontology.org/GO.format.gpad.shtml. I'm not sure this documentation is up to date, so check before use.
Original comment by: ValWood
Also we need to replace some "mapping files" generated from GeneDB. They are all in a simple tab-delimited format.
Mapping files described here: http://www.pombase.org/downloads/data-mapping
Original comment by: ValWood
Original comment by: kimrutherford
Apparently I implemented some of this back in January. I don't remember doing it, but it probably wasn't someone else.
The code as is doesn't write out the annotation extensions, so that still needs doing.
Original comment by: kimrutherford
I've added a separate tracker item for the GeneDB style mapping files: https://sourceforge.net/tracker/?func=detail&aid=3541336&group_id=65526&atid=2096276
Original comment by: kimrutherford
When we do the export we also need to create an ND annotation for each aspect that a gene product is missing.
I checked the weekend before I went away that these were all valid (most are for conserved unknowns and sequence orphans, and as far as I am aware there are no outstanding papers for any of these).
The numbers will be quite low (see Venn attached). In fact the numbers will be slightly lower than this, because I managed to squeeze a few more ISS annotations in.
Original comment by: ValWood
Only add NDs for protein coding genes. Add an ND annotation for each aspect that has no other annotation.
Original comment by: kimrutherford
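A minimal sketch of the ND rule described above, in Python rather than whatever the export code actually uses. The function name, the `"protein_coding"` feature-type string, and the input shapes are hypothetical. It also assumes the usual GO convention that an ND annotation is made to the root term of the missing aspect, which this thread does not spell out.

```python
# Root term for each GAF aspect code (column 9).
ROOT_TERMS = {
    "P": "GO:0008150",  # biological_process
    "F": "GO:0003674",  # molecular_function
    "C": "GO:0005575",  # cellular_component
}


def missing_aspect_nd_rows(genes, annotated_aspects):
    """Return (gene_id, root_term, "ND", aspect) tuples.

    genes: iterable of (gene_id, feature_type) pairs (hypothetical shape).
    annotated_aspects: dict mapping gene_id to the set of aspect codes
        ("P", "F", "C") that already have at least one annotation.
    """
    rows = []
    for gene_id, feature_type in genes:
        # Only protein-coding genes get ND rows; pseudogenes, RNAs etc. are skipped.
        if feature_type != "protein_coding":
            continue
        have = annotated_aspects.get(gene_id, set())
        for aspect, root_term in ROOT_TERMS.items():
            if aspect not in have:
                rows.append((gene_id, root_term, "ND", aspect))
    return rows
```

For a protein-coding gene that only has process annotation this would yield two rows, one for function and one for component. A gene with no annotation at all would get three.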
Does it make sense for pseudogenes to have GO annotation?
If so, should the annotation be exported to the GAF file?: http://www.pombase.org/spombe/result/SPAC23D3.05c
Original comment by: kimrutherford
Your question has prompted me to trawl through a fairly old (March 2006) exchange on GO & SO mailing lists, which can be scary. It starts here:
https://mailman.stanford.edu/pipermail/go-discuss/2006-March/001782.html
... goes on and on, and some of it comes out in separate threads in the archive:
https://mailman.stanford.edu/pipermail/go-discuss/2006-March/thread.html
... and it boils down to "it's complicated". But I think some usable simple answers would be ...
> Does it make sense for pseudogenes to have GO annotation?
Probably not; maybe with some rare exceptions, but if we did just say "never" we probably wouldn't lose much.
The exceptions would come up if someone finds that a "pseudogene" is transcribed, and the transcript does something, usually regulatory, e.g. acts as antisense RNA for a "live" copy of the gene. But the email exchange included arguments that if that happens the feature shouldn't be called a pseudogene anyway ... but what if the community expect to see it called a pseudogene ... argh argh aaargh.
That leads me to the second simple answer:
> If so, should the annotation be exported to the GAF file?
No. As long as we call something a pseudogene, we should not export any GO annotations for it to the GAF, even if we want to have the annotations for one of the exceptional circumstances. We don't want to piss them off, and we really don't want to reopen the can of pseudo-worms!
m
Original comment by: mah11
Right so. Thanks for that.
I won't include the pseudogenes in the GAF output.
Unfortunately I've meanwhile found a worse anomaly in Chado: there are lots of annotations without an evidence code. I'm trying to track down how that happened.
Original comment by: kimrutherford
This fits with what I would say. At present I delete pseudos before submission. If a pseudo warrants any kind of annotation, we would probably remove the pseudo tag and make another feature, like a non-coding RNA with a regulatory role.
Original comment by: ValWood
The Chado database was missing some dates too: none of the dates from the input GAF files were being stored. That's fixed too and I'm now re-loading. I don't think that's a problem for Ensembl as the dates aren't shown on pombase.org.
Original comment by: kimrutherford
The GAF writing is coming along, there are still some problems to fix but it now puts things like "happens_during(GO:0071276),has_regulation_target(SPBC660.07)" in column 16.
You've probably told me before, but do we need to put anything in column 17? This page implies that it's optional: http://www.geneontology.org/GO.format.gaf-2_0.shtml
Original comment by: kimrutherford
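For reference, the column 16 string quoted above could be built roughly like this. This is only a sketch: `format_column_16` and its input shape are hypothetical. It assumes the GAF 2.0 convention that a comma joins extensions that apply together and a pipe separates independent groups.

```python
def format_column_16(groups):
    """Build a GAF column 16 (annotation extension) string.

    groups: list of groups, each a list of (relation, target) pairs.
    Extensions within a group are joined with "," and groups with "|".
    """
    return "|".join(
        ",".join(f"{relation}({target})" for relation, target in group)
        for group in groups
    )


print(format_column_16([
    [("happens_during", "GO:0071276"), ("has_regulation_target", "SPBC660.07")],
]))
```

An annotation with no extensions gives an empty string, which matches leaving the column blank.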
> do we need to put anything in column 17?
It's only desirable for the tiny number of annotations where there's a tag like "column_17=PR:000027503;". Even for them, it's optional on GO's end, and the reason to include it is to provide a bit more specificity about what form of a gene product is doing the business. If you do include it, just write out the "PR:000027503" part.
There will probably be more column 17 entries in future, but they'll accumulate slowly.
m
Original comment by: mah11
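Pulling out just the ID part of a tag like the one above might look like this. The function name is hypothetical, and the sketch assumes the tag always has the `column_17=<ID>;` shape shown in the example, possibly embedded in a longer string.

```python
import re

# Matches "column_17=" followed by everything up to the next ";" (or end of string).
_COLUMN_17 = re.compile(r"column_17=([^;]+)")


def column_17_value(tag_text):
    """Return the bare ID for GAF column 17, or "" if no column_17 tag is present."""
    match = _COLUMN_17.search(tag_text)
    return match.group(1).strip() if match else ""
```

Returning an empty string when there is no tag keeps the column blank, which is fine since column 17 is optional.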
Thanks. It should be easy to add it where there is a "column_17" property in Chado.
Original comment by: kimrutherford
might as well do it then
(and see if it freaks 'em out as much as it did when we put stuff in column 16 ;) )
Original comment by: mah11
It turns out that I had implemented the column 17 stuff months ago and then forgot. So it's already done.
Original comment by: kimrutherford
There's another problem with dates. Some are stored in the GO way "20121022" (from GAF files) and some are stored in the ISO standard way "2012-10-22" (from the curation tool). The GAF files I'm writing have both formats, which the GAF file checker disapproves of. I'll rationalise things.
Original comment by: kimrutherford
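One way to rationalise the two date styles at export time is sketched below. The function name is hypothetical, and it assumes that only the two formats mentioned above occur and that the GAF output wants the compact YYYYMMDD form.

```python
import re
from datetime import date

_COMPACT = re.compile(r"^(\d{4})(\d{2})(\d{2})$")   # e.g. 20121022
_ISO = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")     # e.g. 2012-10-22


def to_gaf_date(value):
    """Normalise a stored date string to the GAF form YYYYMMDD."""
    value = value.strip()
    match = _COMPACT.match(value) or _ISO.match(value)
    if match is None:
        raise ValueError(f"unrecognised date format: {value!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day).strftime("%Y%m%d")
```

Anything that is in neither format raises an error instead of being passed through, so odd values show up before the GAF checker sees them.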
This is mostly done. The GO filter-gene-association.pl script now reports only 20-ish errors, which we are looking at.
I haven't added the ND annotations. I'll do that next.
Original comment by: kimrutherford
The ND rows are now in, with no complaints from filter-gene-association.pl
Original comment by: kimrutherford
You probably know this but column 12 (the feature type) can contain the SO feature ID. val
Original comment by: ValWood
Do you think a SO ID would be better? It's easy enough to change.
Original comment by: kimrutherford
I think so. It is the newer way, and I'm not sure but I think GPAD expects it, so it would make both the same.
Original comment by: ValWood
I think the GAF part is done, but I haven't started on the FYPO export yet.
Original comment by: kimrutherford
Diff:
--- old
+++ new
@@ -1,4 +1,3 @@
-
when we export the GO data we need to make sure that only synonyms are in the synonym column, not like current GAF
Original comment by: ValWood
Curators need to spec out format, then this can be raised.
Also probably a separate ticket for GO GPAD/GPI?
Original comment by: ValWood
Diff:
--- old
+++ new
@@ -1,3 +1,2 @@
-
when we export the GO data we need to make sure that only synonyms are in the synonym column, not like current GAF
Original comment by: mah11
should have been closed ages ago
Original comment by: mah11
when we export the GO data we need to make sure that only synonyms are in the synonym column, not like current GAF
Original comment by: ValWood