pombase / pombase-chado

PomBase code for accessing Chado
MIT License

GAF export from Chado #28

Closed pombase-admin closed 9 years ago

pombase-admin commented 13 years ago

When we export the GO data, we need to make sure that the synonym column contains only synonyms, unlike the current GAF.

Original comment by: ValWood

pombase-admin commented 12 years ago

I'll do the GAF export first as it's easier and probably more immediately useful. It's also easy to test because we can compare the output to the current GeneDB GAF files.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

Extending this item to cover phenotype annotations.

Send this to Chris M when done.

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: ValWood

pombase-admin commented 12 years ago

GPAD/GPI documentation:

http://www.geneontology.org/GO.format.gpi.shtml
http://www.geneontology.org/GO.format.gpad.shtml

Not sure this documentation is up to date; check before use.

Original comment by: ValWood

pombase-admin commented 12 years ago

We also need to replace some "mapping files" generated from GeneDB. They are all in a simple tab-delimited format.

Mapping files described here: http://www.pombase.org/downloads/data-mapping

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: kimrutherford

pombase-admin commented 12 years ago

Apparently I implemented some of this back in January. I don't remember doing it, but it probably wasn't someone else.

The code as is doesn't write out the annotation extensions, so that still needs doing.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

I've added a separate tracker item for the GeneDB style mapping files: https://sourceforge.net/tracker/?func=detail&aid=3541336&group_id=65526&atid=2096276

Original comment by: kimrutherford

pombase-admin commented 12 years ago

When we do the export we need to also create a ND mapping for when a gene product is missing a particular aspect.

I checked the weekend before I went away that these were all valid (most are for conserved unknowns and sequence orphans, and as far as I am aware there are no outstanding papers for any of these).

The numbers will be quite low (see Venn attached). In fact they will be slightly lower than this, because I managed to squeeze in a few more ISS annotations.

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: ValWood

pombase-admin commented 12 years ago

Only add NDs for protein coding genes. Add an ND annotation for each aspect that has no other annotation.
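The rule above can be sketched as follows. This is a minimal illustration, not the code in pombase-chado: the function names, the `"protein_coding"` feature-type label, and the input shape (gene ID, feature type, set of annotated aspects) are all assumptions made for the example. It relies on the GO convention that ND annotations are made to the root term of the aspect.

```python
# Hypothetical sketch: work out which ND annotations are needed.
# GAF aspect codes: P = biological process, F = molecular function,
# C = cellular component.
ASPECTS = ("P", "F", "C")

# Root term of each aspect; ND annotations point at these.
ROOT_TERMS = {
    "P": "GO:0008150",  # biological_process
    "F": "GO:0003674",  # molecular_function
    "C": "GO:0005575",  # cellular_component
}


def missing_aspects(annotated_aspects):
    """Return the aspects with no annotation at all, in P, F, C order."""
    return [aspect for aspect in ASPECTS if aspect not in annotated_aspects]


def nd_rows(genes):
    """Yield (gene_id, aspect, root_term, "ND") for each ND annotation needed.

    genes is an iterable of (gene_id, feature_type, annotated_aspects).
    The "protein_coding" label is an assumption; use whatever the
    database actually stores for protein coding genes.
    """
    for gene_id, feature_type, annotated_aspects in genes:
        if feature_type != "protein_coding":
            continue  # only protein coding genes get NDs
        for aspect in missing_aspects(annotated_aspects):
            yield (gene_id, aspect, ROOT_TERMS[aspect], "ND")
```

With this shape, a protein coding gene that only has process annotations would get two ND rows (function and component), and a pseudogene would get none.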

Original comment by: kimrutherford

pombase-admin commented 12 years ago

Does it make sense for pseudogenes to have GO annotation?

If so, should the annotation be exported to the GAF file?: http://www.pombase.org/spombe/result/SPAC23D3.05c

Original comment by: kimrutherford

pombase-admin commented 12 years ago

Your question has prompted me to trawl through a fairly old (March 2006) exchange on GO & SO mailing lists, which can be scary. It starts here:

https://mailman.stanford.edu/pipermail/go-discuss/2006-March/001782.html

... goes on and on, and some of it comes out in separate threads in the archive:

https://mailman.stanford.edu/pipermail/go-discuss/2006-March/thread.html

... and it boils down to "it's complicated". But I think some usable simple answers would be ...

> Does it make sense for pseudogenes to have GO annotation?

Probably not; maybe with some rare exceptions, but if we did just say "never" we probably wouldn't lose much.

The exceptions would come up if someone finds that a "pseudogene" is transcribed, and the transcript does something, usually regulatory, e.g. acts as antisense RNA for a "live" copy of the gene. But the email exchange included arguments that if that happens the feature shouldn't be called a pseudogene anyway ... but what if the community expect to see it called a pseudogene ... argh argh aaargh.

That leads me to the second simple answer:

> If so, should the annotation be exported to the GAF file?

No. As long as we call something a pseudogene, we should not export any GO annotations for it to the GAF, even if we want to have the annotations for one of the exceptional circumstances. We don't want to piss them off, and we really don't want to reopen the can of pseudo-worms!

m

Original comment by: mah11

pombase-admin commented 12 years ago

Right so. Thanks for that.

I won't include the pseudogenes in the GAF output.

Unfortunately I've meanwhile found a worse anomaly in Chado - there are lots of annotations without evidence code. I'm trying to track down how that happened.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

This fits with what I would say. At present I delete pseudos before submission. If a pseudo warrants any kind of annotation, we would probably remove the pseudo tag and make another feature, like a non-coding RNA with a regulatory role.

Original comment by: ValWood

pombase-admin commented 12 years ago

The Chado database was missing some dates too: none of the dates from the input GAF files were being stored. That's fixed too and I'm now re-loading. I don't think that's a problem for Ensembl as the dates aren't shown on pombase.org.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

The GAF writing is coming along, there are still some problems to fix but it now puts things like "happens_during(GO:0071276),has_regulation_target(SPBC660.07)" in column 16.
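For illustration, building a column 16 value of that form could look like the sketch below. It assumes the extensions for one annotation are already available as (relation, target) pairs; the function name and that input shape are hypothetical, not taken from the exporter.

```python
def format_extensions(extensions):
    """Join (relation, target) pairs into a GAF column 16 string.

    Each pair is written as relation(target) and the pairs are
    separated by commas. An empty list gives an empty column.
    """
    return ",".join(f"{relation}({target})" for relation, target in extensions)


# The example from the comment above:
column_16 = format_extensions([
    ("happens_during", "GO:0071276"),
    ("has_regulation_target", "SPBC660.07"),
])
```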

You've probably told me before, but do we need to put anything in column 17? This page implies that it's optional: http://www.geneontology.org/GO.format.gaf-2_0.shtml

Original comment by: kimrutherford

pombase-admin commented 12 years ago

> do we need to put anything in column 17?

It's only desirable for the tiny number of annotations where there's a tag like "column_17=PR:000027503;". Even for them, it's optional on GO's end, and the reason to include it is to provide a bit more specificity about what form of a gene product is doing the business. If you do include it, just write out the "PR:000027503" part.
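As a rough sketch of "just write out the PR:000027503 part": assuming the tag arrives as free text of the form `column_17=VALUE;`, the value can be pulled out with a regular expression. The function name and the idea that the tag sits inside a longer qualifier string are assumptions for the example.

```python
import re

# Matches "column_17=" followed by everything up to the next ";".
_COLUMN_17 = re.compile(r"column_17=([^;]+);")


def column_17_value(text):
    """Return the value of a column_17 tag, or "" if there is none."""
    match = _COLUMN_17.search(text)
    return match.group(1).strip() if match else ""
```

Returning an empty string when the tag is absent fits the point that the column is optional: most rows would simply leave it blank.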

There will probably be more column 17 entries in future, but they'll accumulate slowly.

m

Original comment by: mah11

pombase-admin commented 12 years ago

Thanks. It should be easy to add it where there is a "column_17" property in Chado.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

might as well do it then

(and see if it freaks 'em out as much as it did when we put stuff in column 16 ;) )

Original comment by: mah11

pombase-admin commented 12 years ago

It turns out that I had implemented the column 17 stuff months ago and then forgot. So it's already done.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

There's another problem with dates. Some are stored in the GO way "20121022" (from GAF files) and some are stored in the ISO standard way "2012-10-22" (from the curation tool). The GAF files I'm writing have both formats, which the GAF file checker disapproves of. I'll rationalise things.
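One way to rationalise the two formats, sketched here under the assumption that only those two occur: try each in turn and always write the GAF-style `YYYYMMDD` form. The function name is hypothetical.

```python
from datetime import datetime

# Formats seen in the database: GAF style first, then ISO style.
_KNOWN_FORMATS = ("%Y%m%d", "%Y-%m-%d")


def to_gaf_date(value):
    """Convert "20121022" or "2012-10-22" to the GAF form "20121022".

    Raises ValueError for anything that matches neither format, so
    unexpected values are noticed rather than written out unchanged.
    """
    value = value.strip()
    for fmt in _KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")
```

Parsing into a real date (rather than just stripping hyphens) also rejects impossible dates such as month 13.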

Original comment by: kimrutherford

pombase-admin commented 12 years ago

This is mostly done. The GO filter-gene-association.pl script now reports only 20-ish errors, which we are looking at.

I haven't added the ND annotations. I'll do that next.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

The ND rows are now in, with no complaints from filter-gene-association.pl

Original comment by: kimrutherford

pombase-admin commented 12 years ago

You probably know this but column 12 (the feature type) can contain the SO feature ID. val

Original comment by: ValWood

pombase-admin commented 12 years ago

Do you think a SO ID would be better? It's easy enough to change.

Original comment by: kimrutherford

pombase-admin commented 12 years ago

I think so. It is the newer way, and I'm not sure, but I think GPAD expects it, so it would make both the same.

Original comment by: ValWood

pombase-admin commented 12 years ago

I think the GAF part is done, but I haven't started on the FYPO export yet.

Original comment by: kimrutherford

pombase-admin commented 11 years ago

Curators need to spec out format, then this can be raised.

Also probably a separate ticket for GO GPAD/GPI?

Original comment by: ValWood

pombase-admin commented 10 years ago

should have been closed ages ago

Original comment by: mah11