japonicusdb / japonicus-config

Configuration for JaponicusDB

more gene products and orthologs for japonicus #27

Closed ValWood closed 3 years ago

ValWood commented 3 years ago

I created some files with lists of gene products and gene names:

Table 1.

https://docs.google.com/spreadsheets/d/11i1s-mVk-GxxOFuN0MEunz9Zcb970bNap-yUZ9Jz4ew/edit?usp=sharing

japonicus_expanded_families_names_products This has 5 columns

  1. systematic_id
  2. primary_name (if I could assign)
  3. product
  4. conservation_note ( Kim you don't need to do anything with this, it was for my own purposes so I could see the different biology, Snezka will probably be interested in this detail)
  5. GO transfer. Most of the GO annotation can be transferred from the orthologs, even if there are multiple paralogs, because the annotation on these is identical (but the redundancy will be removed by the filtering pipeline*). Most of these gene products are not well studied because they aren't 'core genes'. For cases where the GO annotations cannot be safely transferred because they are too specific to apply to all, I put NO in the GO transfer column.

We can change the format if required, as long as there is an editable table somewhere.

ValWood commented 3 years ago

Table2.

Distant orthologs.

https://docs.google.com/spreadsheets/d/1nAnXmBmO1qNFXRV7AnxqfhPAMdHCBwDTxPezsbUOlaM/edit?usp=sharing

There will likely be many more of these without protein families once I can see which conserved 1:1 genes in PomBase are missing from JaponicusDB, but these are some I found which had protein families but were missed by Compara and the Rhind orthologs.

Format

  1. Systematic ID
  2. Primary name
  3. Product
  4. Fission yeast ortholog

(I have one more file where I am resolving/naming the histones and ribosomal proteins- I should be able to finish that today or tomorrow)

For these all of the GO annotation can be transferred.

ValWood commented 3 years ago

Table 3 here is the 3rd file mapping the many to many exact duplicates

https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing

For the purposes of naming you can use

  1. systematic_ID
  2. primary_name
  3. synonym
  4. product

I added 2 extra columns

  5. syntenic with: in case we ever want to build 'synteny blocks'
  6. synteny_note: this is where the duplicates were at a synteny breakpoint. I thought this was interesting because they seemed to be overrepresented at these 'exact duplicate' features (there is supposedly some mechanism to maintain these high copy number proteins in duplicate, which might be linked in some way to the synteny disruption (translocations)).

ValWood commented 3 years ago

Also it seemed that

SJAG_00336 | hhf1 SJAG_03542 | hhf2 in this table do not have orthologs in Compara, I'm not sure why - so may need another way to get GO annotation onto these.

kimrutherford commented 3 years ago

I'm adding the names and products now because it's quick. I'll sort the synonyms out later.

https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing

This line (row 72) needs a Japonicus ID in the first column:

SPAC3A12.10   rpl20Aa   rpl20    60S ribosomal protein L20A    SPAC3A12.10
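
A row like this can be caught before loading with a simple check on the first column. The following is a minimal sketch, not the loader's actual validation: the `SJAG_` plus five digits pattern is an assumption inferred from the IDs quoted in this thread, `find_bad_ids` is a hypothetical helper, and the third example row is made up to show that stray whitespace is flagged too.

```python
import re

# Assumed ID pattern, inferred from the SJAG IDs quoted in this thread.
SJAG_ID = re.compile(r"SJAG_\d{5}")


def find_bad_ids(rows):
    """Return (row_number, raw_id) pairs whose first column is not a clean SJAG ID.

    Row numbers are 1-based and count every row passed in.
    """
    bad = []
    for row_number, row in enumerate(rows, start=1):
        raw_id = row[0] if row else ""
        # fullmatch rejects pombe IDs, blanks, and IDs with stray whitespace.
        if not SJAG_ID.fullmatch(raw_id):
            bad.append((row_number, raw_id))
    return bad


# Illustrative rows only; the last one is hypothetical (trailing space in the ID).
rows = [
    ["SJAG_01411", "rpp1a", "rpp1", "ribosomal protein P1", "SPAC644.15"],
    ["SPAC3A12.10", "rpl20Aa", "rpl20", "60S ribosomal protein L20A", "SPAC3A12.10"],
    ["SJAG_01411 ", "rpp1a", "rpp1", "ribosomal protein P1", "SPAC644.15"],
]
for row_number, raw_id in find_bad_ids(rows):
    print(f"row {row_number}: unexpected ID {raw_id!r}")
```
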
kimrutherford commented 3 years ago

Table2. Distant orthologs. https://docs.google.com/spreadsheets/d/1nAnXmBmO1qNFXRV7AnxqfhPAMdHCBwDTxPezsbUOlaM/edit?usp=sharing

I've added these names, products and orthologs to JaponicusDB. The orthologs are in this new file of manual orthologs: https://github.com/japonicusdb/japonicus-curation/blob/main/manual_pombe_orthologs.tsv

kimrutherford commented 3 years ago

I created some files with lists of gene products and gene names: Table 1. https://docs.google.com/spreadsheets/d/11i1s-mVk-GxxOFuN0MEunz9Zcb970bNap-yUZ9Jz4ew/edit?usp=sharing japonicus_expanded_families_names_products

There are some inconsistencies in this table where genes appear twice but with different products:

SJAG_00240        alcohol dehydrogenase Adh4-like            2:1
SJAG_00240        iron-type alcohol dehydrogenase Adh4-like  2:1
SJAG_00242        Aldo-keto reductase family protein         3:1
SJAG_00242        NADP-dependent oxidoreductase domain       many:many
SJAG_00500        RecQ type DNA helicase, telomeric          many:many
SJAG_00500        zinc finger, RING-type, Rad16-like         2:1
SJAG_01953        DEAD/DEAH box helicase                     0
SJAG_01953        RecQ type DNA helicase, telomeric          many:many
SJAG_01968        aspartic peptidase domain superfamily      0
SJAG_01968  sxa2  aspartic protease, S. pombe Sxa1-like      many:1
SJAG_02973        methyltransferase type 11                  many:many
SJAG_02973        trans-aconitate 3-methyltransferase        1:2
SJAG_03613        RecQ type DNA helicase, telomeric          many:many
SJAG_03613        zinc finger, RING-type, Rad16-like         2:1
SJAG_03821        Aldo-keto reductase family protein         3:1
SJAG_03821        NADP-dependent oxidoreductase domain       many:many
SJAG_03822        alcohol dehydrogenase Adh4-like            2:1
SJAG_03822        iron-type alcohol dehydrogenase-like       2:1
SJAG_05118        aspartic peptidase domain superfamily      0
SJAG_05118        zinc finger, CCHC-type                     0
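
A list like the one above can be produced by grouping rows on the systematic ID and reporting any ID with more than one distinct product. This is only a sketch under assumptions: `conflicting_values` is a hypothetical helper, and the column positions (ID first, product third) follow the Table 1 layout described earlier in the thread rather than any confirmed file format.

```python
from collections import defaultdict


def conflicting_values(rows, key_index=0, value_index=2):
    """Map each ID to its sorted distinct values when it has more than one."""
    seen = defaultdict(set)
    for row in rows:
        if len(row) <= max(key_index, value_index):
            continue  # skip blank or truncated rows
        seen[row[key_index].strip()].add(row[value_index].strip())
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# A few rows taken from the listing above: ID, name, product, relationship.
rows = [
    ["SJAG_00240", "", "alcohol dehydrogenase Adh4-like", "2:1"],
    ["SJAG_00240", "", "iron-type alcohol dehydrogenase Adh4-like", "2:1"],
    ["SJAG_02973", "", "trans-aconitate 3-methyltransferase", "1:2"],
]
for gene_id, products in conflicting_values(rows).items():
    print(gene_id, products)
```

Passing `value_index=1` would apply the same check to the name column, for cases where one ID carries two different names.
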
kimrutherford commented 3 years ago

Table 3 here is the 3rd file mapping the many to many exact duplicates https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing

There are two genes in this table with inconsistent products:

SJAG_01411  rpp1a   rpp1    60S acidic ribosomal protein    SPAC644.15
SJAG_01411  rpp1a   rpp1    ribosomal protein P1    SPAC644.15
SJAG_04020  rpp1b   rpp1    60S acidic ribosomal protein    SPBC3B9.13c
SJAG_04020  rpp1b   rpp1    ribosomal protein P1    SPBC3B9.13c
ValWood commented 3 years ago

Table 3 fixed. Note to self, fix PomBase product to "ribosomal protein P1"

ValWood commented 3 years ago

Table 2

There are some inconsistencies in this table where genes appear twice but with different products:

Well that was a bollocks! I did have a problem when I thought an action that deleted an entire row only removed a cell and moved the rows out of kilter, but I went through and checked everything so I have no idea how this happened. Will fix. Hopefully later today, if not Monday or Tuesday.

I have some nifty query checks that I can run to check that they are all OK once loaded.

kimrutherford commented 3 years ago

Well that was a bollocks!

:-)

Will fix. Hopefully later today, if not Monday or Tuesday.

Cheers! No rush though. It's Friday night here so I'm finishing up until Monday.

ValWood commented 3 years ago

correction:

This line (row 72) needs a Japonicus ID in the first column:

This one is fixed. I also fixed a couple of the other 'standard names' which were blank.

ValWood commented 3 years ago

Table 1

There are some inconsistencies in this table where genes appear twice but with different products:

I went through this table and checked everything. I corrected these errors and made a few other changes. Could you re-import the data associated with this table?

Also, I was thinking we already have a mechanism to filter dodgy GO inferences for fission yeast, so we should transfer GO annotations for all pombe orthologs (many/many, many/one, etc.) and then use the same filtering system to filter the ones that should not be transferred. The PomBase ID will be in the with field, so if we had a japonicus version of this file:

pombe-embl/goa-load-fixes/filtered_mappings

I could add the PomBase IDs for which we want to filter the inferred GO annotations and overwrite with manual annotation. Does that make sense? If not I can explain. It seems silly to have an additional mechanism.

(Also, all of the existing GO annotation filters in filtered_GO_IDs and filtered_mappings should be observed for japonicus.)
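
The idea of transferring everything and then filtering on the with field could look roughly like the sketch below. It is an illustration only: the real format of filtered_mappings is not shown in this thread, so the (with ID, GO ID) pair format, the `Annotation` record, the `apply_filter` helper, the `PomBase:` prefix and the placeholder GO IDs are all assumptions.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene_id: str  # japonicus systematic ID
    go_id: str    # GO term ID
    with_id: str  # pombe ortholog the annotation was inferred from


def apply_filter(annotations, filtered_pairs):
    """Drop inferred annotations whose (with_id, go_id) pair is in the filter set."""
    return [a for a in annotations if (a.with_id, a.go_id) not in filtered_pairs]


# Illustrative data only: the GO IDs are placeholders, not real curation decisions.
annotations = [
    Annotation("SJAG_01411", "GO:0000001", "PomBase:SPAC644.15"),
    Annotation("SJAG_04020", "GO:0000002", "PomBase:SPBC3B9.13c"),
]
# Hypothetical filter entries, standing in for a japonicus filtered_mappings file.
filtered_pairs = {("PomBase:SPBC3B9.13c", "GO:0000002")}

kept = apply_filter(annotations, filtered_pairs)
print([a.gene_id for a in kept])
```

Keying the filter on the with field means one list can serve both the transfer step and the later manual overrides, which is the point of reusing the existing mechanism.
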

ValWood commented 3 years ago

I found some more proteins without products but with domains. I added these (around 70-80) to table 1.

There are a few more ribosomal name resolutions to add to table 2. I will do these after the table 1 data is updated so I can check I don't miss anything else.

kimrutherford commented 3 years ago

There are still some genes in table 1 that appear more than once, mostly with different products:

SJAG_00238 SJAG_02118 SJAG_02131 SJAG_02156 SJAG_02666 (appears three times) SJAG_03826

And in table 3, SJAG_04188 and SJAG_02861 appear twice. But the names and products are the same.

I've done a new load after removing duplicates. The name and product file for loading is here: https://github.com/japonicusdb/japonicus-curation/blob/main/names_and_products.tsv

ValWood commented 3 years ago

Bummer! I will never use Google sheets again. Sometimes when you are searching for an ID it replaces an ID in the column the cursor is in. I saw it happen a couple of times, the behaviour is really unintuitive. Once I fix these everything should be correct.

ValWood commented 3 years ago

OK, that wasn't the reason. This was me standardizing names across families and forgetting to delete the old lines. These are now all sorted in table 1. Is it OK to edit the tables for a little longer? (It's easier than GitHub and I can capture my columns that aren't loaded into Chado.) Once we have everything we can switch to the GitHub config.

kimrutherford commented 3 years ago

Is it OK to edit the tables for a little longer?

No problem. Let me know when you'd like the TSV file updated.

ValWood commented 3 years ago

And in table 3, SJAG_04188 and SJAG_02861 appear twice. But the names and products are the same.

fixed

Also added SJAG_04808 SJAG_04833 SJAG_00025 SJAG_02944 SJAG_02134 SJAG_02134 SJAG_02134 SJAG_01093 SJAG_04836 SJAG_04799 SJAG_02153 SJAG_02676 SJAG_04797 SJAG_04835 SJAG_04807 to table 1

kimrutherford commented 3 years ago

Should I update the TSV file or are you still working on this?

ValWood commented 3 years ago

Working on it right now. I'll let you know if I finish- will hopefully finish tonight. It's been a long day but I have so few left I hope I can get to the end.

After that, we can switch to the config file for edits.

V

ValWood commented 3 years ago

OK I finished the updates. I altered tables: japonicus_expanded_families_names_products and japonicus_duplicates

Can you check that they parse OK? SJAG_02597 in japonicus_duplicates has no product right now, but I think it might have been due to a space in the systematic ID field. I removed this.

From now on I can move to editing the genes and products file in GitHub. You will need to remind me where this is. Also the new ortholog table.

kimrutherford commented 3 years ago

japonicus_duplicates

There is one inconsistency in that table, same ID but different name:

 SJAG_00336     hhf1    histone H4 h4.1
 SJAG_00336     hhf2    histone H4 h4.2
ValWood commented 3 years ago

fixed to SJAG_03542 | hhf2

it is odd because that is what the site shows currently???

kimrutherford commented 3 years ago

it is odd because that is what the site shows currently???

"SJAG_03542 | hhf2" was in the previous version of the TSV file which is what's currently on the site.

I'm about to start a re-load with the new TSV file.

kimrutherford commented 3 years ago

I'm about to start a re-load with the new TSV file.

That's done now and the site is reloaded. Let me know if you spot any problems.

ValWood commented 3 years ago

"SJAG_03542 | hhf2" was in the previous version of the TSV file which is what's currently on the site.

Yes, I don't know how that got edited....

Anyway, I checked and it all looks good! Only 230 protein products to assign!

kimrutherford commented 3 years ago

Shall we close this one?

ValWood commented 3 years ago

Yes!