Closed ValWood closed 3 years ago
Table2.
Distant orthologs.
https://docs.google.com/spreadsheets/d/1nAnXmBmO1qNFXRV7AnxqfhPAMdHCBwDTxPezsbUOlaM/edit?usp=sharing
There will likely be many more of these without protein families, once I can see which conserved 1:1 genes in PomBase are missing from JaponicusDB. These are some I found which had protein families but were missed by Compara and the Rhind orthologs.
Format
(I have one more file where I am resolving/naming the histones and ribosomal proteins- I should be able to finish that today or tomorrow)
For these all of the GO annotation can be transferred.
Table 3 here is the 3rd file mapping the many to many exact duplicates
https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing
For the purposes of naming you can use
I added 2 extra columns
Also it seemed that
SJAG_00336 | hhf1 and SJAG_03542 | hhf2 in this table do not have orthologs in Compara. I'm not sure why, so we may need another way to get GO annotation onto these.
I'm adding the names and products now because it's quick. I'll sort the synonyms out later.
https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing
This line (row 72) needs a Japonicus ID in the first column:
SPAC3A12.10 rpl20Aa rpl20 60S ribosomal protein L20A SPAC3A12.10
Table2. Distant orthologs. https://docs.google.com/spreadsheets/d/1nAnXmBmO1qNFXRV7AnxqfhPAMdHCBwDTxPezsbUOlaM/edit?usp=sharing
I've added these names, products and orthologs to JaponicusDB. The orthologs are in this new file of manual orthologs: https://github.com/japonicusdb/japonicus-curation/blob/main/manual_pombe_orthologs.tsv
I created some files with lists of gene products and gene names: Table 1. https://docs.google.com/spreadsheets/d/11i1s-mVk-GxxOFuN0MEunz9Zcb970bNap-yUZ9Jz4ew/edit?usp=sharing japonicus_expanded_families_names_products
There are some inconsistencies in this table where genes appear twice but with different products:
SJAG_00240 alcohol dehydrogenase Adh4-like 2:1
SJAG_00240 iron-type alcohol dehydrogenase Adh4-like 2:1
SJAG_00242 Aldo-keto reductase family protein 3:1
SJAG_00242 NADP-dependent oxidoreductase domain many:many
SJAG_00500 RecQ type DNA helicase, telomeric many:many
SJAG_00500 zinc finger, RING-type, Rad16-like 2:1
SJAG_01953 DEAD/DEAH box helicase 0
SJAG_01953 RecQ type DNA helicase, telomeric many:many
SJAG_01968 aspartic peptidase domain superfamily 0
SJAG_01968 sxa2 aspartic protease, S. pombe Sxa1-like many:1
SJAG_02973 methyltransferase type 11 many:many
SJAG_02973 trans-aconitate 3-methyltransferase 1:2
SJAG_03613 RecQ type DNA helicase, telomeric many:many
SJAG_03613 zinc finger, RING-type, Rad16-like 2:1
SJAG_03821 Aldo-keto reductase family protein 3:1
SJAG_03821 NADP-dependent oxidoreductase domain many:many
SJAG_03822 alcohol dehydrogenase Adh4-like 2:1
SJAG_03822 iron-type alcohol dehydrogenase-like 2:1
SJAG_05118 aspartic peptidase domain superfamily 0
SJAG_05118 zinc finger, CCHC-type 0
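A check for this kind of inconsistency can be scripted before loading. The following is only a minimal sketch, assuming the table is exported as a TSV with the systematic ID in the first column and the product in the second. The function name `find_conflicting_products` and the column positions are illustrative assumptions, not the actual JaponicusDB loader.

```python
import csv
import io
from collections import defaultdict


def find_conflicting_products(tsv_text, id_col=0, product_col=1):
    """Return {systematic_id: sorted products} for IDs with more than one distinct product."""
    products = defaultdict(set)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) <= max(id_col, product_col):
            continue  # skip blank or short rows
        gene_id = row[id_col].strip()
        product = row[product_col].strip()
        if gene_id and product:
            products[gene_id].add(product)
    # Identical repeated rows collapse in the set, so only true conflicts remain
    return {g: sorted(p) for g, p in products.items() if len(p) > 1}


example = (
    "SJAG_00240\talcohol dehydrogenase Adh4-like\t2:1\n"
    "SJAG_00240\tiron-type alcohol dehydrogenase Adh4-like\t2:1\n"
    # Placeholder product text: an exact duplicate row is not a conflict
    "SJAG_04188\tplaceholder product\t1:1\n"
    "SJAG_04188\tplaceholder product\t1:1\n"
)
conflicts = find_conflicting_products(example)
```

Here `conflicts` contains only SJAG_00240, with its two differing products. Exact duplicate rows are ignored on purpose, since they are harmless for names and products and can be removed separately.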
Table 3 here is the 3rd file mapping the many to many exact duplicates https://docs.google.com/spreadsheets/d/1ohvDoSBx8AEkX7_COEt0yLCDtKhUPxMphSH6ui52ogk/edit?usp=sharing
There are two genes in this table with inconsistent products:
SJAG_01411 rpp1a rpp1 60S acidic ribosomal protein SPAC644.15
SJAG_01411 rpp1a rpp1 ribosomal protein P1 SPAC644.15
SJAG_04020 rpp1b rpp1 60S acidic ribosomal protein SPBC3B9.13c
SJAG_04020 rpp1b rpp1 ribosomal protein P1 SPBC3B9.13c
Table 3 fixed. Note to self, fix PomBase product to "ribosomal protein P1"
Table 2
There are some inconsistencies in this table where genes appear twice but with different products:
Well that was a bollocks! I did have a problem when I thought an action that deleted an entire row only removed a cell and moved the rows out of kilter, but I went through and checked everything so I have no idea how this happened. Will fix. Hopefully later today, if not Monday or Tuesday.
I have some nifty query checks that I can run to confirm that everything is OK once loaded.
Well that was a bollocks!
:-)
Will fix. Hopefully later today, if not Monday or Tuesday.
Cheers! No rush though. It's Friday night here so I'm finishing up until Monday.
correction:
This line (row 72) needs a Japonicus ID in the first column:
This one is fixed. I also fixed a couple of the other 'standard names' which were blank.
Table 1
There are some inconsistencies in this table where genes appear twice but with different products:
I went through this table and checked everything. I corrected these errors and made a few other changes. Could you re-import the data associated with this table?
Also, I was thinking we already have a mechanism to filter dodgy GO inferences for fission yeast, so we should transfer GO annotations for all pombe orthologs (many/many, many/one, etc.) and then use the same filtering system to filter the ones that should not be transferred. The PomBase ID will be in the with field, so if we had a japonicus version of this file:
pombe-embl/goa-load-fixes/filtered_mappings
I could add the PomBase IDs for which we want to filter the inferred GO annotations and overwrite with manual annotation. Does that make sense? If not, I can explain. It seems silly to have an additional mechanism.
(Also, all of the existing GO annotation filters in filtered_GO_IDs and filtered_mappings should be observed for japonicus.)
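To make the idea concrete, here is a rough sketch of the filtering step. The real format of filtered_mappings is not shown in this thread, so everything here is an assumption: annotations are modelled as dicts with hypothetical keys `gene`, `go_id` and `with`, and the filter is a set of (pombe ID, GO ID) pairs. The annotation values are illustrative only.

```python
def filter_inferred_annotations(annotations, filtered_pairs):
    """Drop inferred annotations whose (pombe ID from the with field, GO ID) pair is filtered."""
    kept = []
    for ann in annotations:
        # Assumed with-field form "PomBase:SPAC644.15"; keep the part after the prefix
        pombe_id = ann["with"].split(":", 1)[-1]
        if (pombe_id, ann["go_id"]) in filtered_pairs:
            continue  # listed in the filter: leave this one to manual annotation
        kept.append(ann)
    return kept


# Illustrative inferred annotations, not real curation data
inferred = [
    {"gene": "SJAG_01411", "go_id": "GO:0003735", "with": "PomBase:SPAC644.15"},
    {"gene": "SJAG_01411", "go_id": "GO:0005840", "with": "PomBase:SPAC644.15"},
]
# Hypothetical filter entry: do not transfer GO:0005840 from SPAC644.15
filtered_pairs = {("SPAC644.15", "GO:0005840")}

kept = filter_inferred_annotations(inferred, filtered_pairs)
```

With this shape, the japonicus filter file would only need the PomBase ID and the GO term, and the same code path could serve both species.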
I found some more proteins without products but with domains. I added these (around 70-80) to table 1.
There are a few more ribosomal name resolutions to add table 2. I will do these after the table 1 data is updated so I can check I don't miss anything else.
There are still some genes in table 1 that appear more than once, mostly with different products:
SJAG_00238 SJAG_02118 SJAG_02131 SJAG_02156 SJAG_02666 (appears three times) SJAG_03826
And in table 3, SJAG_04188 and SJAG_02861 appear twice. But the names and products are the same.
I've done a new load after removing duplicates. The name and product file for loading is here: https://github.com/japonicusdb/japonicus-curation/blob/main/names_and_products.tsv
Bummer! I will never use Google sheets again. Sometimes when you are searching for an ID it replaces an ID in the column the cursor is in. I saw it happen a couple of times, the behaviour is really unintuitive. Once I fix these everything should be correct.
OK, that wasn't the reason. This was me standardizing names across families and forgetting to delete the old lines. These are now all sorted in table 1. Is it OK to edit the tables for a little longer? (It's easier than GitHub and I can capture my columns that aren't loaded into Chado.) Once we have everything we can switch to the GitHub config.
Is it OK to edit the tables for a little longer?
No problem. Let me know when you'd like the TSV file updated.
And in table 3, SJAG_04188 and SJAG_02861 appear twice. But the names and products are the same.
fixed
Also added
SJAG_04808
SJAG_04833
SJAG_00025
SJAG_02944
SJAG_02134
SJAG_02134
SJAG_02134
SJAG_01093
SJAG_04836
SJAG_04799
SJAG_02153
SJAG_02676
SJAG_04797
SJAG_04835
SJAG_04807
to table 1
Should I update the TSV file or are you still working on this?
Working on it right now. I'll let you know if I finish- will hopefully finish tonight. It's been a long day but I have so few left I hope I can get to the end.
After that, we can switch to the config file for edits.
V
OK I finished the updates. I altered tables: japonicus_expanded_families_names_products and japonicus_duplicates
can you check that they parse OK? SJAG_02597 in japonicus_duplicates had no product right now, but I think it might have been due to a space in the systematic ID field. I removed this.
From now on I can move to editing the genes and products file in GitHub. You will need to remind me where this is. Also the new ortholog table.
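Problems like the stray space in the systematic ID field, or a row with a pombe ID where the japonicus ID should be, could be caught with a quick check of the first column before loading. This is a minimal sketch, assuming japonicus systematic IDs always look like `SJAG_` followed by five digits (true of every ID in this thread, but an assumption all the same). The function name `check_first_column` is hypothetical.

```python
import re

# Assumed ID pattern: "SJAG_" plus five digits
SJAG_ID = re.compile(r"SJAG_\d{5}")


def check_first_column(rows):
    """Return (line number, problem, raw value) for rows whose first column is not a clean ID."""
    problems = []
    for line_no, row in enumerate(rows, start=1):
        raw = row[0] if row else ""
        if raw != raw.strip():
            problems.append((line_no, "whitespace around ID", raw))
        elif not SJAG_ID.fullmatch(raw):
            problems.append((line_no, "not a japonicus ID", raw))
    return problems


rows = [
    ["SJAG_01411", "rpp1a"],
    ["SJAG_02597 ", ""],           # trailing space in the ID field
    ["SPAC3A12.10", "rpl20Aa"],    # pombe ID where a japonicus ID is expected
]
problems = check_first_column(rows)
```

For these rows, `problems` flags line 2 for whitespace and line 3 for a non-japonicus ID.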
japonicus_duplicates
There is one inconsistency in that table, same ID but different name:
SJAG_00336 hhf1 histone H4 h4.1
SJAG_00336 hhf2 histone H4 h4.2
fixed to SJAG_03542 | hhf2
it is odd because that is what the site shows currently???
"SJAG_03542 | hhf2" was in the previous version of the TSV file which is what's currently on the site.
I'm about to start a re-load with the new TSV file.
That's done now and the site is reloaded. Let me know if you spot any problems.
"SJAG_03542 | hhf2" was in the previous version of the TSV file which is what's currently on the site.
Yes, I don't know how that got edited....
Anyway, I checked and it all looks good! Only 230 protein products to assign!
Shall we close this one?
Yes!
I created some files with lists of gene products and gene names:
Table 1.
https://docs.google.com/spreadsheets/d/11i1s-mVk-GxxOFuN0MEunz9Zcb970bNap-yUZ9Jz4ew/edit?usp=sharing
japonicus_expanded_families_names_products This has 5 columns
We can change the format if required, as long as there is an editable table somewhere.