Closed kimrutherford closed 3 months ago
It is because they are in different files
conserved_multi.txt:SPBC215.05 GPD1,GPD1L
compara_orths.tsv:SPAC23D3.04c HGNC:4455,HGNC:28956
But Compara isn't necessarily the source of these despite the names of the files. They have all been checked and many added or removed over time. I should create a PomBase reference for these.
Basically they were all manually curated (originally), but we used Compara as the source of the identifiers for the 1:1s (because I did not record the human identifier).
However many were incorrect, and I had to check them all.
The different files are now largely nonsense: as you can see, they all have a mixture of 1:1, 1:many and many:many, so we could merge into a single file and drop the Compara reference.
Ideally we would have support from other sources, we can discuss this but it is a bit tricky. Panther would have the best coverage, but for many, there will be no supporting reference from an ortholog predictor. I try to add a reference for these.
I can explain the issue better on call tomorrow....
Thanks for the explanation. I was worried that it might be a bug.
so we could merge into a single file and drop the Compara reference.
I can see one Compara file:
pombe-embl/orthologs/compara_orths.tsv
What are the others?
There is only one "Compara" file (this file has been heavily revised over time)
but there are 2 other files
compara_orths.tsv, conserved_one_to_one.txt, conserved_multi.txt, and now this one: human_orthologs_from_contigs.tsv
I think we could have all human orthologs in a single file. Separating them is largely historical, from how the original files were created. In fact, conserved multi often has 1:1 and conserved one to one often has multi; they have usually been edited in whichever file they ended up in, which was sometimes arbitrary.
A couple of things
I noticed that some of the files use HGNC IDs (I think you asked me about this earlier in the week, but I thought it was for the downloadable files). For maintaining the manual orthologs I would rather use the names. This way a) I get alerted in the logs when a name changes, which is useful because it often means that a human gene has been characterised; b) it is quicker, because I am always using the names and don't need to go and look up the ID; and c) it helps me to see that clusters are sensible, e.g. SPBC83.07 KDM4D,KDM4B,KDM4A,KDM4C. This doesn't always work, but it's a good indicator.
I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs), but it's 4325. Even when I unique on genes it's still 3945?
For maintaining the manual orthologs I would rather use the names.
No problem, I'll change them to names.
I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs), but it's 4325. Even when I unique on genes it's still 3945?
I get 3638 with this:
awk '{print $1}' conserved_*.txt compara_orths.tsv human_orthologs_from_contigs.tsv | grep -v uniquename | sort -u | wc -l
One ortholog isn't loading ("SPAC11E3.01c SRCAP"), so that number matches Chado which has 3637 human orthologs.
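The gap between the row count and the gene count comes from genes that appear on more than one line or in more than one file. A minimal Python sketch of the same count as the awk pipeline above, assuming whitespace-separated rows with the pombe identifier in the first column and a header whose first field is `uniquename` (the function name and sample rows are illustrative only):

```python
def count_unique_genes(rows):
    """Count distinct pombe gene IDs in column 1, skipping header rows."""
    genes = set()
    for row in rows:
        fields = row.split()
        # Skip blank lines and header lines (first field "uniquename").
        if not fields or fields[0] == "uniquename":
            continue
        genes.add(fields[0])
    return len(genes)


# Illustrative rows only: one gene spread over two lines, one gene on one line.
rows = [
    "uniquename\tortholog",
    "SPCC622.09\tHGNC:18730\t2021-08-27",
    "SPCC622.09\tHGNC:4751\t2021-08-27",
    "SPAC6G9.04\tPSD,PSD2,PSD3,PSD4",
]

print(len(rows) - 1, "data rows,", count_unique_genes(rows), "unique genes")
```

Unlike `grep -v uniquename`, this only drops a line when its first field is exactly `uniquename`.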
I see, I was doing the query wrong, but also the new file has
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:18730 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4751 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4747 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4753 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4752 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4755 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4761 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:13954 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4748 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4750 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4758 2021-08-27
can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4
or it will be difficult to maintain (necessary) changes as things are resolved
especially for multis like
SPBC4C3.09 GYG1,GYG2,GYG2P1
SPBC4C3.08 GYG1,GYG2,GYG2P1
SPAC5H10.12c GYG1,GYG2,GYG2P1
It's easier to tell that the annotation is consistent.
fixing !
can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4
What should we do with the references, dates and qualifiers in those cases? In Chado there is a single annotation for each ortholog and each annotation can only have one reference/publication.
I had a quick look. Mostly, all the orthologs for each gene have either no reference or they have the Compara paper (PMID:19029536) as a reference. If we removed the Compara reference that would help things. But there are still a few cases where there are multiple orthologs that have different references or dates. For example:
SPAC25H1.06 RBBP4 PMID:18761674 2009-07-23
SPAC25H1.06 RBBP7 PMID:19029536
In this case only one of the orthologs has a date (the one from the contigs):
SPBC660.15 HGNC:7330,HGNC:2683
SPBC660.15 HGNC:18585 2010-02-17
Would a mixture be OK? With most annotations on one line ("SPAC6G9.04 PSD,PSD2,PSD3,PSD4"), where the reference and dates are the same, and the exceptions split onto multiple lines?
Just so you know, the Compara reference PMID:19029536 is automatically attached to all orthologs from compara_orths.tsv when they are loaded into Chado.
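The "mixture" layout proposed above can be sketched as a grouping step: orthologs of a gene share a line only when their reference and date agree, and anything else falls onto its own line. This is a rough Python sketch, not the loader's actual code; the function name, the tuple layout and the column order (gene, orthologs, reference, date) are assumptions based on the example lines in this thread.

```python
def merge_ortholog_rows(rows):
    """Group (gene, ortholog, reference, date) rows by gene, reference and date."""
    grouped = {}
    for gene, ortholog, reference, date in rows:
        # Orthologs share a line only if gene, reference and date all match.
        grouped.setdefault((gene, reference, date), []).append(ortholog)
    return [
        # Drop trailing empty columns so undated, unreferenced lines stay short.
        "\t".join([gene, ",".join(orthologs), reference, date]).rstrip("\t")
        for (gene, reference, date), orthologs in grouped.items()
    ]


rows = [
    ("SPAC6G9.04", "PSD", "", ""),
    ("SPAC6G9.04", "PSD2", "", ""),
    ("SPAC6G9.04", "PSD3", "", ""),
    ("SPAC6G9.04", "PSD4", "", ""),
    ("SPAC25H1.06", "RBBP4", "PMID:18761674", "2009-07-23"),
    ("SPAC25H1.06", "RBBP7", "PMID:19029536", ""),
]

merged = merge_ortholog_rows(rows)
for line in merged:
    print(line)
```

With these rows, SPAC6G9.04 collapses to one line while SPAC25H1.06 stays on two, because its orthologs carry different references.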
For maintaining the manual orthologs I would rather use the names.
Just to double check, when I moved the orthologs out of the contigs I asked about this and you thought IDs would be better:
IDs. I like names, but IDs makes more sense...
Did I misunderstand?
No, I did a U-turn. When I was thinking about the question I was thinking about the files we would export, and I guess these should be IDs rather than names (which I think they are?). But for curating names is MUCH easier and has other advantages too. (human readability)
Sorry about that, I wasn't thinking about the curation part!
What should we do with the references, dates and qualifiers in those cases?
Hmm, I thought that each gene only had curation in one place. Usually when I update I would always ensure that the existing annotations were still valid, so the date is really a timestamp that applies to all of the orthologs in the set (if there are multiple).
It is likely that any reference would apply to the set too. I can check this is the case. If ever a reference does not apply to all of them, then in these cases I can split the lines.
Does that make sense?
But for curating names is MUCH easier and has other advantages too. (human readability)
OK, no problem. I'll change the files to use names tomorrow.
In the exporting, the cerevisiae file contains IDs (YPL041C etc.) but the human export file contains names.
There's a comment about the cerevisiae file here: https://github.com/pombase/pombase-chado/issues/960#issuecomment-1094707218
Does that make sense?
Yep, that makes sense.
What should we do about the dates, like in this case:
SPBC660.15 HGNC:7330,HGNC:2683
SPBC660.15 HGNC:18585 2010-02-17
Should the date apply to all three ortholog annotations?
In the download, human uses IDs (i.e. HGNC:18585). This is the closest we have to a locus tag/systematic identifier (human genes don't have these).
We use a systematic identifier for pombe and S. cerevisiae, so I think we are as consistent as we can be for downloads.
SPBC660.15 HGNC:7330,HGNC:2683
SPBC660.15 HGNC:18585 2010-02-17
I'm confused about this because it implies the orthologs were curated in different places. I thought I always curated them in the same place so there might be an issue here. I will check this one. Are there others like this?
For this one, SPBC660.15 MSI,MSI2 PANTHER:PTHR48032 2024-08-22 will solve it.
Are there others like this?
Here are all the cases where there is an ortholog with a date and the pombe gene also has an ortholog annotation without a date:
pombe | human | date | file | reference |
---|---|---|---|---|
SPAPB1A10.07c | HGNC:32237,HGNC:18825 | | compara_orths.tsv | |
SPAPB1A10.07c | HGNC:13464 | 2010-09-08 | human_orthologs_from_contigs.tsv | |
SPAPB1A10.07c | HGNC:23231 | 2010-09-08 | human_orthologs_from_contigs.tsv | |
SPAPB1A10.07c | HGNC:11699 | 2010-09-08 | human_orthologs_from_contigs.tsv | |
SPACUNK4.08 | HGNC:3010,HGNC:3590,HGNC:20823 | | compara_orths.tsv | |
SPACUNK4.08 | HGNC:3009 | 2010-08-27 | human_orthologs_from_contigs.tsv | |
SPAC20H4.03c | HGNC:28277,HGNC:11614,HGNC:11615 | | compara_orths.tsv | |
SPAC20H4.03c | HGNC:11612 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC660.15 | HGNC:7330,HGNC:2683 | | compara_orths.tsv | |
SPBC660.15 | HGNC:18585 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC23E6.09 | HGNC:29012,HGNC:12638 | | compara_orths.tsv | |
SPBC23E6.09 | HGNC:12637 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPAC13G6.04 | HGNC:11817 | 2012-03-04 | human_orthologs_from_contigs.tsv | |
SPAC13G6.04 | HGNC:11818 | | compara_orths.tsv | |
SPAC13A11.04c | HGNC:23086 | | compara_orths.tsv | |
SPAC13A11.04c | HGNC:12621 | 2010-10-01 | human_orthologs_from_contigs.tsv | |
SPAC1565.07c | HGNC:30688 | 2010-09-19 | human_orthologs_from_contigs.tsv | |
SPAC1565.07c | HGNC:30689 | | compara_orths.tsv | |
SPAC5D6.13 | HGNC:24882 | | compara_orths.tsv | |
SPAC5D6.13 | HGNC:15452 | 2009-08-17 | human_orthologs_from_contigs.tsv | |
SPAC23C11.17 | HGNC:14648 | | compara_orths.tsv | |
SPAC23C11.17 | HGNC:6556 | 2010-06-14 | human_orthologs_from_contigs.tsv | |
SPAC25H1.06 | HGNC:9890 | | compara_orths.tsv | |
SPAC25H1.06 | HGNC:9887 | 2009-07-23 | human_orthologs_from_contigs.tsv | PMID:18761674 |
SPAC688.03c | HGNC:28658 | | compara_orths.tsv | |
SPAC688.03c | HGNC:467 | 2010-07-18 | human_orthologs_from_contigs.tsv | |
SPAC688.11 | HGNC:18415 | | compara_orths.tsv | |
SPAC688.11 | HGNC:4913 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPAC959.02 | HGNC:15751 | | compara_orths.tsv | |
SPAC959.02 | HGNC:7641 | 2008-04-14 | human_orthologs_from_contigs.tsv | |
SPAC8C9.11 | HGNC:32479 | | compara_orths.tsv | |
SPAC8C9.11 | HGNC:29488 | 2010-07-18 | human_orthologs_from_contigs.tsv | |
SPAC19G12.07c | HGNC:20155 | | compara_orths.tsv | |
SPAC19G12.07c | HGNC:15923 | 2008-06-24 | human_orthologs_from_contigs.tsv | |
SPAC4F10.15c | HGNC:12735 | | compara_orths.tsv | |
SPAC4F10.15c | HGNC:12731 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC800.10c | HGNC:3419 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC800.10c | HGNC:24634 | | compara_orths.tsv | |
SPBC902.04 | HGNC:29243 | | compara_orths.tsv | |
SPBC902.04 | HGNC:20327 | 2009-10-04 | human_orthologs_from_contigs.tsv | |
SPBC354.03 | HGNC:12757 | 2009-08-01 | human_orthologs_from_contigs.tsv | |
SPBC354.03 | HGNC:17826 | | compara_orths.tsv | |
SPBC119.06 | HGNC:10603 | 2011-04-07 | human_orthologs_from_contigs.tsv | |
SPBC119.06 | HGNC:10604 | | compara_orths.tsv | |
SPBC337.13c | HGNC:19901 | | compara_orths.tsv | |
SPBC337.13c | HGNC:16963 | 2008-04-14 | human_orthologs_from_contigs.tsv | |
SPBC19C2.12 | HGNC:14517 | | compara_orths.tsv | |
SPBC19C2.12 | HGNC:14027 | 2021-02-28 | human_orthologs_from_contigs.tsv | |
SPBC28E12.06c | HGNC:20751 | 2010-02-22 | human_orthologs_from_contigs.tsv | |
SPBC28E12.06c | HGNC:29323 | | compara_orths.tsv | |
SPBC32F12.04 | HGNC:12417 | 2012-03-03 | human_orthologs_from_contigs.tsv | |
SPBC32F12.04 | HGNC:12419 | | compara_orths.tsv | |
SPBC1703.14c | HGNC:29787 | | compara_orths.tsv | |
SPBC1703.14c | HGNC:11986 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC19F8.03c | HGNC:15514 | 2010-02-17 | human_orthologs_from_contigs.tsv | |
SPBC19F8.03c | HGNC:14986 | | compara_orths.tsv | |
SPBC2G2.12 | HGNC:4816 | | compara_orths.tsv | |
SPBC2G2.12 | HGNC:4817 | 2011-10-26 | human_orthologs_from_contigs.tsv | |
SPBC1604.21c | HGNC:12471 | | compara_orths.tsv | |
SPBC1604.21c | HGNC:12469 | 2008-04-14 | human_orthologs_from_contigs.tsv | |
SPBC1347.12 | HGNC:167 | 2010-07-02 | human_orthologs_from_contigs.tsv | |
SPBC1347.12 | HGNC:168 | | compara_orths.tsv | |
SPCC1682.07 | HGNC:4656 | 2011-01-13 | human_orthologs_from_contigs.tsv | |
SPCC1682.07 | HGNC:31394 | | compara_orths.tsv | |
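
The check behind the table above (genes that have at least one dated ortholog annotation and at least one undated one) could be sketched roughly as follows. This is a Python illustration under assumed inputs, not the query that was actually run; the function name and the `(gene, ortholog, date)` tuple layout are hypothetical, with an empty string standing for "no date".

```python
def genes_with_mixed_dates(annotations):
    """Return genes having both dated and undated ortholog annotations."""
    dated, undated = set(), set()
    for gene, _ortholog, date in annotations:
        # An empty date string means the annotation has no date.
        (dated if date else undated).add(gene)
    return sorted(dated & undated)


annotations = [
    ("SPBC660.15", "HGNC:7330", ""),
    ("SPBC660.15", "HGNC:2683", ""),
    ("SPBC660.15", "HGNC:18585", "2010-02-17"),
    ("SPCC622.09", "HGNC:18730", "2021-08-27"),
    ("SPCC622.09", "HGNC:4751", "2021-08-27"),
]

print(genes_with_mixed_dates(annotations))
```

Here only SPBC660.15 is reported, since every SPCC622.09 annotation in the sample carries a date.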
We can just delete the dates from those.
No references? I would check if they could apply to all, or find a replacement
No references? I would check if they could apply to all, or find a replacement
Only one has a reference. You may need to scroll the table to the right:
SPAC25H1.06 HGNC:9890 compara_orths.tsv
SPAC25H1.06 HGNC:9887 2009-07-23 human_orthologs_from_contigs.tsv PMID:18761674
There are three genes missed from the previous because there is no date on the annotation:
pombe | human | file |
---|---|---|
SPAPB17E12.11 | HGNC:28880 | compara_orths.tsv |
SPAPB17E12.11 | HGNC:30242 | human_orthologs_from_contigs.tsv |
SPBC106.11c | HGNC:8579 | compara_orths.tsv |
SPBC106.11c | HGNC:9040 | human_orthologs_from_contigs.tsv |
SPBC2D10.12 | HGNC:9813 | compara_orths.tsv |
SPBC2D10.12 | HGNC:9812 | human_orthologs_from_contigs.tsv |
SPAPB17E12.11 MAGT1,TUSC3 PANTHER:PTHR12692 2024-08-22
SPBC106.11c PAFAH2,PLA2G7 HGNC:9040 PANTHER:PTHR10272 2024-08-22
SPBC2D10.12 RAD23S,RAD23B PANTHER:PTHR10621 2024-08-22
I haven't edited the files because I'm not sure what stage you are at. (In all cases the PANTHER reference supports the 1:2 mapping.)
We can remove the reference PMID:18761674 from SPAC25H1.06; they only point out that it is the best hit, so we don't lose important info by removing it, and it makes it easier to keep them aligned.
We can instead use PANTHER:PTHR22850 to support SPAC25H1.06, SPCC1672.10 and SPAC29A4.18.
We can remove the reference PMID:18761674 from SPAC25H1.06
OK, I've done that.
I'm planning to change the HGNC: IDs to names in the files in the orthologs directory today. I'll let you know how that goes.
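A rough sketch of the ID-to-name substitution, for illustration only. It assumes a lookup table from HGNC ID to gene name is available; the two entries below are inferred from the SPAC25H1.06 lines earlier in this thread, and a real run would need the full HGNC table. The function name is hypothetical. Unknown identifiers are left untouched and reported rather than guessed.

```python
def ids_to_names(ortholog_field, id_to_name):
    """Replace HGNC IDs in a comma-separated field with gene names."""
    names, unknown = [], []
    for identifier in ortholog_field.split(","):
        if identifier in id_to_name:
            names.append(id_to_name[identifier])
        else:
            # Keep anything we cannot map and report it for manual checking.
            names.append(identifier)
            unknown.append(identifier)
    return ",".join(names), unknown


# Illustrative two-entry lookup; not a complete HGNC mapping.
id_to_name = {"HGNC:9887": "RBBP4", "HGNC:9890": "RBBP7"}

converted, unknown = ids_to_names("HGNC:9887,HGNC:9890", id_to_name)
print(converted, unknown)
```

Returning the unmapped identifiers separately makes it easy to log them, in the same spirit as being alerted when a name changes.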
In the download, human uses IDs (i.e. HGNC:18585). This is the closest we have to a locus tag/systematic identifier (human genes don't have these).
So we should change the exported file to use IDs instead of names?
The exported file already uses the IDs.
I was looking at the main export file: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/pombase-latest.human-orthologs.txt.gz
We have a file with the IDs too: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/exports/pombe-human-orthologs-with-systematic-ids.txt.gz
Should we change the main export file to use IDs?
I'm looking in here (the directory linked from the ortholog download page). But I see we have both in here. We should decide on Tuesday.
Yep, let's chat about it on Tuesday.
In the new release directory structure we have the file with the gene names: https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/
can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4
I'm doing that now and changing IDs to names at the same time (in human_orthologs_from_contigs.tsv and compara_orths.tsv).
Should I combine the 4 human ortholog files into one while I do that?
Here's how the human ortholog file would look after fixing the names and combining the 4 files: https://curation.pombase.org/kmr44/new_orths_table.tsv
Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?
Looks good. This makes me happy, only one place to go!
Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?
I hadn't thought about that, but I think it would be easier for curation...
I hadn't thought about that, but I think it would be easier for curation...
OK, I'll change that file and combine the human ortholog files (with names substituted for IDs). I'll double check tomorrow that nothing has gone too wrong.
I'll double check tomorrow that nothing has gone too wrong.
I ran it on my desktop and it all worked fine so I think it will be OK on the main site in the morning.
All fixed and released now.
I've moved the old files into pombe-embl/orthologs/old_orthologs.
https://www.pombase.org/gene/SPBC215.05
This is suspicious because gpd2 has the same human orthologs but the references are there: https://www.pombase.org/gene/SPAC23D3.04c