Missing references for human orthologs of gpd1

kimrutherford commented 3 months ago

https://www.pombase.org/gene/SPBC215.05

This is suspicious because gpd2 has the same human orthologs but the references are there: https://www.pombase.org/gene/SPAC23D3.04c

ValWood commented 3 months ago

It is because they are in different files

conserved_multi.txt:SPBC215.05 GPD1,GPD1L
compara_orths.tsv:SPAC23D3.04c HGNC:4455,HGNC:28956

But Compara isn't necessarily the source of these despite the names of the files. They have all been checked and many added or removed over time. I should create a PomBase reference for these.

Basically, they were all manually curated originally, but we used Compara as the source of the identifiers for the 1:1s (because I did not record the human identifier).

However, many were incorrect, and I had to check them all.

The different files are now largely nonsense: as you can see, they all have a mixture of 1:1, 1:many and many:many, so we could merge into a single file and drop the Compara reference.

Ideally we would have support from other sources. We can discuss this, but it is a bit tricky. Panther would have the best coverage, but for many there will be no supporting reference from an ortholog predictor. I try to add a reference for these.

I can explain the issue better on the call tomorrow.

kimrutherford commented 3 months ago

Thanks for the explanation. I was worried that it might be a bug.

kimrutherford commented 3 months ago

so we could merge into a single file and drop the Compara reference.

I can see one Compara file:

pombe-embl/orthologs/compara_orths.tsv

What are the others?

ValWood commented 3 months ago

There is only one "Compara" file (this file has been heavily revised over time)

but there are 2 other files

compara_orths.tsv
conserved_one_to_one.txt
conserved_multi.txt

and now this one:

human_orthologs_from_contigs.tsv

I think we could have all human orthologs in a single file. Separating them is largely historical, from how the original files were created. In fact, conserved_multi.txt often has 1:1 and conserved_one_to_one.txt often has multi. They have usually been edited in whichever file they ended up in, which was sometimes arbitrary.

ValWood commented 3 months ago

A couple of things

I noticed that some of the files use HGNC IDs (I think you asked me about this earlier in the week, but I thought it was for the downloadable files). For maintaining the manual orthologs I would rather use the names. This way a) I get alerted in the logs when a name changes, which is useful because it often means that a human gene has been characterised; b) it is quicker, because I am always using the names and don't need to go and look up the ID; and c) it helps me to see that clusters are sensible, e.g. SPBC83.07 KDM4D,KDM4B,KDM4A,KDM4C. This doesn't always work, but it's a good indicator.
I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs), but it's 4325. Even when I unique on genes it's still 3945?

kimrutherford commented 3 months ago

For maintaining the manual orthologs I would rather use the names.

No problem, I'll change them to names.

I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs), but it's 4325. Even when I unique on genes it's still 3945?

I get 3638 with this:

awk '{print $1}' conserved_*.txt compara_orths.tsv human_orthologs_from_contigs.tsv | grep -v uniquename | sort -u | wc -l

One ortholog isn't loading ("SPAC11E3.01c SRCAP"), so that number matches Chado which has 3637 human orthologs.
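The gap between the total row count and the unique gene count is what you would expect when a gene appears on several lines or in more than one file. A rough Python sketch of the same count as the awk pipeline above, for comparison. The function name, the `uniquename` header check and the in-memory sample lines are assumptions for illustration, not part of the PomBase code:

```python
def count_rows_and_genes(lines):
    """Return (data_rows, unique_genes) for whitespace-separated ortholog lines.

    Takes the first column of each line, skips blank lines and header
    lines whose first column contains 'uniquename', then counts the
    distinct first-column values.
    """
    rows = 0
    genes = set()
    for line in lines:
        fields = line.split()
        if not fields or "uniquename" in fields[0]:
            continue
        rows += 1
        genes.add(fields[0])
    return rows, len(genes)


# Illustrative sample only: one header line and three data rows for two genes.
sample = [
    "uniquename\thuman",
    "SPCC622.09\tHGNC:18730\t2021-08-27",
    "SPCC622.09\tHGNC:4751\t2021-08-27",
    "SPBC660.15\tHGNC:7330,HGNC:2683",
]
print(count_rows_and_genes(sample))  # (3, 2)
```

Three data rows but only two genes: the same effect, at a larger scale, would explain a row total well above the number of genes with orthologs.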

ValWood commented 3 months ago

I see, I was doing the query wrong, but also the new file has

human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:18730 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4751 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4747 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4753 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4752 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4755 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4761 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:13954 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4748 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4750 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4758 2021-08-27

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

or it will be difficult to maintain (necessary) changes as things are resolved

especially for multis like

SPBC4C3.09 GYG1,GYG2,GYG2P1
SPBC4C3.08 GYG1,GYG2,GYG2P1
SPAC5H10.12c GYG1,GYG2,GYG2P1

It's easier to tell that the annotation is consistent.

ValWood commented 3 months ago

fixing !

[x] not enough columns at line 1 of line containing SPAC11E3.01c SRCAP - missing TAB character?
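This kind of loader error could be caught before loading with a quick check that every line has at least two TAB-separated columns. A minimal sketch: the two-column minimum is inferred from the error message, and `find_short_lines` is a hypothetical helper, not part of the loader:

```python
def find_short_lines(lines, min_columns=2):
    """Return (line_number, text) for lines with too few TAB-separated columns."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        if len(text.split("\t")) < min_columns:
            problems.append((number, text))
    return problems


# The second line uses a space instead of a TAB, as in the reported error.
sample = ["SPAC6G9.04\tPSD,PSD2,PSD3,PSD4", "SPAC11E3.01c SRCAP"]
print(find_short_lines(sample))  # [(2, 'SPAC11E3.01c SRCAP')]
```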

kimrutherford commented 3 months ago

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

What should we do with the references, dates and qualifiers in those cases? In Chado there is a single annotation for each ortholog and each annotation can only have one reference/publication.

I had a quick look. In most cases, the orthologs for a gene either all have no reference or all have the Compara paper (PMID:19029536) as a reference. If we removed the Compara reference that would help things. But there are still a few cases where there are multiple orthologs that have different references or dates. For example:

SPAC25H1.06   RBBP4    PMID:18761674    2009-07-23
SPAC25H1.06   RBBP7    PMID:19029536

In this case only one of the orthologs has a date (the one from the contigs):

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

Would a mixture be OK? With most annotations on one line ("SPAC6G9.04 PSD,PSD2,PSD3,PSD4"), where the reference and dates are the same, and the exceptions split onto multiple lines?

Just so you know, the Compara reference PMID:19029536 is automatically attached to all orthologs from compara_orths.tsv when they are loaded into Chado.
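The mixed layout proposed above (one line per gene where reference and date agree, exceptions split onto separate lines) amounts to grouping on the gene, reference and date together. A minimal sketch under that assumption; `collapse_orthologs` and the four-field tuple layout are made up for illustration, and the empty reference and date for SPAC6G9.04 are placeholders rather than real values:

```python
def collapse_orthologs(rows):
    """Collapse (gene, human, reference, date) rows into one row per
    (gene, reference, date) group, joining the human genes with commas.

    Rows that differ in reference or date stay on separate lines.
    Input order is preserved (dicts keep insertion order in Python 3.7+).
    """
    groups = {}
    for gene, human, reference, date in rows:
        groups.setdefault((gene, reference, date), []).append(human)
    return [
        (gene, ",".join(humans), reference, date)
        for (gene, reference, date), humans in groups.items()
    ]


rows = [
    ("SPAC6G9.04", "PSD", "", ""),
    ("SPAC6G9.04", "PSD2", "", ""),
    ("SPAC6G9.04", "PSD3", "", ""),
    ("SPAC6G9.04", "PSD4", "", ""),
    ("SPAC25H1.06", "RBBP4", "PMID:18761674", "2009-07-23"),
    ("SPAC25H1.06", "RBBP7", "PMID:19029536", ""),
]
for row in collapse_orthologs(rows):
    print("\t".join(row))
```

With this input, SPAC6G9.04 collapses to a single line, while SPAC25H1.06 stays on two lines because its orthologs have different references and dates.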

kimrutherford commented 3 months ago

For maintaining the manual orthologs I would rather use the names.

Just to double check, when I moved the orthologs out of the contigs I asked about this and you thought IDs would be better:

https://github.com/pombase/pombase-chado/issues/993#issuecomment-2292865500

IDs. I like names, but IDs makes more sense...

Did I misunderstand?

ValWood commented 3 months ago

No, I did a U-turn. When I was thinking about the question I was thinking about the files we would export, and I guess these should be IDs rather than names (which I think they are?). But for curating names is MUCH easier and has other advantages too. (human readability)

ValWood commented 3 months ago

Sorry about that, I wasn't thinking about the curation part!

ValWood commented 3 months ago

What should we do with the references, dates and qualifiers in those cases?

Hmm, I thought that each gene only had curation in one place. Usually when I update, I ensure that the existing annotations are still valid, so the date is really a timestamp that applies to all of the orthologs in the set (if there are multiple).

It is likely that any reference would apply to the set too. I can check this is the case. If ever a reference does not apply to all of them, then in these cases I can split the lines.

Does that make sense?

kimrutherford commented 3 months ago

But for curating names is MUCH easier and has other advantages too. (human readability)

OK, no problem. I'll change the files to use names tomorrow.
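Switching the curated files from IDs to names is essentially a lookup per comma-separated entry. A minimal sketch, assuming an ID-to-name table is available from somewhere such as an HGNC download; `ids_to_names` is a hypothetical helper, and the two-entry lookup below only uses pairs that can be read off the examples in this thread:

```python
def ids_to_names(human_field, id_to_name):
    """Convert a comma-separated list of HGNC IDs to gene names.

    Returns (converted_field, unknown_ids). IDs missing from the lookup
    are left unchanged in the output and reported, so nothing is lost.
    """
    names = []
    unknown = []
    for identifier in human_field.split(","):
        identifier = identifier.strip()
        if identifier in id_to_name:
            names.append(id_to_name[identifier])
        else:
            names.append(identifier)
            unknown.append(identifier)
    return ",".join(names), unknown


# Illustrative lookup only; a real run would need the full ID-to-name table.
lookup = {"HGNC:9887": "RBBP4", "HGNC:9890": "RBBP7"}
print(ids_to_names("HGNC:9887,HGNC:9890", lookup))  # ('RBBP4,RBBP7', [])
```

Reporting the unknown IDs rather than dropping them makes it easy to spot entries that need a manual check.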

In the exports, the cerevisiae file contains IDs (YPL041C etc.) but the human export file contains names.

There's a comment about the cerevisiae file here: https://github.com/pombase/pombase-chado/issues/960#issuecomment-1094707218

kimrutherford commented 3 months ago

Does that make sense?

Yep, that makes sense.

What should we do about the dates, like in this case:

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

Should the date apply to all three ortholog annotations?

ValWood commented 3 months ago

In the download, human uses IDs (i.e. HGNC:18585). This is the closest we have to a locus tag/systematic identifier (human genes don't have these).

We use a systematic identifier for pombe and S. cerevisiae, so I think we are as consistent as we can be for downloads.

ValWood commented 3 months ago

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

I'm confused about this because it implies the orthologs were curated in different places. I thought I always curated them in the same place so there might be an issue here. I will check this one. Are there others like this?

ValWood commented 3 months ago

For this one,

SPBC660.15 MSI,MSI2 PANTHER:PTHR48032 2024-08-22

will solve it.

Are there others like this?

kimrutherford commented 3 months ago

Are there others like this?

Here are all the cases where there is an ortholog with a date and the pombe gene also has an ortholog annotation without a date:

pombe	human	date	file	reference
SPAPB1A10.07c	HGNC:32237,HGNC:18825		compara_orths.tsv
SPAPB1A10.07c	HGNC:13464	2010-09-08	human_orthologs_from_contigs.tsv
SPAPB1A10.07c	HGNC:23231	2010-09-08	human_orthologs_from_contigs.tsv
SPAPB1A10.07c	HGNC:11699	2010-09-08	human_orthologs_from_contigs.tsv
SPACUNK4.08	HGNC:3010,HGNC:3590,HGNC:20823		compara_orths.tsv
SPACUNK4.08	HGNC:3009	2010-08-27	human_orthologs_from_contigs.tsv
SPAC20H4.03c	HGNC:28277,HGNC:11614,HGNC:11615		compara_orths.tsv
SPAC20H4.03c	HGNC:11612	2010-02-17	human_orthologs_from_contigs.tsv
SPBC660.15	HGNC:7330,HGNC:2683		compara_orths.tsv
SPBC660.15	HGNC:18585	2010-02-17	human_orthologs_from_contigs.tsv
SPBC23E6.09	HGNC:29012,HGNC:12638		compara_orths.tsv
SPBC23E6.09	HGNC:12637	2010-02-17	human_orthologs_from_contigs.tsv
SPAC13G6.04	HGNC:11817	2012-03-04	human_orthologs_from_contigs.tsv
SPAC13G6.04	HGNC:11818		compara_orths.tsv
SPAC13A11.04c	HGNC:23086		compara_orths.tsv
SPAC13A11.04c	HGNC:12621	2010-10-01	human_orthologs_from_contigs.tsv
SPAC1565.07c	HGNC:30688	2010-09-19	human_orthologs_from_contigs.tsv
SPAC1565.07c	HGNC:30689		compara_orths.tsv
SPAC5D6.13	HGNC:24882		compara_orths.tsv
SPAC5D6.13	HGNC:15452	2009-08-17	human_orthologs_from_contigs.tsv
SPAC23C11.17	HGNC:14648		compara_orths.tsv
SPAC23C11.17	HGNC:6556	2010-06-14	human_orthologs_from_contigs.tsv
SPAC25H1.06	HGNC:9890		compara_orths.tsv
SPAC25H1.06	HGNC:9887	2009-07-23	human_orthologs_from_contigs.tsv	PMID:18761674
SPAC688.03c	HGNC:28658		compara_orths.tsv
SPAC688.03c	HGNC:467	2010-07-18	human_orthologs_from_contigs.tsv
SPAC688.11	HGNC:18415		compara_orths.tsv
SPAC688.11	HGNC:4913	2010-02-17	human_orthologs_from_contigs.tsv
SPAC959.02	HGNC:15751		compara_orths.tsv
SPAC959.02	HGNC:7641	2008-04-14	human_orthologs_from_contigs.tsv
SPAC8C9.11	HGNC:32479		compara_orths.tsv
SPAC8C9.11	HGNC:29488	2010-07-18	human_orthologs_from_contigs.tsv
SPAC19G12.07c	HGNC:20155		compara_orths.tsv
SPAC19G12.07c	HGNC:15923	2008-06-24	human_orthologs_from_contigs.tsv
SPAC4F10.15c	HGNC:12735		compara_orths.tsv
SPAC4F10.15c	HGNC:12731	2010-02-17	human_orthologs_from_contigs.tsv
SPBC800.10c	HGNC:3419	2010-02-17	human_orthologs_from_contigs.tsv
SPBC800.10c	HGNC:24634		compara_orths.tsv
SPBC902.04	HGNC:29243		compara_orths.tsv
SPBC902.04	HGNC:20327	2009-10-04	human_orthologs_from_contigs.tsv
SPBC354.03	HGNC:12757	2009-08-01	human_orthologs_from_contigs.tsv
SPBC354.03	HGNC:17826		compara_orths.tsv
SPBC119.06	HGNC:10603	2011-04-07	human_orthologs_from_contigs.tsv
SPBC119.06	HGNC:10604		compara_orths.tsv
SPBC337.13c	HGNC:19901		compara_orths.tsv
SPBC337.13c	HGNC:16963	2008-04-14	human_orthologs_from_contigs.tsv
SPBC19C2.12	HGNC:14517		compara_orths.tsv
SPBC19C2.12	HGNC:14027	2021-02-28	human_orthologs_from_contigs.tsv
SPBC28E12.06c	HGNC:20751	2010-02-22	human_orthologs_from_contigs.tsv
SPBC28E12.06c	HGNC:29323		compara_orths.tsv
SPBC32F12.04	HGNC:12417	2012-03-03	human_orthologs_from_contigs.tsv
SPBC32F12.04	HGNC:12419		compara_orths.tsv
SPBC1703.14c	HGNC:29787		compara_orths.tsv
SPBC1703.14c	HGNC:11986	2010-02-17	human_orthologs_from_contigs.tsv
SPBC19F8.03c	HGNC:15514	2010-02-17	human_orthologs_from_contigs.tsv
SPBC19F8.03c	HGNC:14986		compara_orths.tsv
SPBC2G2.12	HGNC:4816		compara_orths.tsv
SPBC2G2.12	HGNC:4817	2011-10-26	human_orthologs_from_contigs.tsv
SPBC1604.21c	HGNC:12471		compara_orths.tsv
SPBC1604.21c	HGNC:12469	2008-04-14	human_orthologs_from_contigs.tsv
SPBC1347.12	HGNC:167	2010-07-02	human_orthologs_from_contigs.tsv
SPBC1347.12	HGNC:168		compara_orths.tsv
SPCC1682.07	HGNC:4656	2011-01-13	human_orthologs_from_contigs.tsv
SPCC1682.07	HGNC:31394		compara_orths.tsv

inconsistent_dates.tsv.txt
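For reference, the selection behind the table above (genes with at least one dated and at least one undated ortholog annotation) can be sketched as a set intersection. This is only an illustration with a hypothetical function name and tuple layout, not the query that was actually run:

```python
def genes_with_mixed_dates(rows):
    """Return the sorted genes that have both dated and undated annotations.

    rows: iterable of (gene, human, date), with date == "" when missing.
    """
    dated = set()
    undated = set()
    for gene, _human, date in rows:
        if date:
            dated.add(gene)
        else:
            undated.add(gene)
    return sorted(dated & undated)


rows = [
    ("SPBC660.15", "HGNC:7330,HGNC:2683", ""),
    ("SPBC660.15", "HGNC:18585", "2010-02-17"),
    ("SPCC622.09", "HGNC:18730", "2021-08-27"),
]
print(genes_with_mixed_dates(rows))  # ['SPBC660.15']
```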

ValWood commented 3 months ago

We can just delete the dates from those.

ValWood commented 3 months ago

No references? I would check if they could apply to all, or find a replacement.

kimrutherford commented 3 months ago

No references? I would check if they could apply to all, or find a replacement.

Only one has a reference. You may need to scroll the table to the right:

SPAC25H1.06     HGNC:9890       compara_orths.tsv   
SPAC25H1.06     HGNC:9887   2009-07-23  human_orthologs_from_contigs.tsv    PMID:18761674

There are three genes missing from the previous table because neither annotation has a date:

pombe	human	file
SPAPB17E12.11	HGNC:28880	compara_orths.tsv
SPAPB17E12.11	HGNC:30242	human_orthologs_from_contigs.tsv
SPBC106.11c	HGNC:8579	compara_orths.tsv
SPBC106.11c	HGNC:9040	human_orthologs_from_contigs.tsv
SPBC2D10.12	HGNC:9813	compara_orths.tsv
SPBC2D10.12	HGNC:9812	human_orthologs_from_contigs.tsv

ValWood commented 3 months ago

SPAPB17E12.11 MAGT1,TUSC3 PANTHER:PTHR12692 2024-08-22
SPBC106.11c PAFAH2,PLA2G7 HGNC:9040 PANTHER:PTHR10272 2024-08-22
SPBC2D10.12 RAD23A,RAD23B PANTHER:PTHR10621 2024-08-22

I haven't edited the files because I'm not sure what stage you are at. (In all cases the PANTHER reference supports the 1:2 mapping.)

ValWood commented 3 months ago

We can remove the reference PMID:18761674 from SPAC25H1.06. They only point out that it is the best hit, so we don't lose important info by removing it, and it makes it easier to keep them aligned.

We can instead use PANTHER:PTHR22850 to support SPAC25H1.06, SPCC1672.10 and SPAC29A4.18.

kimrutherford commented 3 months ago

We can remove the reference PMID:18761674 from SPAC25H1.06

OK, I've done that.

I'm planning to change the HGNC: IDs to names in the files in the orthologs directory today. I'll let you know how that goes.

In the download, human uses IDs (i.e. HGNC:18585). This is the closest we have to a locus tag/systematic identifier (human genes don't have these).

So we should change the exported file to use IDs instead of names?

ValWood commented 3 months ago

The exported file already uses the IDs.

kimrutherford commented 3 months ago

I was looking at the main export file: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/pombase-latest.human-orthologs.txt.gz

We have a file with the IDs too: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/exports/pombe-human-orthologs-with-systematic-ids.txt.gz

Should we change the main export file to use IDs?

ValWood commented 3 months ago

I'm looking in here (the directory linked from the ortholog download page). But I see we have both in here. We should decide on Tuesday.

kimrutherford commented 3 months ago

Yep, let's chat about it on Tuesday.

In the new release directory structure we have the file with the gene names: https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/

kimrutherford commented 3 months ago

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

I'm doing that now and changing IDs to names at the same time (in human_orthologs_from_contigs.tsv and compara_orths.tsv).

Should I combine the 4 human ortholog files into one while I do that?

kimrutherford commented 3 months ago

Here's how the human ortholog file would look after fixing the names and combining the 4 files: https://curation.pombase.org/kmr44/new_orths_table.tsv

kimrutherford commented 3 months ago

Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?

ValWood commented 3 months ago

Looks good. This makes me happy, only one place to go!

ValWood commented 3 months ago

Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?

I hadn't thought about that, but I think it would be easier for curation...

kimrutherford commented 3 months ago

I hadn't thought about that, but I think it would be easier for curation...

OK, I'll change that file and combine the human ortholog files (with names substituted for IDs). I'll double check tomorrow that nothing has gone too wrong.

kimrutherford commented 3 months ago

I'll double check tomorrow that nothing has gone too wrong.

I ran it on my desktop and it all worked fine so I think it will be OK on the main site in the morning.

kimrutherford commented 3 months ago

All fixed and released now.

I've moved the old files into pombe-embl/orthologs/old_orthologs.

pombase / pombase-chado

Missing references for human orthologs of gpd1 #1200