pombase / pombase-chado

PomBase code for accessing Chado

Missing references for human orthologs of gpd1 #1200

Closed kimrutherford closed 3 months ago

kimrutherford commented 3 months ago

https://www.pombase.org/gene/SPBC215.05

This is suspicious because gpd2 has the same human orthologs but the references are there: https://www.pombase.org/gene/SPAC23D3.04c

ValWood commented 3 months ago

It is because they are in different files

conserved_multi.txt:SPBC215.05 GPD1,GPD1L
compara_orths.tsv:SPAC23D3.04c HGNC:4455,HGNC:28956

But Compara isn't necessarily the source of these despite the names of the files. They have all been checked and many added or removed over time. I should create a PomBase reference for these.

Basically they were all manually curated originally, but we used Compara as the source of the identifiers for the 1:1s (because I did not record the human identifier).

However many were incorrect, and I had to check them all.

The different files are now largely nonsense: as you can see, they all have a mixture of 1:1, 1:many and many:many, so we could merge them into a single file and drop the Compara reference.

Ideally we would have support from other sources, we can discuss this but it is a bit tricky. Panther would have the best coverage, but for many, there will be no supporting reference from an ortholog predictor. I try to add a reference for these.

I can explain the issue better on call tomorrow....

kimrutherford commented 3 months ago

Thanks for the explanation. I was worried that it might be a bug.

kimrutherford commented 3 months ago

so we could merge into a single file and drop the Compara reference.

I can see one Compara file:

What are the others?

ValWood commented 3 months ago

There is only one "Compara" file (this file has been heavily revised over time)

but there are 2 other files

compara_orths.tsv, conserved_one_to_one.txt, conserved_multi.txt and now this one: human_orthologs_from_contigs.tsv

I think we could have all human orthologs in a single file. Separating them is largely historical, from how the original files were created. In fact, conserved_multi often has 1:1 and conserved_one_to_one often has multi; they have usually been edited in the file they ended up in, which was sometimes arbitrary.

ValWood commented 3 months ago

A couple of things

  1. I noticed that some of the files use HGNC IDs (I think you asked me about this earlier in the week, but I thought it was for the downloadable files). For maintaining the manual orthologs I would rather use the names, because: a) I get alerted in the logs when a name changes, which is useful because it often means that a human gene has been characterised; b) it is quicker, because I always use the names and don't need to look up the ID; and c) it helps me to see that clusters are sensible, e.g. SPBC83.07 KDM4D,KDM4B,KDM4A,KDM4C. This doesn't always work, but it's a good indicator.

  2. I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs), but it's 4325. Even when I unique on genes it's still 3945?

kimrutherford commented 3 months ago

For maintaining the manual orthologs I would rather use the names.

No problem, I'll change them to names.

I am confused because I expected the sum of these 4 files to be 3639 rows (i.e. the number of pombe genes with human orthologs) , but it's 4325. Even when I unique on genes it's still 3945?

I get 3638 with this:

awk '{print $1}' conserved_*.txt compara_orths.tsv human_orthologs_from_contigs.tsv | grep -v uniquename | sort -u | wc -l

One ortholog isn't loading ("SPAC11E3.01c SRCAP"), so that number matches Chado which has 3637 human orthologs.
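
For reference, the same count can be sketched in Python. This is only an illustration, not the script PomBase uses: the function name `count_unique_genes` is hypothetical, and it assumes whitespace-separated files whose first column is the pombe systematic ID and whose header rows have `uniquename` in that column, as the `grep -v uniquename` step above suggests. Unlike the shell pipeline, it ignores blank lines.

```python
from typing import Iterable


def count_unique_genes(lines: Iterable[str]) -> int:
    """Count distinct first-column values, skipping header and blank lines."""
    genes = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line: nothing to count
        # Mirror `awk '{print $1}' | grep -v uniquename`: drop header rows.
        if "uniquename" in fields[0]:
            continue
        genes.add(fields[0])
    return len(genes)
```

Passing the concatenated lines of all four ortholog files gives the number of distinct pombe genes, regardless of how many rows each gene occupies.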

ValWood commented 3 months ago

I see, I was doing the query wrong, but also the new file has

human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:18730 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4751 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4747 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4753 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4752 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4755 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4761 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:13954 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4748 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4750 2021-08-27
human_orthologs_from_contigs.tsv:SPCC622.09 HGNC:4758 2021-08-27

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

or it will be difficult to maintain (necessary) changes as things are resolved

especially for multis like

SPBC4C3.09 GYG1,GYG2,GYG2P1
SPBC4C3.08 GYG1,GYG2,GYG2P1
SPAC5H10.12c GYG1,GYG2,GYG2P1

it's easier to tell that the annotation is consistent

ValWood commented 3 months ago

fixing !

kimrutherford commented 3 months ago

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

What should we do with the references, dates and qualifiers in those cases? In Chado there is a single annotation for each ortholog and each annotation can only have one reference/publication.

I had a quick look. For most genes, all the orthologs have either no reference or the Compara paper (PMID:19029536) as a reference. If we removed the Compara reference that would help things. But there are still a few cases where there are multiple orthologs that have different references or dates. For example:

SPAC25H1.06   RBBP4    PMID:18761674    2009-07-23
SPAC25H1.06   RBBP7    PMID:19029536

In this case only one of the orthologs has a date (the one from the contigs):

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

Would a mixture be OK? With most annotations on one line ("SPAC6G9.04 PSD,PSD2,PSD3,PSD4"), where the reference and dates are the same, and the exceptions split onto multiple lines?

Just so you know, the Compara reference PMID:19029536 is automatically attached to all orthologs from compara_orths.tsv when they are loaded into Chado.
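
The mixed layout proposed above (one line per gene where the reference and date agree, with exceptions split onto their own lines) could be produced along these lines. This is a rough sketch under stated assumptions, not the code in pombase-chado: the `Annotation` tuple layout and the function name `group_orthologs` are hypothetical, and the output column order (gene, names, reference, date) simply follows the example rows above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One annotation per ortholog: (pombe_id, human_name, reference, date).
# Empty strings stand for a missing reference or date.
Annotation = Tuple[str, str, str, str]


def group_orthologs(annotations: List[Annotation]) -> List[str]:
    """Join orthologs of a gene that share the same reference and date."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for pombe_id, human_name, reference, date in annotations:
        groups[(pombe_id, reference, date)].append(human_name)

    lines = []
    # Dicts keep insertion order, so genes come out in input order.
    for (pombe_id, reference, date), names in groups.items():
        # Empty reference/date fields are kept so the columns stay in place.
        lines.append("\t".join([pombe_id, ",".join(names), reference, date]))
    return lines
```

With this grouping, SPAC6G9.04 would collapse to a single `PSD,PSD2,PSD3,PSD4` line, while the two SPAC25H1.06 orthologs would stay on separate lines because their references differ.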

kimrutherford commented 3 months ago

For maintaining the manual orthologs I would rather use the names.

Just to double check, when I moved the orthologs out of the contigs I asked about this and you thought IDs would be better:

IDs. I like names, but IDs makes more sense...

Did I misunderstand?

ValWood commented 3 months ago

No, I did a U-turn. When I was thinking about the question, I was thinking about the files we would export, and I guess these should use IDs rather than names (which I think they do?). But for curating, names are MUCH easier and have other advantages too (human readability).

ValWood commented 3 months ago

Sorry about that, I wasn't thinking about the curation part!

ValWood commented 3 months ago

What should we do with the references, dates and qualifiers in those cases?

Hmm, I thought that each gene only had curation in one place. Usually when I update, I always ensure that the existing annotations are still valid, so the date is really a timestamp that applies to all of the orthologs in the set (if there are multiple).

It is likely that any reference would apply to the set too. I can check this is the case. If ever a reference does not apply to all of them, then in these cases I can split the lines.

Does that make sense?

kimrutherford commented 3 months ago

But for curating names is MUCH easier and has other advantages too. (human readability)

OK, no problem. I'll change the files to use names tomorrow.

In the exporting, the cerevisiae file contains IDs (YPL041C etc.) but the human export file contains names.

There's a comment about the cerevisiae file here: https://github.com/pombase/pombase-chado/issues/960#issuecomment-1094707218

kimrutherford commented 3 months ago

Does that make sense?

Yep, that makes sense.

What should we do about the dates, like in this case:

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

Should the date apply to all three ortholog annotations?

ValWood commented 3 months ago

In the download, human uses IDs (i.e. HGNC:18585). This is the closest we have to a locus tag/systematic identifier (human genes don't have these).

We use a systematic identifier for pombe and S. cerevisiae, so I think we are as consistent as we can be for downloads.

ValWood commented 3 months ago

SPBC660.15  HGNC:7330,HGNC:2683
SPBC660.15  HGNC:18585          2010-02-17

I'm confused about this because it implies the orthologs were curated in different places. I thought I always curated them in the same place so there might be an issue here. I will check this one. Are there others like this?

ValWood commented 3 months ago

For this one, SPBC660.15 MSI,MSI2 PANTHER:PTHR48032 2024-08-22 will solve it.

Are there others like this?

kimrutherford commented 3 months ago

Are there others like this?

Here are all the cases where there is an ortholog with a date and the pombe gene also has an ortholog annotation without a date:

pombe          human                             date        file                              reference
SPAPB1A10.07c  HGNC:32237,HGNC:18825                         compara_orths.tsv
SPAPB1A10.07c  HGNC:13464                        2010-09-08  human_orthologs_from_contigs.tsv
SPAPB1A10.07c  HGNC:23231                        2010-09-08  human_orthologs_from_contigs.tsv
SPAPB1A10.07c  HGNC:11699                        2010-09-08  human_orthologs_from_contigs.tsv
SPACUNK4.08    HGNC:3010,HGNC:3590,HGNC:20823                compara_orths.tsv
SPACUNK4.08    HGNC:3009                         2010-08-27  human_orthologs_from_contigs.tsv
SPAC20H4.03c   HGNC:28277,HGNC:11614,HGNC:11615              compara_orths.tsv
SPAC20H4.03c   HGNC:11612                        2010-02-17  human_orthologs_from_contigs.tsv
SPBC660.15     HGNC:7330,HGNC:2683                           compara_orths.tsv
SPBC660.15     HGNC:18585                        2010-02-17  human_orthologs_from_contigs.tsv
SPBC23E6.09    HGNC:29012,HGNC:12638                         compara_orths.tsv
SPBC23E6.09    HGNC:12637                        2010-02-17  human_orthologs_from_contigs.tsv
SPAC13G6.04    HGNC:11817                        2012-03-04  human_orthologs_from_contigs.tsv
SPAC13G6.04    HGNC:11818                                    compara_orths.tsv
SPAC13A11.04c  HGNC:23086                                    compara_orths.tsv
SPAC13A11.04c  HGNC:12621                        2010-10-01  human_orthologs_from_contigs.tsv
SPAC1565.07c   HGNC:30688                        2010-09-19  human_orthologs_from_contigs.tsv
SPAC1565.07c   HGNC:30689                                    compara_orths.tsv
SPAC5D6.13     HGNC:24882                                    compara_orths.tsv
SPAC5D6.13     HGNC:15452                        2009-08-17  human_orthologs_from_contigs.tsv
SPAC23C11.17   HGNC:14648                                    compara_orths.tsv
SPAC23C11.17   HGNC:6556                         2010-06-14  human_orthologs_from_contigs.tsv
SPAC25H1.06    HGNC:9890                                     compara_orths.tsv
SPAC25H1.06    HGNC:9887                         2009-07-23  human_orthologs_from_contigs.tsv  PMID:18761674
SPAC688.03c    HGNC:28658                                    compara_orths.tsv
SPAC688.03c    HGNC:467                          2010-07-18  human_orthologs_from_contigs.tsv
SPAC688.11     HGNC:18415                                    compara_orths.tsv
SPAC688.11     HGNC:4913                         2010-02-17  human_orthologs_from_contigs.tsv
SPAC959.02     HGNC:15751                                    compara_orths.tsv
SPAC959.02     HGNC:7641                         2008-04-14  human_orthologs_from_contigs.tsv
SPAC8C9.11     HGNC:32479                                    compara_orths.tsv
SPAC8C9.11     HGNC:29488                        2010-07-18  human_orthologs_from_contigs.tsv
SPAC19G12.07c  HGNC:20155                                    compara_orths.tsv
SPAC19G12.07c  HGNC:15923                        2008-06-24  human_orthologs_from_contigs.tsv
SPAC4F10.15c   HGNC:12735                                    compara_orths.tsv
SPAC4F10.15c   HGNC:12731                        2010-02-17  human_orthologs_from_contigs.tsv
SPBC800.10c    HGNC:3419                         2010-02-17  human_orthologs_from_contigs.tsv
SPBC800.10c    HGNC:24634                                    compara_orths.tsv
SPBC902.04     HGNC:29243                                    compara_orths.tsv
SPBC902.04     HGNC:20327                        2009-10-04  human_orthologs_from_contigs.tsv
SPBC354.03     HGNC:12757                        2009-08-01  human_orthologs_from_contigs.tsv
SPBC354.03     HGNC:17826                                    compara_orths.tsv
SPBC119.06     HGNC:10603                        2011-04-07  human_orthologs_from_contigs.tsv
SPBC119.06     HGNC:10604                                    compara_orths.tsv
SPBC337.13c    HGNC:19901                                    compara_orths.tsv
SPBC337.13c    HGNC:16963                        2008-04-14  human_orthologs_from_contigs.tsv
SPBC19C2.12    HGNC:14517                                    compara_orths.tsv
SPBC19C2.12    HGNC:14027                        2021-02-28  human_orthologs_from_contigs.tsv
SPBC28E12.06c  HGNC:20751                        2010-02-22  human_orthologs_from_contigs.tsv
SPBC28E12.06c  HGNC:29323                                    compara_orths.tsv
SPBC32F12.04   HGNC:12417                        2012-03-03  human_orthologs_from_contigs.tsv
SPBC32F12.04   HGNC:12419                                    compara_orths.tsv
SPBC1703.14c   HGNC:29787                                    compara_orths.tsv
SPBC1703.14c   HGNC:11986                        2010-02-17  human_orthologs_from_contigs.tsv
SPBC19F8.03c   HGNC:15514                        2010-02-17  human_orthologs_from_contigs.tsv
SPBC19F8.03c   HGNC:14986                                    compara_orths.tsv
SPBC2G2.12     HGNC:4816                                     compara_orths.tsv
SPBC2G2.12     HGNC:4817                         2011-10-26  human_orthologs_from_contigs.tsv
SPBC1604.21c   HGNC:12471                                    compara_orths.tsv
SPBC1604.21c   HGNC:12469                        2008-04-14  human_orthologs_from_contigs.tsv
SPBC1347.12    HGNC:167                          2010-07-02  human_orthologs_from_contigs.tsv
SPBC1347.12    HGNC:168                                      compara_orths.tsv
SPCC1682.07    HGNC:4656                         2011-01-13  human_orthologs_from_contigs.tsv
SPCC1682.07    HGNC:31394                                    compara_orths.tsv

inconsistent_dates.tsv.txt
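
A check like the one behind this list could be sketched as follows. It is a minimal illustration and not the query that was actually run: the `Row` layout and the function name `genes_with_mixed_dates` are hypothetical, and an empty string is assumed to stand for a missing date.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (pombe_id, human_ids, date), where date is "" when the annotation has none.
Row = Tuple[str, str, str]


def genes_with_mixed_dates(rows: List[Row]) -> List[str]:
    """Return genes that have both dated and undated ortholog annotations."""
    seen: Dict[str, Set[bool]] = defaultdict(set)
    for pombe_id, _human_ids, date in rows:
        seen[pombe_id].add(bool(date))
    # A gene is inconsistent when both True (dated) and False (undated) occur.
    return sorted(gene for gene, flags in seen.items() if len(flags) == 2)
```

Genes whose annotations are all dated (such as SPCC622.09 earlier in the thread) or all undated are not reported, which is why pairs where neither annotation has a date do not show up in a list built this way.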

ValWood commented 3 months ago

We can just delete the dates from those.

ValWood commented 3 months ago

No references? I would check if they could apply to all, or find a replacement

kimrutherford commented 3 months ago

No references? I would check if they could apply to all, or find a replacement

Only one has a reference. You may need to scroll the table to the right:

SPAC25H1.06     HGNC:9890       compara_orths.tsv   
SPAC25H1.06     HGNC:9887   2009-07-23  human_orthologs_from_contigs.tsv    PMID:18761674

Three genes were missed from the previous list because there is no date on either annotation:

pombe          human       file
SPAPB17E12.11  HGNC:28880  compara_orths.tsv
SPAPB17E12.11  HGNC:30242  human_orthologs_from_contigs.tsv
SPBC106.11c    HGNC:8579   compara_orths.tsv
SPBC106.11c    HGNC:9040   human_orthologs_from_contigs.tsv
SPBC2D10.12    HGNC:9813   compara_orths.tsv
SPBC2D10.12    HGNC:9812   human_orthologs_from_contigs.tsv

ValWood commented 3 months ago

SPAPB17E12.11 MAGT1,TUSC3 PANTHER:PTHR12692 2024-08-22
SPBC106.11c PAFAH2,PLA2G7 HGNC:9040 PANTHER:PTHR10272 2024-08-22
SPBC2D10.12 RAD23S,RAD23B PANTHER:PTHR10621 2024-08-22

I haven't edited the files because I'm not sure what stage you are at. (In all cases the PANTHER reference supports the 1:2 mapping.)

ValWood commented 3 months ago

We can remove the reference PMID:18761674 from SPAC25H1.06. They only point out that it is the best hit, so we don't lose important info by removing it, and it makes it easier to keep them aligned.

We can instead use PANTHER:PTHR22850 to support SPAC25H1.06, SPCC1672.10 and SPAC29A4.18.

kimrutherford commented 3 months ago

We can remove the reference PMID:18761674 from SPAC25H1.06

OK, I've done that.

I'm planning to change the HGNC: IDs to names in the files in the orthologs directory today. I'll let you know how that goes.

In the download human used IDs (I.e. HGNC:18585 ). This is the closest we have to a locus tag/systematic identifier (human don't have these)

So we should change the exported file to use IDs instead of names?

ValWood commented 3 months ago

The exported file already uses the IDs.

kimrutherford commented 3 months ago

I was looking at the main export file: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/pombase-latest.human-orthologs.txt.gz

We have a file with the IDs too: https://curation.pombase.org/dumps/builds/pombase-build-2024-08-22/exports/pombe-human-orthologs-with-systematic-ids.txt.gz

Should we change the main export file to use IDs?

ValWood commented 3 months ago

I'm looking in here (the directory linked from the ortholog download page). But I see we have both in here. We should decide on Tuesday.

kimrutherford commented 3 months ago

Yep, let's chat about it on Tuesday.

In the new release directory structure we have the file with the gene names: https://www.pombase.org/public_releases/pombase-2024-06-01/curated_orthologs/

kimrutherford commented 3 months ago

can we change the curated files to SPAC6G9.04 PSD,PSD2,PSD3,PSD4

I'm doing that now and changing IDs to names at the same time (in human_orthologs_from_contigs.tsv and compara_orths.tsv).

Should I combine the 4 human ortholog files into one while I do that?

kimrutherford commented 3 months ago

Here's how the human ortholog file would look after fixing the names and combining the 4 files: https://curation.pombase.org/kmr44/new_orths_table.tsv
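
The ID-to-name substitution discussed in this thread could look roughly like the sketch below. It is an assumption-laden illustration rather than the conversion that was actually run: the function name `ids_to_names` and the `id_to_name` lookup table are hypothetical, and a real table would have to be built from a current HGNC download. The two mappings used in the usage note (HGNC:9887 to RBBP4, HGNC:9890 to RBBP7) are taken from the SPAC25H1.06 rows quoted earlier.

```python
from typing import Dict


def ids_to_names(human_field: str, id_to_name: Dict[str, str]) -> str:
    """Replace each HGNC ID in a comma-separated field with its gene name."""
    converted = []
    for identifier in human_field.split(","):
        identifier = identifier.strip()
        # Unknown IDs are left as-is so they can be spotted and fixed by hand.
        converted.append(id_to_name.get(identifier, identifier))
    return ",".join(converted)
```

For example, `ids_to_names("HGNC:9887,HGNC:9890", {"HGNC:9887": "RBBP4", "HGNC:9890": "RBBP7"})` gives `"RBBP4,RBBP7"`. Leaving unmapped IDs untouched is a deliberate choice here, so that nothing is silently dropped from the combined file.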

kimrutherford commented 3 months ago

Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?

ValWood commented 3 months ago

Looks good. This makes me happy, only one place to go!

ValWood commented 3 months ago

Should we change the IDs to names in pombe-embl/orthologs/cerevisiae_orthologs_from_contigs.tsv?

I hadn't thought about that, but I think it would be easier for curation...

kimrutherford commented 3 months ago

I hadn't thought about that, but I think it would be easier for curation...

OK, I'll change that file and combine the human ortholog files (with names substituted for the IDs). I'll double check tomorrow that nothing has gone too wrong.

kimrutherford commented 3 months ago

I'll double check tomorrow that nothing has gone too wrong.

I ran it on my desktop and it all worked fine so I think it will be OK on the main site in the morning.

kimrutherford commented 3 months ago

All fixed and released now.

I've moved the old files into pombe-embl/orthologs/old_orthologs.