pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Move orthologs out of contig files #993

Closed kimrutherford closed 1 month ago

kimrutherford commented 2 years ago

From https://github.com/pombase/curation/issues/3268#issuecomment-1154199303

some time we should get everything ortholog related out of Artemis and into one place. Maybe we should do this at the same time as the disease annotation is streamlined. Don't start it yet though. We can wait until my pending deadlines are done and the GI interface is finished. Also before we start this I want to get the genome resubmission done. It will look better for the GBDR application if the data sharing is up to date. Also, after this week there are unlikely to be so many updates of either diseases or orthologs for a while.

kimrutherford commented 1 month ago

I started looking into this but there are a couple of tasks to do first.

Some of the orthologs in the contigs have db_xref qualifiers with PMIDs. These are stored in Chado and displayed on the website. An example: https://www.pombase.org/gene/SPAC23G3.08c

FT                   /controlled_curation="term=human USP45 ortholog;
FT                   db_xref=PMID:20838651; date=20101001"

The code that reads the files from pombe-embl/orthologs doesn't support db_xrefs, so it will need updating.

The second problem is that there are some extra qualifiers that we store in Chado but which we don't have columns for in the ortholog files:

FT                   /controlled_curation="term=human NUBP2 ortholog;
FT                   qualifier=SPAC806.02c,N-term; date=20140831"

I've written the human orthologs from the contigs to this file in case it's useful: orthologs/human_orthologs_from_contigs.txt

The code that writes the ortholog files doesn't support db_xrefs or qualifiers, so this new file doesn't have them yet.
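To make the two cases above concrete, here is a minimal sketch of how a wrapped /controlled_curation value could be split into its fields, so that db_xref and qualifier survive alongside term and date. It is only an illustration in Python, not the pombase-chado loader code; the function name parse_controlled_curation is hypothetical.

```python
import re


def parse_controlled_curation(value):
    """Split a (possibly line-wrapped) /controlled_curation qualifier
    into a dict of its "key=value" fields.

    Hypothetical helper for illustration; not part of pombase-chado.
    """
    # Join wrapped feature-table lines, dropping the leading "FT" column.
    joined = " ".join(
        re.sub(r"^FT\s+", "", line).strip() for line in value.splitlines()
    )
    # Drop the qualifier name and the surrounding double quotes.
    joined = re.sub(r"^/controlled_curation=", "", joined).strip().strip('"')

    fields = {}
    for part in joined.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"no '=' in field: {part!r}")
        # A repeated key overwrites the earlier value in this sketch.
        fields[key.strip()] = val.strip()
    return fields


example = (
    'FT                   /controlled_curation="term=human USP45 ortholog;\n'
    'FT                   db_xref=PMID:20838651; date=20101001"'
)
print(parse_controlled_curation(example))
```

The qualifier value (for example "SPAC806.02c,N-term") is kept as a single string here; how it should map onto columns in the ortholog files is still an open question.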

ValWood commented 1 month ago

Ah OK I probably don't need to do this: https://github.com/pombase/curation/issues/3455

It seems that it is already done, just that the info isn't in the file.

kimrutherford commented 1 month ago

Ah OK I probably don't need to do this: https://github.com/pombase/curation/issues/3455

If they're in the contig files, they'll be loaded.

kimrutherford commented 1 month ago

Hi Val.

Should we use gene names or systematic IDs for human and cerevisiae in the new TSV files?


Note to self:

Query all orthologs from Chado, with qualifiers, references and dates:

SELECT obj.uniquename, COALESCE(subj.name, subj.uniquename), subj_organism.species,
  (SELECT value FROM feature_relationshipprop qual_p
   WHERE qual_p.type_id IN
       (SELECT cvterm_id FROM cvterm WHERE name = 'ortholog_qualifier')
     AND qual_p.feature_relationship_id = rel.feature_relationship_id) AS qualifier,
  (SELECT pub.uniquename FROM pub
   JOIN feature_relationship_pub fcp ON rel.feature_relationship_id = fcp.feature_relationship_id
   WHERE pub.pub_id = fcp.pub_id) AS ref_uniquename,
  (SELECT value FROM feature_relationshipprop qual_p
   WHERE qual_p.type_id IN
       (SELECT cvterm_id FROM cvterm WHERE name = 'date')
     AND qual_p.feature_relationship_id = rel.feature_relationship_id) AS date
FROM feature_relationship rel
JOIN feature subj ON subj.feature_id = rel.subject_id
JOIN feature obj ON obj.feature_id = rel.object_id
JOIN organism subj_organism ON subj_organism.organism_id = subj.organism_id
WHERE rel.type_id IN (SELECT cvterm_id FROM cvterm WHERE name = 'orthologous_to');

ValWood commented 1 month ago

I think that makes sense.

kimrutherford commented 1 month ago

I think that makes sense.

Which option should we go for? Gene names or IDs?

ValWood commented 1 month ago

IDs. I like names, but IDs make more sense...

ValWood commented 1 month ago

...in downloads

kimrutherford commented 1 month ago

OK, thanks.

So it will be "YOR387C" etc. for cerevisiae and "HGNC:5211" etc. for human.
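A rough sketch of a sanity check for the new TSV files, so that gene names don't slip in where IDs are expected. The two patterns are assumptions: the cerevisiae one only covers standard nuclear ORF names like YOR387C, and the HGNC one just expects "HGNC:" followed by digits. The function name identifier_kind is made up for the example.

```python
import re

# Assumed pattern for S. cerevisiae nuclear ORF systematic names, e.g. YOR387C:
# "Y", chromosome letter A-P, arm L/R, three digits, strand W/C, optional suffix.
CEREVISIAE_ORF_RE = re.compile(r"Y[A-P][LR]\d{3}[WC](?:-[A-Z])?")

# Assumed pattern for HGNC identifiers, e.g. HGNC:5211.
HGNC_ID_RE = re.compile(r"HGNC:\d+")


def identifier_kind(identifier):
    """Return "human", "cerevisiae", or None if neither pattern matches."""
    if HGNC_ID_RE.fullmatch(identifier):
        return "human"
    if CEREVISIAE_ORF_RE.fullmatch(identifier):
        return "cerevisiae"
    return None


for ident in ["YOR387C", "HGNC:5211", "USP45"]:
    print(ident, identifier_kind(ident))
```

Anything that comes back as None (a gene name such as USP45, for instance) would need checking by hand before it goes into the file.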

kimrutherford commented 1 month ago

Perl to remove the cerevisiae orthologs from the contig files:

# For each contig file, drop the S. cerevisiae ortholog qualifiers; the removed
# text goes to stderr and is collected in cerevisiae_orth_curations.txt
(for i in *.contig
do
perl -ne 'BEGIN { $orth = ""; $in = 0; } if ($orth && !$in) { warn "$orth\n"; $orth = "" } if (/controlled_curation="term=orthologous to S. cerevisiae/) { $in = 1; } if ($in) { s/^FT\s+//; chomp $_; $orth .= "$_ "; } else { print }; if ($in && /"$/) { $in = 0 }' "$i" > "$i.new" && mv "$i.new" "$i"
done) 2> cerevisiae_orth_curations.txt

Human orthologs:

# Same filter for human orthologs, skipping qualifiers whose first line
# mentions "family" or "retinoblastoma"
(for i in *.contig
do
perl -ne 'BEGIN { $orth = ""; $in = 0; } if ($orth && !$in) { warn "$orth\n"; $orth = "" } if (/controlled_curation="term=human / && !/family|retinoblastoma/) { $in = 1; } if ($in) { s/^FT\s+//; chomp $_; $orth .= "$_ "; } else { print }; if ($in && /"$/) { $in = 0 }' "$i" > "$i.new" && mv "$i.new" "$i"
done) 2> human-orths.txt
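For anyone who finds the one-liners hard to read, this is roughly the same logic spelled out in Python. It is a sketch for explanation only, not what was run: the function names are hypothetical, and unlike the Perl it leaves no trailing space on the removed text and also flushes a qualifier that is still open at the end of the input.

```python
import re


def is_human_ortholog_start(text):
    """True for the first line of a human ortholog qualifier, skipping
    lines that mention "family" or "retinoblastoma" (as the Perl does)."""
    return ('controlled_curation="term=human ' in text
            and not re.search(r"family|retinoblastoma", text))


def split_orthologs(lines, is_start):
    """Return (kept_lines, removed_qualifiers).

    A qualifier starts on a line where is_start() is true and ends on the
    first line (possibly the same one) that ends with a double quote.
    """
    kept, removed = [], []
    current = None  # pieces of the qualifier being collected, if any
    for line in lines:
        text = line.rstrip("\n")
        if current is None and is_start(text):
            current = []
        if current is None:
            kept.append(line)
            continue
        # Inside a qualifier: drop the leading "FT" column and collect it.
        current.append(re.sub(r"^FT\s+", "", text))
        if text.endswith('"'):
            removed.append(" ".join(current))
            current = None
    if current is not None:  # unterminated qualifier at end of input
        removed.append(" ".join(current))
    return kept, removed


sample = [
    'FT                   /systematic_id="SPAC23G3.08c"\n',
    'FT                   /controlled_curation="term=human USP45 ortholog;\n',
    'FT                   db_xref=PMID:20838651; date=20101001"\n',
]
kept, removed = split_orthologs(sample, is_human_ortholog_start)
print(removed)
```

The kept lines correspond to what the one-liner prints back into the contig file, and the removed list corresponds to what it sends to stderr.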

kimrutherford commented 1 month ago

We have qualifier=predicted on a few orthologs. They aren't displayed at the moment. Should we keep them when we move the orthologs?

Example, SPCC1259.14c:

FT                   /controlled_curation="term=human CRISPLD2 ortholog;
FT                   qualifier=predicted; date=20100919"

Here's the full list:

/controlled_curation="term=human TERF1 and TERF2 ortholog; qualifier=predicted; db_xref=PMID:20923774; date=20101008"
/controlled_curation="term=human RECQL4 ortholog; qualifier=predicted; date=20100820"
/controlled_curation="term=human LRRC57 ortholog; qualifier=predicted; date=20080414"
/controlled_curation="term=human CRISPLD2 ortholog; qualifier=predicted; date=20100919"

ValWood commented 1 month ago

Please remove, they are all "predicted" really, at different levels of confidence...

kimrutherford commented 1 month ago

OK, thanks.

That was the last thing I wanted to check, so I'm going to move the orthologs out of the contig files this morning. If you have any contig file changes pending, could you check them in and let me know when you're done?

kimrutherford commented 1 month ago

I forgot about this issue:

Most of it can be fixed after moving to the TSV file, except for:

SPAC1782.04/cox24 is like this with "(C-term)" / "(N-term)" after the gene ID:

FT                   /controlled_curation="term=orthologous to S. cerevisiae
FT                   YLR204W (C-term); date=19700101"
FT                   /controlled_curation="term=orthologous to S. cerevisiae
FT                   YNL295W (N-term); date=20120912"
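As an illustration of the problem with these two, a small sketch of how the "(C-term)" / "(N-term)" suffix could be separated from the gene ID once the term text has been extracted. The regular expression and the name split_cerevisiae_term are assumptions made for this example, not existing loader code.

```python
import re

# Assumed shape of the term text: "orthologous to S. cerevisiae <ID>",
# optionally followed by a parenthesised note such as "(C-term)".
CEREVISIAE_TERM_RE = re.compile(
    r"^orthologous to S\. cerevisiae\s+(\S+)(?:\s+\(([^)]+)\))?$"
)


def split_cerevisiae_term(term):
    """Return (gene_id, note); note is None when there is no suffix."""
    match = CEREVISIAE_TERM_RE.match(term.strip())
    if match is None:
        raise ValueError(f"unrecognised term: {term!r}")
    return match.group(1), match.group(2)


print(split_cerevisiae_term("orthologous to S. cerevisiae YLR204W (C-term)"))
print(split_cerevisiae_term("orthologous to S. cerevisiae YNL295W (N-term)"))
```

The note could then go into a qualifier column rather than staying attached to the ID.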

ValWood commented 1 month ago

OK, I'm up to date. Move them all except SPAC1782.04 and I'll migrate that manually later today.

kimrutherford commented 1 month ago

Move them all except SPAC1782.04 and I'll migrate that manually later today

OK, I've done that and committed the changes to SVN. I'll have a look at things in the morning to make sure it's all OK.

kimrutherford commented 1 month ago

It all looks OK this morning. The ortholog export files are identical to yesterday:

ValWood commented 1 month ago

I just manually migrated SPAC1782.04. I have some tidying and checking to do for the migrated ones. That is on my list.