pombase / curation

PomBase curation
Remove some redundant Malacards which already have a PomBase entry. #3618

Open ValWood opened 4 years ago

ValWood commented 4 years ago

Remove some redundant Malacards which already have a PomBase entry.

I am finding quite a few misassignments in these. I trust the ones we have checked ourselves more, so it would be good to filter the Malacards ones out over time (for example, as we add good publications connected to disease modelling). How easy would that be?

kimrutherford commented 3 years ago

> so it would be good to filter them over time

Do we just need to remove MalaCards annotations where there is an identical PomBase annotation?

ValWood commented 3 years ago

Yes please. The fewer Malacards annotations the better: although the coverage is good, there are a lot of false positives (they are probably semi-automated). I know that the ones we have done ourselves are correct.

It's also fewer places to fix if MONDO changes.
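As a rough illustration of the filtering rule agreed above (not the PomBase loader code), here is a minimal Python sketch. It assumes that "identical" means the same pombe gene and the same MONDO term, and that each annotation already carries a source label. The `Annotation` type, the `"PomBase"` / `"Malacards"` labels and all IDs are hypothetical placeholders.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str    # pombe systematic ID (placeholder values below)
    term: str    # MONDO term ID (placeholder values below)
    source: str  # "PomBase" or "Malacards" (hypothetical labels)


def drop_redundant_malacards(annotations):
    """Keep every PomBase annotation; keep a Malacards annotation only
    if no PomBase annotation exists for the same (gene, term) pair."""
    curated = {(a.gene, a.term) for a in annotations if a.source == "PomBase"}
    return [
        a for a in annotations
        if a.source == "PomBase" or (a.gene, a.term) not in curated
    ]


# Made-up example data, not real PomBase annotations.
example = [
    Annotation("gene_A", "MONDO:EXAMPLE1", "PomBase"),
    Annotation("gene_A", "MONDO:EXAMPLE1", "Malacards"),  # redundant
    Annotation("gene_A", "MONDO:EXAMPLE2", "Malacards"),  # kept: no PomBase match
    Annotation("gene_B", "MONDO:EXAMPLE1", "PomBase"),
]

filtered = drop_redundant_malacards(example)
```

Malacards annotations without a PomBase counterpart are left in place, so coverage is not reduced; only the exact duplicates go.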

kimrutherford commented 3 years ago

OK. I'll have a think about how to do that. It will need to check each ortholog if there is more than one and then warn unless the disease association is the same for each pombe gene.
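One possible reading of that ortholog check, sketched in Python: group the annotations by human gene and flag any human gene whose pombe orthologs do not all carry the same set of disease terms. The function name, the tuple layout and the IDs are assumptions made for illustration. An ortholog with no disease annotation at all would not appear in the input, so this sketch would not catch that case.

```python
from collections import defaultdict


def inconsistent_orthologs(rows):
    """rows: iterable of (human_gene, pombe_gene, mondo_term) tuples.

    Returns {human_gene: {pombe_gene: set_of_terms}} for every human gene
    that has more than one pombe ortholog in the input and whose orthologs
    do not all share the same set of disease terms.
    """
    by_human = defaultdict(lambda: defaultdict(set))
    for human, pombe, term in rows:
        by_human[human][pombe].add(term)

    problems = {}
    for human, per_pombe in by_human.items():
        term_sets = list(per_pombe.values())
        if len(term_sets) > 1 and any(s != term_sets[0] for s in term_sets[1:]):
            problems[human] = {gene: set(terms) for gene, terms in per_pombe.items()}
    return problems


# Made-up example rows, not real data.
example_rows = [
    ("HUMAN1", "gene_A", "MONDO:EXAMPLE1"),
    ("HUMAN1", "gene_B", "MONDO:EXAMPLE1"),  # consistent: same term on both
    ("HUMAN2", "gene_C", "MONDO:EXAMPLE1"),
    ("HUMAN2", "gene_D", "MONDO:EXAMPLE2"),  # inconsistent: terms differ
]

problems = inconsistent_orthologs(example_rows)
for human, per_pombe in problems.items():
    print(f"warning: {human} orthologs disagree: {per_pombe}")
```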

ValWood commented 3 years ago

I'm not so bothered about the duplicates. If everything is in one place it will be so much easier to manage.

kimrutherford commented 3 years ago

I've added code to load the human gene name into Chado so it's now easy to make a table with all the details from both disease input files. It should make it easier to spot duplicates and to see which input file an annotation came from - if there is an ID in the human_gene column then the annotation comes from the Malacards data.

Here's what it looks like, sorted by pombe gene ID: all_disease_assoc_with_gene_names.tsv.txt
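The source rule described above (a value in the human_gene column means the row came from the Malacards data) can be sketched in Python as follows. This assumes the TSV has no header row, uses the column order of the query's SELECT list, and writes NULL as an empty field; none of that is confirmed here. The `label_sources` helper and the sample rows are hypothetical.

```python
import csv
import io

# Column order assumed to match the SELECT list of the query.
COLUMNS = [
    "systematic_id", "pombe_gene_name", "human_gene", "pmid",
    "mondo_term_id", "term_name", "malacards_disease_name",
]


def label_sources(tsv_text):
    """Parse the TSV text and add a 'source' key to each row:
    'Malacards' if human_gene is non-empty, otherwise 'PomBase'."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    rows = []
    for row in reader:
        row["source"] = "Malacards" if row["human_gene"] else "PomBase"
        rows.append(row)
    return rows


# Two made-up rows: the first has a human gene ID, the second does not.
sample = "\n".join([
    "\t".join(["gene_A", "abc1", "HUMAN1", "PMID:EXAMPLE",
               "MONDO:EXAMPLE1", "example disease", "example disease name"]),
    "\t".join(["gene_A", "abc1", "", "PMID:EXAMPLE",
               "MONDO:EXAMPLE1", "example disease", ""]),
])

labeled = label_sources(sample)
```

Rows labelled this way could then be compared on systematic_id and mondo_term_id to spot duplicates between the two input files.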

Note to self, generated with this query:

SELECT f.uniquename AS systematic_id,
       f.name AS pombe_gene_name,
  (SELECT value
   FROM feature_cvtermprop p
   JOIN cvterm pt ON pt.cvterm_id = p.type_id
   WHERE p.feature_cvterm_id = fc.feature_cvterm_id
     AND pt.name = 'malacards_human_source_gene') AS human_gene,
       pub.uniquename AS pmid,
       db.name || ':' || x.accession AS mondo_term_id,
       t.name AS term_name,
  (SELECT value
   FROM cvtermprop p
   JOIN cvterm pt ON pt.cvterm_id = p.type_id
   WHERE p.cvterm_id = t.cvterm_id
     AND pt.name = 'malacards_disease_name') AS malacards_disease_name
FROM feature_cvterm fc
JOIN feature f ON f.feature_id = fc.feature_id
JOIN cvterm t ON t.cvterm_id = fc.cvterm_id
JOIN dbxref x ON t.dbxref_id = x.dbxref_id
JOIN db ON x.db_id = db.db_id
JOIN cv ON cv.cv_id = t.cv_id
JOIN pub ON fc.pub_id = pub.pub_id
WHERE cv.name = 'mondo'
ORDER BY f.uniquename, pub.uniquename, t.name;

ValWood commented 3 years ago

Thanks for this. It's not such a big deal. I can chip away slowly at the Malacards false positives, and I can probably get rid of any remaining issues when I do the Alliance comparison.

ValWood commented 3 years ago

But if it is easy to remove all Malacards duplicates that would be cool....

ValWood commented 8 months ago

I think this will be made obsolete by the other disease work that has been proposed. I'll take this over to the curation tracker.