Open ValWood opened 4 years ago
Do we just need to remove MalaCards annotations where there is an identical PomBase annotation?
Yes please. The fewer MalaCards annotations the better: although the coverage is good, there are a lot of false positives (they are probably semi-automated). I know that the ones we have done ourselves are correct.
It's also fewer places to fix if MONDO changes.
OK. I'll have a think about how to do that. It will need to check each ortholog if there is more than one and then warn unless the disease association is the same for each pombe gene.
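A minimal sketch of that check, not the actual loader code. It assumes each annotation is available as a dict with the keys `systematic_id`, `human_gene` and `mondo_term_id`, that an empty `human_gene` means a PomBase-curated annotation, and that a human gene with several pombe orthologs appears as one MalaCards row per ortholog. The function name and the "same gene, same MONDO term" definition of identical are assumptions made for illustration.

```python
from collections import defaultdict


def classify_malacards_groups(rows):
    """Split MalaCards (human gene, MONDO term) pairs into removable and warn lists.

    rows: iterable of dicts with keys systematic_id, human_gene, mondo_term_id.
    An empty human_gene is taken to mean a PomBase-curated annotation (assumption).
    """
    curated = set()              # (pombe gene, MONDO term) pairs curated by PomBase
    groups = defaultdict(set)    # (human gene, MONDO term) -> pombe orthologs annotated

    for row in rows:
        if row["human_gene"]:
            key = (row["human_gene"], row["mondo_term_id"])
            groups[key].add(row["systematic_id"])
        else:
            curated.add((row["systematic_id"], row["mondo_term_id"]))

    removable, warnings = [], []
    for (human_gene, term), pombe_genes in sorted(groups.items()):
        covered = {g for g in pombe_genes if (g, term) in curated}
        if covered == pombe_genes:
            # Every ortholog already has the same curated disease association.
            removable.append((human_gene, term))
        elif covered:
            # Only some orthologs are covered: report the ones that are not.
            warnings.append((human_gene, term, sorted(pombe_genes - covered)))
    return removable, warnings
```

Pairs with no curated counterpart at all are left alone, so nothing is dropped unless PomBase already holds the same association for every ortholog.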
I'm not so bothered about the duplicates. If everything is in one place it will be so much easier to manage.
I've added code to load the human gene name into Chado so it's now easy to make a table with all the details from both disease input files. It should make it easier to spot duplicates and to see which input file an annotation came from - if there is an ID in the human_gene column then the annotation comes from the MalaCards data.
Here's what it looks like, sorted by pombe gene ID: all_disease_assoc_with_gene_names.tsv.txt
Note to self, generated with this query:
SELECT f.uniquename AS systematic_id,
       f.name AS pombe_gene_name,
       (SELECT value
          FROM feature_cvtermprop p
          JOIN cvterm pt ON pt.cvterm_id = p.type_id
         WHERE p.feature_cvterm_id = fc.feature_cvterm_id
           AND pt.name = 'malacards_human_source_gene') AS human_gene,
       pub.uniquename AS pmid,
       db.name || ':' || x.accession AS mondo_term_id,
       t.name AS term_name,
       (SELECT value
          FROM cvtermprop p
          JOIN cvterm pt ON pt.cvterm_id = p.type_id
         WHERE p.cvterm_id = t.cvterm_id
           AND pt.name = 'malacards_disease_name') AS malacards_disease_name
  FROM feature_cvterm fc
  JOIN feature f ON f.feature_id = fc.feature_id
  JOIN cvterm t ON t.cvterm_id = fc.cvterm_id
  JOIN dbxref x ON t.dbxref_id = x.dbxref_id
  JOIN db ON x.db_id = db.db_id
  JOIN cv ON cv.cv_id = t.cv_id
  JOIN pub ON fc.pub_id = pub.pub_id
 WHERE cv.name = 'mondo'
 ORDER BY f.uniquename, pub.uniquename, t.name;
Thanks for this. It's not such a big deal: I can chip away slowly at the MalaCards false positives, and I can probably get rid of any remaining issues when I do the Alliance comparison.
But if it is easy to remove all MalaCards duplicates, that would be cool....
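For reference, a rough sketch of how the duplicates could be listed from the exported table, assuming the TSV has a header row using the column aliases from the query above and that an empty human_gene field marks a PomBase-curated row. The function name and the sample IDs are made up for illustration, and "duplicate" here means the same pombe gene and the same MONDO term.

```python
import csv
import io


def find_malacards_duplicates(tsv_file):
    """Return MalaCards rows whose (gene, MONDO term) pair is also PomBase-curated."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))

    def from_malacards(row):
        # A non-empty human_gene column marks a MalaCards-derived annotation.
        return bool((row.get("human_gene") or "").strip())

    curated = {
        (r["systematic_id"], r["mondo_term_id"])
        for r in rows
        if not from_malacards(r)
    }
    return [
        r for r in rows
        if from_malacards(r) and (r["systematic_id"], r["mondo_term_id"]) in curated
    ]


# Made-up sample data, only to show the expected input shape.
sample = io.StringIO(
    "systematic_id\thuman_gene\tmondo_term_id\n"
    "geneA\tH1\tMONDO:1\n"
    "geneA\t\tMONDO:1\n"
    "geneB\tH2\tMONDO:2\n"
)
for dup in find_malacards_duplicates(sample):
    print(dup["systematic_id"], dup["human_gene"], dup["mondo_term_id"])
```

The rows it returns are candidates for removal from the MalaCards input file; it does not touch Chado.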
I think this will be made obsolete by the other disease work proposed, so I'll take this over to the curation tracker.
Remove some redundant Malacards which already have a PomBase entry.
I am finding quite a few misassignments in these. I trust the ones we have checked ourselves more, so it would be good to filter them over time (for example, as we add good publications connected to disease modelling). How easy would that be?