glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Process Alliance of Genome Resources Disease dataset #853

Open jeet-vora opened 11 months ago

jeet-vora commented 11 months ago

The previous Alliance of Genome Resources Disease dataset covered mouse and rat only and had no info header. The previous file was CSV; the new one is TSV. Two additional fields, ExperimentalCondition | Modifier, have also been added to the source file, but the overall format and content remain the same.

The new files cover rat, mouse, fruitfly and yeast, all in TSV. The corresponding changes have been made in dataset_masterlist.json.

Processing instructions

Source file folder path - /data/projects/glygen/downloads/alliancegenome/disease/current (a disease folder has been created to distinguish between disease and orthology)

Source input files - *_disease_genome_alliance.tsv

Other input files - mouse_protein_xref_mgi.csv | rat_protein_xref_rgd.csv | yeast_protein_xref_sgd.csv | dicty_protein_xref_dictybase.csv

Output files - *_disease_genome_alliance.csv

Step 1: Remove the header from tsv

##########################################################################
#
# Data type: Disease
# Data format: tsv
# README:
# Source: Alliance of Genome Resources   (Alliance)
# Source URL:   http://alliancegenome.org/downloads
# Help Desk: help@alliancegenome.org
# Orthology Filter: Stringent
# Taxon IDs: NCBITaxon:10090
# Species: Mus musculus
# Alliance Database Version: 6.0.0
# Date file generated (UTC): 2023-10-04   21:17
#
##########################################################################
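A minimal sketch of Step 1, assuming every info-header line starts with `#` (as in the sample above) and that no data line does. The column names and values in the sample input are placeholders, not the real Alliance columns.

```python
import io


def strip_info_header(lines):
    """Yield data lines only, skipping blank lines and '#' header lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        yield line.rstrip("\n")


# Placeholder input: a shortened header followed by two tab-separated lines.
sample = io.StringIO(
    "##########\n"
    "# Data type: Disease\n"
    "# Species: Mus musculus\n"
    "#\n"
    "##########\n"
    "col_a\tcol_b\n"
    "value_1\tvalue_2\n"
)

data_lines = list(strip_info_header(sample))
print(data_lines)
```

For a real file, the same generator can be fed an open file handle instead of the `StringIO` object.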

Step 2: Process the dataset as Rahi described previously to produce output like the following:

| uniprotkb_canonical_ac | xref_key | xref_id | do_id | mondo_id | mim_id |
| --- | --- | --- | --- | --- | --- |
| P43245-1 | protein_xref_genome_alliance | 1824 | 1824 | | |
| A0A0G2JVD3-1 | protein_xref_genome_alliance | 10763 | 10763 | | |
| Q62666-1 | protein_xref_genome_alliance | 365 | 365 | | |
| G3V9R2-1 | protein_xref_genome_alliance | 670 | 670 | | |
| P51638-1 | protein_xref_genome_alliance | 5844 | 5844 | 0012058 | 608557 |
| P11167-1 | protein_xref_genome_alliance | 3525 | 3525 | | |
| P14844-1 | protein_xref_genome_alliance | 9477 | 9477 | | |
| P98106-1 | protein_xref_genome_alliance | 1936 | 1936 | | |
| A7VJC2-1 | protein_xref_genome_alliance | 684 | 684 | 0007256 | 114550 |
| A0A8I6G721-1 | protein_xref_genome_alliance | 11476 | 11476 | 0005298 | 166710 |
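A rough sketch of how rows in this layout could be assembled. It is not the existing GlyGen pipeline: the function name, the `(gene_id, doid)` input pairs, the gene-to-canonical lookup (standing in for the `*_protein_xref_*.csv` files) and the DOID-to-xref lookup are all assumptions made for illustration. The only parts taken from the ticket are the six output columns and the P51638-1 example values.

```python
import csv
import io

OUTPUT_FIELDS = [
    "uniprotkb_canonical_ac", "xref_key", "xref_id",
    "do_id", "mondo_id", "mim_id",
]


def build_rows(associations, gene_to_canonical, doid_to_xrefs):
    """Turn (gene_id, doid) pairs into de-duplicated output rows."""
    rows, seen = [], set()
    for gene_id, doid in associations:
        ac = gene_to_canonical.get(gene_id)
        if ac is None:
            continue  # gene has no canonical accession in the xref lookup
        do_id = doid.split(":", 1)[-1]  # "DOID:5844" -> "5844"
        if (ac, do_id) in seen:
            continue
        seen.add((ac, do_id))
        xrefs = doid_to_xrefs.get(do_id, {})
        rows.append({
            "uniprotkb_canonical_ac": ac,
            "xref_key": "protein_xref_genome_alliance",
            "xref_id": do_id,
            "do_id": do_id,
            "mondo_id": xrefs.get("mondo_id", ""),
            "mim_id": xrefs.get("mim_id", ""),
        })
    return rows


# "GENE:1" and "GENE:2" are hypothetical identifiers used only for this demo.
rows = build_rows(
    associations=[("GENE:1", "DOID:5844"), ("GENE:1", "DOID:5844"), ("GENE:2", "DOID:1824")],
    gene_to_canonical={"GENE:1": "P51638-1"},
    doid_to_xrefs={"5844": {"mondo_id": "0012058", "mim_id": "608557"}},
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=OUTPUT_FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Genes missing from the lookup are dropped silently here; logging them instead may be preferable when checking mapping coverage.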

Step 3: Can you do an overlap analysis to check whether the DOIDs present in the source files are part of the GlyGen DOID collection in protein_disease_idmap?
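One way to run this comparison, as a minimal sketch: treat each side as a set of DOID strings and split them into three groups. The function name and the sample IDs are illustrative, and it assumes both inputs use the same ID form (for example bare numbers without the `DOID:` prefix).

```python
def doid_overlap(alliance_doids, idmap_doids):
    """Split two DOID collections into alliance-only, shared, and idmap-only."""
    # Strip whitespace and drop empty entries so blank lines do not count as IDs.
    alliance = {d.strip() for d in alliance_doids if d.strip()}
    idmap = {d.strip() for d in idmap_doids if d.strip()}
    return {
        "alliance_only": sorted(alliance - idmap),
        "both": sorted(alliance & idmap),
        "idmap_only": sorted(idmap - alliance),
    }


# Illustrative input only; real input would come from the source files
# and from protein_disease_idmap.
result = doid_overlap(["1824", "5844", "10763", ""], ["5844", "684"])
for name, ids in result.items():
    print(name, len(ids), ids)
```

Each of the three lists can be written to its own text file, one ID per line, for later inspection.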

FYI @JingyueWu

rykahsay commented 10 months ago

@jeet-vora --> you are not following download folder rules (current should be after resource and you should set the right permissions)

jeet-vora commented 10 months ago

@rykahsay I will have to talk to you on Monday about the folder structure. The alliancegenome download needs two folders: one for orthology and one for disease. The current folder was inside the disease folder, as mentioned in the ticket, with 775 permissions.

rykahsay commented 10 months ago

Done, check files:

ls -ltr unreviewed/*protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2571444 Nov 30 14:43 unreviewed/mouse_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2469773 Nov 30 14:43 unreviewed/rat_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 1699393 Nov 30 14:43 unreviewed/fruitfly_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen  625938 Nov 30 14:43 unreviewed/yeast_protein_disease_alliance_genome.csv

Overlap analysis:

wc /tmp/doid-in-*
 1475  1475  8467 /tmp/doid-in-alliance-genome-only.txt
 4566  4566 34920 /tmp/doid-in-both.txt
  722   721  5501 /tmp/doid-in-idmap-only.txt

jeet-vora commented 9 months ago

@rykahsay

Please provide the full path to the files matching /tmp/doid-in-*. There are many tmp folders spread across the system.

rykahsay commented 9 months ago

$ wc /data/projects/glygen/generated/misc/doid-in-*
 1475  1475  8467 generated/misc/doid-in-alliance-genome-only.txt
 4566  4566 34920 generated/misc/doid-in-both.txt
  722   721  5501 generated/misc/doid-in-idmap-only.txt

jeet-vora commented 9 months ago

@rykahsay

I have detected an issue with the DO source file. See below

@JingyueWu @Luke-Johnson-5

rykahsay commented 9 months ago

I have updated protein_disease_idmap and protein_disease_names. Here is the workflow for how the DO download is processed and used (you may want to document this):

Step-1: Convert downloaded doid.owl to disease.nt format
/data/projects/glygen/downloads/do/current/doid.owl --> /data/projects/glygen/generated/sparql/disease/disease.nt

Step-2: Load disease.nt to triple store (just like nt files from EBI are loaded to triple store)
/data/projects/glygen/generated/sparql/disease/disease.nt --> Virtuoso triple store database

Step-3: Use triples in Virtuoso to make protein_disease_idmap and protein_disease_names datasets
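To illustrate what Step-3 extracts, here is a small stand-in that reads N-Triples lines directly instead of querying Virtuoso. It is only a sketch, not the actual GlyGen workflow. It assumes the xrefs are stored as `oboInOwl:hasDbXref` literals of the form `PREFIX:ID` on `DOID_` class IRIs. The function name is hypothetical, and the sample triples are hand-written for the demo rather than copied from disease.nt.

```python
import re

HAS_DBXREF = "<http://www.geneontology.org/formats/oboInOwl#hasDbXref>"
DOID_IRI = re.compile(r"^<http://purl\.obolibrary\.org/obo/DOID_(\d+)>$")
# subject, predicate, then a quoted literal object (anything after it is ignored)
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+"([^"]*)"')


def collect_xrefs(nt_lines):
    """Map each DOID number to {xref prefix: [ids]} from hasDbXref triples."""
    xrefs = {}
    for line in nt_lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # not a triple with a literal object
        subject, predicate, literal = match.groups()
        doid = DOID_IRI.match(subject)
        if predicate != HAS_DBXREF or doid is None:
            continue
        prefix, _, local_id = literal.partition(":")
        xrefs.setdefault(doid.group(1), {}).setdefault(prefix, []).append(local_id)
    return xrefs


# Hand-written demo triples; the label text is a placeholder.
sample_nt = [
    '<http://purl.obolibrary.org/obo/DOID_5844> '
    '<http://www.geneontology.org/formats/oboInOwl#hasDbXref> "MIM:608557" .',
    '<http://purl.obolibrary.org/obo/DOID_5844> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "placeholder label" .',
]

doid_xrefs = collect_xrefs(sample_nt)
print(doid_xrefs)
```

In the real workflow the equivalent selection would presumably be a SPARQL query against the triple store; the regular expression above is not a full N-Triples parser and ignores escaped quotes inside literals.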