Process Alliance of Genome Resources Disease dataset

jeet-vora commented 11 months ago

Process this after processing EBI files (#840)

The previous Alliance of Genome Resources Disease dataset covered mouse and rat only, and the file had no info header. The previous file was csv; the new one is tsv. Two additional fields, ExperimentalCondition | Modifier, have been added to the source file, but the overall format and content remain the same.

The new files cover rat, mouse, fruitfly and yeast, all in tsv. Changes have been made in dataset_masterlist.json.

Processing instructions

Source file folder path - /data/projects/glygen/downloads/alliancegenome/disease/current (the disease folder has been created to distinguish between disease and orthology)

Source input files - *_disease_genome_alliance.tsv

Other input files - mouse_protein_xref_mgi.csv | rat_protein_xref_rgd.csv | yeast_protein_xref_sgd.csv | dicty_protein_xref_dictybase.csv

Output files - *_disease_genome_alliance.csv

Step 1: Remove the header from tsv

##########################################################################
--
#
# Data type: Disease
# Data format: tsv
# README:
# Source: Alliance of Genome Resources   (Alliance)
# Source URL:   http://alliancegenome.org/downloads
# Help Desk: help@alliancegenome.org
# Orthology Filter: Stringent
# Taxon IDs: NCBITaxon:10090
# Species: Mus musculus
# Alliance Database Version: 6.0.0
# Date file generated (UTC): 2023-10-04   21:17
#
##########################################################################
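A minimal sketch of Step 1, assuming every info-header line starts with `#` and that the first remaining line holds the column names. The function name and the sample column names (`GeneID`, `DiseaseID`) are placeholders for illustration, not the actual source columns.

```python
import csv
import io


def read_alliance_tsv(text):
    """Return (column_names, rows) from TSV text, skipping '#' header lines.

    Assumption: info-header lines all begin with '#'. Blank lines are dropped.
    """
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    rows = list(csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t"))
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Hypothetical two-column sample; the real file has more columns.
sample = (
    "##########\n"
    "# Data type: Disease\n"
    "# Data format: tsv\n"
    "##########\n"
    "GeneID\tDiseaseID\n"
    "GENE:1\tDOID:1824\n"
)
columns, records = read_alliance_tsv(sample)
```

If the header contains lines that do not start with `#`, they would need to be filtered the same way.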

Step 2: Process the dataset as Rahi described before to create output like the example below

uniprotkb_canonical_ac	xref_key	xref_id	do_id	mondo_id	mim_id
P43245-1	protein_xref_genome_alliance	1824	1824
A0A0G2JVD3-1	protein_xref_genome_alliance	10763	10763
Q62666-1	protein_xref_genome_alliance	365	365
G3V9R2-1	protein_xref_genome_alliance	670	670
P51638-1	protein_xref_genome_alliance	5844	5844	0012058	608557
P11167-1	protein_xref_genome_alliance	3525	3525
P14844-1	protein_xref_genome_alliance	9477	9477
P98106-1	protein_xref_genome_alliance	1936	1936
A7VJC2-1	protein_xref_genome_alliance	684	684	0007256	114550
A0A8I6G721-1	protein_xref_genome_alliance	11476	11476	0005298	166710
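The exact processing is not spelled out in this ticket, so the following is only a rough sketch of how rows shaped like the example above could be assembled. It assumes the source rows have been reduced to (gene ID, DOID) pairs and that the `*_protein_xref_*` files give a gene ID to canonical accession mapping. `build_xref_rows` and the `GENE:n` IDs are hypothetical, and `mondo_id` / `mim_id` are left empty because this sketch has no source for them.

```python
def build_xref_rows(disease_pairs, gene_to_canon):
    """Turn (gene_id, doid) pairs into output rows.

    disease_pairs: iterable of (gene_id, doid) such as ("GENE:1", "DOID:1824")
    gene_to_canon: dict mapping gene_id -> uniprotkb_canonical_ac
    Genes with no mapped accession are skipped; duplicate rows are dropped.
    """
    seen = set()
    rows = []
    for gene_id, doid in disease_pairs:
        canon = gene_to_canon.get(gene_id)
        if canon is None:
            continue
        do_id = doid.split(":")[-1]  # "DOID:1824" -> "1824"
        if (canon, do_id) in seen:
            continue
        seen.add((canon, do_id))
        # Column order follows the example header:
        # uniprotkb_canonical_ac, xref_key, xref_id, do_id, mondo_id, mim_id
        rows.append([canon, "protein_xref_genome_alliance", do_id, do_id, "", ""])
    return rows


pairs = [("GENE:1", "DOID:1824"), ("GENE:1", "DOID:1824"), ("GENE:9", "DOID:365")]
mapping = {"GENE:1": "P43245-1"}
out_rows = build_xref_rows(pairs, mapping)
```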

Step 3: Can you do an overlap analysis to see whether the DOIDs present in the source files are part of the GlyGen DOID collection in protein_disease_idmap?

FYI @JingyueWu

rykahsay commented 10 months ago

@jeet-vora --> you are not following download folder rules (current should be after resource and you should set the right permissions)

jeet-vora commented 10 months ago

@rykahsay I will have to talk to you on Monday about the folder structure. The Alliance Genome downloads need two folders, one for orthology and the other for disease. The current folder was inside the disease folder, as mentioned in the ticket, with 775 permissions.

rykahsay commented 10 months ago

Done, check files:

ls -ltr unreviewed/*protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2571444 Nov 30 14:43 unreviewed/mouse_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2469773 Nov 30 14:43 unreviewed/rat_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 1699393 Nov 30 14:43 unreviewed/fruitfly_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen  625938 Nov 30 14:43 unreviewed/yeast_protein_disease_alliance_genome.csv

Overlap analysis:

wc /tmp/doid-in-*
 1475  1475  8467 /tmp/doid-in-alliance-genome-only.txt
 4566  4566 34920 /tmp/doid-in-both.txt
  722   721  5501 /tmp/doid-in-idmap-only.txt
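How these three lists were produced is not shown in the thread. A minimal sketch of such a comparison, assuming each input is a list of one DOID per line, could look like this (`load_ids` and `overlap` are illustrative names). Blank lines are dropped on load, which matters here: a line count of 722 against a word count of 721 suggests one of the idmap-only lines is empty.

```python
def load_ids(lines):
    """Return the set of non-empty, whitespace-stripped IDs."""
    return {line.strip() for line in lines if line.strip()}


def overlap(source_ids, idmap_ids):
    """Split two ID sets into source-only, shared, and idmap-only lists."""
    return {
        "alliance_genome_only": sorted(source_ids - idmap_ids),
        "both": sorted(source_ids & idmap_ids),
        "idmap_only": sorted(idmap_ids - source_ids),
    }


# Toy input; real input would be the do_id column of the output files
# and the DOIDs in protein_disease_idmap.
source = load_ids(["1824\n", "365\n", "\n"])
idmap = load_ids(["365\n", "670\n"])
result = overlap(source, idmap)
```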

jeet-vora commented 9 months ago

@rykahsay

Please provide the full path to the /tmp/doid-in-* files. There are many tmp folders spread across the server.

rykahsay commented 9 months ago

$ wc /data/projects/glygen/generated/misc/doid-in-*
 1475  1475  8467 generated/misc/doid-in-alliance-genome-only.txt
 4566  4566 34920 generated/misc/doid-in-both.txt
  722   721  5501 generated/misc/doid-in-idmap-only.txt

jeet-vora commented 9 months ago

@rykahsay

I have detected an issue with the DO source file. See below

The file /data/projects/glygen/downloads/do/2023_10_10/doid.owl from the previous version was not downloaded correctly. I have downloaded a new version and marked it as current. Please reprocess.
However, as per usage, doid.owl is currently not used by protein_disease_idmap or by protein_disease_names, although it should be used by both. Can you look into it and let me know whether any other file uses this DO file?
I think we can do the overlap analysis once this issue is resolved.

@JingyueWu @Luke-Johnson-5

rykahsay commented 9 months ago

I have updated protein_disease_idmap and protein_disease_names. Here is the workflow (you may need to document this) for how the DO download is processed and used:

Step-1: Convert downloaded doid.owl to disease.nt format
/data/projects/glygen/downloads/do/current/doid.owl --> /data/projects/glygen/generated/sparql/disease/disease.nt

Step-2: Load disease.nt to triple store (just like nt files from EBI are loaded to triple store)
 /data/projects/glygen/generated/sparql/disease/disease.nt --> Virtuoso triple store database

Step-3: Use triples in Virtuoso to make protein_disease_idmap and protein_disease_names datasets
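Since the earlier doid.owl download was broken, a quick sanity check on disease.nt after Step-1 may be useful before loading it into the triple store. The sketch below only counts distinct DOID subjects in N-Triples lines. It assumes disease.nt keeps the standard OBO PURL form `http://purl.obolibrary.org/obo/DOID_<number>` for disease terms, which I have not verified against the generated file; `doids_in_ntriples` is an illustrative name.

```python
import re

# Assumption: disease terms appear as subjects with the OBO PURL pattern.
DOID_SUBJECT = re.compile(r"<http://purl\.obolibrary\.org/obo/DOID_(\d+)>")


def doids_in_ntriples(lines):
    """Collect the numeric DOIDs that occur as triple subjects."""
    ids = set()
    for line in lines:
        match = DOID_SUBJECT.match(line.lstrip())
        if match:
            ids.add(match.group(1))
    return ids


# Toy triples for illustration only.
sample_nt = [
    '<http://purl.obolibrary.org/obo/DOID_1824> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "example label" .',
    '<http://purl.obolibrary.org/obo/DOID_1824> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://www.w3.org/2002/07/owl#Class> .',
    '<http://example.org/other> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "not a DOID subject" .',
]
found = doids_in_ntriples(sample_nt)
```

A count that is far lower than the previous release would be a hint that the download or conversion went wrong again.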

glygener / glygen-issues

Process Alliance of Genome Resources Disease dataset #853