Open jeet-vora opened 11 months ago
@jeet-vora --> you are not following download folder rules (current should be after resource and you should set the right permissions)
@rykahsay I will have to talk to you on Monday about the folder structure. The alliance genome resource needs two folders, one for orthology and the other for disease. The current folder was inside the disease folder, as mentioned in the ticket, with 775 permissions.
Done, check files:
ls -ltr unreviewed/*protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2571444 Nov 30 14:43 unreviewed/mouse_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 2469773 Nov 30 14:43 unreviewed/rat_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 1699393 Nov 30 14:43 unreviewed/fruitfly_protein_disease_alliance_genome.csv
-rw-r--r--. 1 rykahsay glygen 625938 Nov 30 14:43 unreviewed/yeast_protein_disease_alliance_genome.csv
Overlap analysis:
wc /tmp/doid-in-*
1475 1475 8467 /tmp/doid-in-alliance-genome-only.txt
4566 4566 34920 /tmp/doid-in-both.txt
722 721 5501 /tmp/doid-in-idmap-only.txt
@rykahsay
Please provide the full path to the /tmp/doid-in-* files. There are many tmp folders spread around.
$ wc /data/projects/glygen/generated/misc/doid-in-*
1475 1475 8467 generated/misc/doid-in-alliance-genome-only.txt
4566 4566 34920 generated/misc/doid-in-both.txt
722 721 5501 generated/misc/doid-in-idmap-only.txt
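The three files counted above amount to a plain set comparison between the DOIDs found in the Alliance files and those in protein_disease_idmap. A minimal Python sketch of that comparison, assuming each side has already been reduced to a list of DOID strings; the function name `doid_overlap` and the sample IDs are hypothetical and not taken from the actual pipeline:

```python
def doid_overlap(alliance_ids, idmap_ids):
    """Split DOIDs into alliance-only, shared, and idmap-only groups."""
    # Strip whitespace and drop empty entries before comparing.
    alliance = {x.strip() for x in alliance_ids if x.strip()}
    idmap = {x.strip() for x in idmap_ids if x.strip()}
    return {
        "alliance-genome-only": sorted(alliance - idmap),
        "both": sorted(alliance & idmap),
        "idmap-only": sorted(idmap - alliance),
    }


if __name__ == "__main__":
    # Made-up sample IDs, for illustration only.
    groups = doid_overlap(
        ["DOID:1", "DOID:2", "DOID:3"],
        ["DOID:2", "DOID:3", "DOID:4"],
    )
    for name, ids in groups.items():
        print(name, len(ids))
```

Writing each group to its own one-ID-per-line file would give output comparable to the doid-in-* files listed above.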
@rykahsay
I have detected an issue with the DO source file. See below:
doid.owl from the previous version was not getting downloaded correctly. I have downloaded a new version and have marked it as current. Please reprocess.
doid.owl is currently not being used by protein_disease_idmap nor by protein_disease_names as per usage. Can you look into it and let me know if any other file is using this DO file? It should be used by the above files.
@JingyueWu @Luke-Johnson-5
I have updated protein_disease_idmap and protein_disease_names. Here is the workflow (you may need to document this) for how the DO download is processed and used:
Step-1: Convert downloaded doid.owl to disease.nt format
/data/projects/glygen/downloads/do/current/doid.owl --> /data/projects/glygen/generated/sparql/disease/disease.nt
Step-2: Load disease.nt to triple store (just like nt files from EBI are loaded to triple store)
/data/projects/glygen/generated/sparql/disease/disease.nt --> Virtuoso triple store database
Step-3: Use triples in Virtuoso to make protein_disease_idmap and protein_disease_names datasets
The previous Alliance of Genome Resources disease datasets were for mouse and rat only, and they came without an info header. The previous files were csv and now they are tsv. Also, two additional fields, ExperimentalCondition | Modifier, have been added to the source file, but the overall format and content remain the same. The new files are for rat, mouse, fruitfly and yeast in tsv. Changes have been made in dataset_masterlist.json.
Processing instructions
Source file folder path - /data/projects/glygen/downloads/alliancegenome/disease/current (disease folder has been created to distinguish between disease and orthology)
Source input files - *_disease_genome_alliance.tsv
Other input files - mouse_protein_xref_mgi.csv | rat_protein_xref_rgd.csv | yeast_protein_xref_sgd.csv | dicty_protein_xref_dictybase.csv
Output files - *_disease_genome_alliance.csv
Step 1: Remove the header from tsv
Step 2: Process the dataset as Rahi mentioned before to create output like below
Step 3: Can you do an overlap analysis to see whether the DOIDs present in the source files are part of the GlyGen DOID collection in protein_disease_idmap?
FYI @JingyueWu
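Step 1 above, together with the tsv-to-csv change, could be sketched roughly as follows in Python. This assumes the info header consists of lines starting with `#`, which is not confirmed in this thread. The function names and the sample columns are hypothetical, and the xref mapping from Step 2 is not covered here:

```python
import csv
import io


def strip_info_header(lines, comment_prefix="#"):
    """Yield lines that are not blank and do not start with comment_prefix.

    Assumption: info-header lines are marked with comment_prefix.
    """
    for line in lines:
        if line.startswith(comment_prefix) or not line.strip():
            continue
        yield line


def tsv_to_csv(tsv_text):
    """Return fully quoted CSV text for the rows left after header removal."""
    rows = csv.reader(strip_info_header(io.StringIO(tsv_text)), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample content, for illustration only.
    sample = "# info header line\nTaxon\tDOID\nNCBITaxon:10090\tDOID:1\n"
    print(tsv_to_csv(sample), end="")
```

For the real files, the same functions could be fed the contents of each *_disease_genome_alliance.tsv in turn. Quoting every field is a choice made for this sketch, not a documented requirement.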