Closed cmalangone closed 3 years ago
The new list of commands is:
echo '"ensembl_id","hgnc_approved_symbol","uniprot_accessions","number_of_associations"' > 20.02_target_list.csv
cat 20.02.1-search-data.json | jq -r 'select(.type=="target") | [.id, .approved_symbol, [.uniprot_accessions | join("|")][], .association_counts.total] | @csv' >> 20.02_target_list.csv
cat 20.02.1-search-data.json | jq -r 'select(.type=="target") | {"ensembl_id": .id, "hgnc_approved_symbol": .approved_symbol, "uniprot_accessions": .uniprot_accessions, "number_of_associations": .association_counts.total}' > 20.02_target_list.json
echo '"efo_id","disease_full_name","number_of_associations"' > 20.02_disease_list.csv
cat 20.02.1-search-data.json | jq -r 'select(.type=="disease") | [.id, .full_name, .association_counts.total] | @csv' >> 20.02_disease_list.csv
cat 20.02.1-search-data.json | jq -c 'select(.type=="disease") | {"efo_id": .id, "disease_full_name": .full_name, "number_of_associations": .association_counts.total}' > 20.02_disease_list.json
echo '"ensembl_id","hgnc_approved_symbol","uniprot_accessions","number_of_associations"' > 20.09_target_list.csv
cat 20.09_search-data.json | jq -r 'select(.type=="target") | [.id, .approved_symbol, [.uniprot_accessions | join("|")][], .association_counts.total] | @csv' >> 20.09_target_list.csv
cat 20.09_search-data.json | jq -r 'select(.type=="target") | {"ensembl_id": .id, "hgnc_approved_symbol": .approved_symbol, "uniprot_accessions": .uniprot_accessions, "number_of_associations": .association_counts.total}' > 20.09_target_list.json
echo '"efo_id","disease_full_name","number_of_associations"' > 20.09_disease_list.csv
cat 20.09_search-data.json | jq -r 'select(.type=="disease") | [.id, .full_name, .association_counts.total] | @csv' >> 20.09_disease_list.csv
cat 20.09_search-data.json | jq -c 'select(.type=="disease") | {"efo_id": .id, "disease_full_name": .full_name, "number_of_associations": .association_counts.total}' > 20.09_disease_list.json
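To see what these filters emit, here is a minimal sketch that runs them on a two-record sample. The sample records, their IDs, and the helper names (sample_records, target_csv_rows, disease_json_rows) are hypothetical; only the jq filters come from the commands in this ticket. The target filter writes the join as (.uniprot_accessions | join("|")), which should be equivalent to the [ ... ][] form used in the commands.

```shell
# Hypothetical sample in the JSON-lines shape the commands assume:
# one search-data record per line, with a "type" field.
sample_records() {
  cat <<'EOF'
{"type":"target","id":"ENSG_EXAMPLE","approved_symbol":"GENE1","uniprot_accessions":["P00001","Q00002"],"association_counts":{"total":12}}
{"type":"disease","id":"EFO_EXAMPLE","full_name":"example disease","association_counts":{"total":3}}
EOF
}

# Target CSV filter: keep targets, join the accessions with "|",
# and emit one CSV row per record.
target_csv_rows() {
  jq -r 'select(.type=="target") | [.id, .approved_symbol, (.uniprot_accessions | join("|")), .association_counts.total] | @csv'
}

# Disease JSON filter: one compact object per disease record.
disease_json_rows() {
  jq -c 'select(.type=="disease") | {"efo_id": .id, "disease_full_name": .full_name, "number_of_associations": .association_counts.total}'
}

sample_records | target_csv_rows
# "ENSG_EXAMPLE","GENE1","P00001|Q00002",12
sample_records | disease_json_rows
# {"efo_id":"EFO_EXAMPLE","disease_full_name":"example disease","number_of_associations":3}
```

Note that @csv quotes strings but leaves numbers unquoted, which is why the association count appears bare in the CSV row.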
Gzip the output files. Copy them to the proper Google Storage (GS) location. Change the header of the files in Google Storage.
echo '"ensembl_id","hgnc_approved_symbol","uniprot_accessions","number_of_associations"' > 20.11_target_list.csv
cat 20.11_search-data.json | jq -r 'select(.type=="target") | [.id, .approved_symbol, [.uniprot_accessions | join("|")][], .association_counts.total] | @csv' >> 20.11_target_list.csv
cat 20.11_search-data.json | jq -r 'select(.type=="target") | {"ensembl_id": .id, "hgnc_approved_symbol": .approved_symbol, "uniprot_accessions": .uniprot_accessions, "number_of_associations": .association_counts.total}' > 20.11_target_list.json
echo '"efo_id","disease_full_name","number_of_associations"' > 20.11_disease_list.csv
cat 20.11_search-data.json | jq -r 'select(.type=="disease") | [.id, .full_name, .association_counts.total] | @csv' >> 20.11_disease_list.csv
cat 20.11_search-data.json | jq -c 'select(.type=="disease") | {"efo_id": .id, "disease_full_name": .full_name, "number_of_associations": .association_counts.total}' > 20.11_disease_list.json
Gzip the files and change the header, e.g.:
gsutil setmeta -h "Content-Type:application/x-gzip" gs://open-targets-data-releases/20.11/output/20.11_target_list.json.gz
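The gzip, copy, and header steps could be wrapped in a loop over the four files. This is only a sketch: the function name publish_lists and the DRY_RUN switch are hypothetical, and it assumes all four files go to the same output folder as in the example above. By default it only prints the commands so they can be reviewed first.

```shell
# Sketch: compress the four list files of one release, copy them to the
# release bucket, and set the Content-Type header on each uploaded file.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
publish_lists() {
  local release="$1"
  local bucket="gs://open-targets-data-releases/${release}/output"
  local run=""
  if [ "${DRY_RUN:-1}" = "1" ]; then
    run="echo"
  fi
  local kind ext file
  for kind in target disease; do
    for ext in csv json; do
      file="${release}_${kind}_list.${ext}"
      $run gzip "$file"
      $run gsutil cp "${file}.gz" "${bucket}/"
      $run gsutil setmeta -h "Content-Type:application/x-gzip" "${bucket}/${file}.gz"
    done
  done
}
```

For example, `publish_lists 20.11` prints the twelve commands for that release, and `DRY_RUN=0 publish_lists 20.11` would execute them, assuming gsutil is installed and authenticated.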
Please keep this ticket open. The new pipeline has to manage the creation of these files.
Tagged for 21.02 release
21.02 will still generate these files with this manual process.
This functionality won't be necessary in the new pipeline, as we will make all ETL outputs accessible. We will do it manually one more time for 21.02.
Closing this issue, as no action is expected on "automatic creation" for data_pipeline/Angular.
The dumps for 21.02 are available here:
https://storage.googleapis.com/open-targets-data-releases/21.02/output/21.02_target_list.json.gz
https://storage.googleapis.com/open-targets-data-releases/21.02/output/21.02_target_list.csv.gz
https://storage.googleapis.com/open-targets-data-releases/21.02/output/21.02_disease_list.json.gz
https://storage.googleapis.com/open-targets-data-releases/21.02/output/21.02_disease_list.csv.gz
Not relevant for the rewrite.
The ticket platform/issues/657 explains how to create the dumps for the lists of diseases and targets.
The commands have to be integrated into the platform-infrastructure script (run.sh).
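One way the integration might look is a single function parameterised by release, so the commands no longer need to be copied per release. This is a sketch only: the function name make_lists, its arguments, and the output directory handling are hypothetical and not taken from run.sh. The jq filters and flags are the ones from the manual commands, with the accession join written as (.uniprot_accessions | join("|")).

```shell
# Sketch of a helper that run.sh could call (hypothetical name and arguments).
# Usage: make_lists <release> <search-data.json> [output-dir]
make_lists() {
  local release="$1" input="$2" outdir="${3:-.}"
  local prefix="${outdir}/${release}"

  # Target list: CSV header, CSV rows, then the JSON objects.
  echo '"ensembl_id","hgnc_approved_symbol","uniprot_accessions","number_of_associations"' \
    > "${prefix}_target_list.csv"
  jq -r 'select(.type=="target") | [.id, .approved_symbol, (.uniprot_accessions | join("|")), .association_counts.total] | @csv' \
    "$input" >> "${prefix}_target_list.csv"
  jq -r 'select(.type=="target") | {"ensembl_id": .id, "hgnc_approved_symbol": .approved_symbol, "uniprot_accessions": .uniprot_accessions, "number_of_associations": .association_counts.total}' \
    "$input" > "${prefix}_target_list.json"

  # Disease list: CSV header, CSV rows, then one compact JSON object per line.
  echo '"efo_id","disease_full_name","number_of_associations"' \
    > "${prefix}_disease_list.csv"
  jq -r 'select(.type=="disease") | [.id, .full_name, .association_counts.total] | @csv' \
    "$input" >> "${prefix}_disease_list.csv"
  jq -c 'select(.type=="disease") | {"efo_id": .id, "disease_full_name": .full_name, "number_of_associations": .association_counts.total}' \
    "$input" > "${prefix}_disease_list.json"
}
```

A call such as `make_lists 21.02 21.02_search-data.json` would then write the four list files for that release into the current directory; the input file name here is an assumption based on the naming used for earlier releases.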