nih-cfde / icc-eval-core

(WIP) Tools for collecting and reporting CFDE metrics
https://cfde-eval.netlify.app/

Data assets gathered by DRC #12

Open seandavi opened 2 months ago

seandavi commented 2 months ago

This issue just captures the information shared by the DRC with us via email to help@cfde.cloud. These links were a response to my asking for API access to the information on https://info.cfde.cloud and https://data.cfde.cloud.

Currently, each file has at most a couple hundred rows. I think these correspond to the links in this Data Matrix View.

current dcc assets format

| link | lastmodified | current | creator | dcc_id | drcapproved | dccapproved | deleted | created |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| https://github.com/nih-cfde/LINCS-metadata/blob/main/scripts/process_mcf0a_c2m2.py | 2024-02-20 20:16:23.088 | False | stephanieolaiya@gmail.com | f3f490cf-fd69-579c-8ea3-472c7cf3fb59 | False | False | True | 2024-02-20 20:16:23.088 |
| https://github.com/nih-cfde/LINCS-metadata/blob/main/scripts/process_mf10a_c2m2.py | 2024-02-20 20:19:52.739 | False | stephanieolaiya@gmail.com | f3f490cf-fd69-579c-8ea3-472c7cf3fb59 | False | False | True | 2024-02-20 20:19:52.739 |
| https://cfde-drc.s3.amazonaws.com/SPARC/KG Assertions/2024-05-08/SPARC.zip | 2024-05-08 15:47:03.429 | True | sxie04@gmail.com | 2399794e-74c6-5735-a039-0782cdeeb1e2 | True | True | False | 2024-05-08 15:47:03.429 |
| https://zinc15.docking.org/substances/search/?q={drug.label} | 2024-03-08 00:15:43.332 | True | sxie04@gmail.com | a1289ebb-0306-59a1-b0fc-e4d03a4790d7 | True | True | False | 2024-03-08 00:15:43.332 |
| https://github.com/nih-cfde/LINCS-metadata/blob/ain/scripts/process_mcf10a_c2m2.py | 2024-02-21 22:08:41.275 | False | stephanieolaiya@gmail.com | f3f490cf-fd69-579c-8ea3-472c7cf3fb59 | False | False | True | 2024-02-21 22:08:41.275 |
| https://www.gtexportal.org/home/gene/{gene.ensembl} | 2024-03-08 00:18:20.759 | True | sxie04@gmail.com | b3028db2-209c-5862-8f4d-33c5b312332e | True | False | False | 2024-03-08 00:18:20.759 |
| https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/tkg 2.0.zip | 2024-03-01 05:45:05.879 | True | jiaweixu@utexas.edu | 75b3be39-a021-5d80-b7e2-2a7938a1e11a | False | True | False | 2024-03-01 05:45:05.879 |
| https://drugcentral.org/?q={drug.label} | 2024-03-08 00:15:16.867 | True | sxie04@gmail.com | a1289ebb-0306-59a1-b0fc-e4d03a4790d7 | True | True | False | 2024-03-08 00:15:16.867 |
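
As a rough illustration of how these rows could be loaded, here is a minimal sketch. It assumes the dcc assets arrive as tab-separated text with the header row shown above, that booleans are spelled `True`/`False`, and that timestamps always carry fractional seconds. The names `DccAsset`, `read_dcc_assets`, and `live_assets` are hypothetical and do not come from the repo.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DccAsset:
    """One row of the dcc assets export (field names follow the header row)."""
    link: str
    lastmodified: datetime
    current: bool
    creator: str
    dcc_id: str
    drcapproved: bool
    dccapproved: bool
    deleted: bool
    created: datetime


def parse_bool(value: str) -> bool:
    # Assumption: booleans are the literal strings "True" / "False".
    return value.strip() == "True"


def parse_timestamp(value: str) -> datetime:
    # Assumption: timestamps look like "2024-02-20 20:16:23.088".
    return datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S.%f")


def read_dcc_assets(text: str) -> list[DccAsset]:
    """Parse tab-separated text (header row included) into DccAsset records."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    assets = []
    for row in reader:
        assets.append(DccAsset(
            link=row["link"],
            lastmodified=parse_timestamp(row["lastmodified"]),
            current=parse_bool(row["current"]),
            creator=row["creator"],
            dcc_id=row["dcc_id"],
            drcapproved=parse_bool(row["drcapproved"]),
            dccapproved=parse_bool(row["dccapproved"]),
            deleted=parse_bool(row["deleted"]),
            created=parse_timestamp(row["created"]),
        ))
    return assets


def live_assets(assets: list[DccAsset]) -> list[DccAsset]:
    """Keep only rows flagged current and not deleted."""
    return [a for a in assets if a.current and not a.deleted]
```

In the eight sample rows above, `current` is True exactly when `deleted` is False, so either flag alone would select the same five rows. Whether that holds across the full export is not something this sample can confirm.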

current code assets format

| type | name | link | description | openAPISpec | smartAPISpec | smartAPIURL | entityPageExample |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apps URL | SenNet Data Portal | https://data.sennetconsortium.org/ | | False | False | | |
| API | HuBMAP Entity API | https://entity.api.hubmapconsortium.org/ | The HuBMAP Entity API is a standard RESTful web service with create, update and read operations for the standard HuBMAP provenance graph entities. | False | True | https://smart-api.info/ui/0065e419668f3336a40d1f5ab89c6ba3 | |
| API | SenNet Search API | https://search.api.sennetconsortium.org | The SenNet Search API is a thin wrapper of the Elasticsearch API. It handles data indexing and reindexing into the backend Elasticsearch. It also accepts the search query and passes through to the Elasticsearch with data access security check. | False | True | https://smart-api.info/ui/10ed9b5eb8ff960d4431befc591ed842 | |
| API | SenNet Entity API | https://entity.api.sennetconsortium.org | The SenNet Entity API is a standard RESTful web service with create, update and read operations for the standard SenNet provenance graph entities. | False | True | https://smart-api.info/ui/7d838c9dee0caa2f8fe57173282c5812 | |
| Apps URL | SenNet Exploration User Interface | https://data.sennetconsortium.org/ccf-eui | | False | False | | |
| Apps URL | LINCS Tools Marketplace | https://lincsproject.org/LINCS/tools | The LINCS Tools Marketplace page serves a listing of applications produced using LINCS datasets by the LINCS consortium. | False | False | | |
| Apps URL | SPARC Tools and Resources | https://sparc.science/tools-and-resources/tools | SPARC Portal page listing SPARC associated tools and resources | False | False | | |
| API | exRNA Atlas JSON-LD | https://brl-bcm.stoplight.io/docs/exrna-atlas-json-api/ZG9jOjQ1Mg-overview | | True | False | | |
| API | CFDE GeneReg Linked Data Hub | https://genboree.org/cfde-gene-dev/ui/api-docs | | False | False | | |

current file assets format

| filetype | filename | link | size | sha256checksum |
| --- | --- | --- | --- | --- |
| KG Assertions | dictionary_NAR_databases_deduplicate.csv | https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/dictionary_NAR_databases_deduplicate.csv | 332431 | idxebqqDNkRDqgG4IDziJPyzOG83XfQLznIPMUGypq8= |
| KG Assertions | Papers.csv | https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/Papers.csv | 305136 | TFRyCZSHx9Y3RHZd0SXOa7T4+hEYTPlXupVFUPDZxis= |
| KG Assertions | tkg 2.0.zip | https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/tkg 2.0.zip | 303893995 | 6ZpPhxpW95YWPREcYff5qteHdGaRitMkHeuQbGDerF8= |
| XMT | MoTrPAC_Endurance_Trained_Rats_2023.gmt | https://cfde-drc.s3.amazonaws.com/MoTrPAC/XMT/2024-03-05/MoTrPAC_Endurance_Trained_Rats_2023.gmt | 187733 | GHbcupKlcgz78vtTbdkSaY3RbiPNWzLNPMD104Yhugc= |
| XMT | LINCS_XMT_2022-12-13_LINCS_L1000_Chemical_Pertubation_Consensus_Signatures.gmt | https://cfde-drc.s3.amazonaws.com/LINCS/XMT/2024-04-11/LINCS_XMT_2022-12-13_LINCS_L1000_Chemical_Pertubation_Consensus_Signatures.gmt | 16319270 | BxZ89Ja3/S7lTaA6yCv/Xhgwoh8EIy25JnF1Sk7VphY= |
| XMT | testfile.gmt | https://cfde-drc.s3.amazonaws.com/LINCS/XMT/2024-04-23/testfile.gmt | 80 | MWcavCS4IMIBHGVvlq3AVOy9Qbgd6RgfDAh3CgjCdqY= |
| C2M2 | 2024-01-04T11_45_25.136063-a7bb912c-ab20-11ee-9ed4-02402ff490c1.zip | https://cfde-drc.s3.amazonaws.com/HuBMAP/C2M2/2024-04-26/2024-01-04T11_45_25.136063-a7bb912c-ab20-11ee-9ed4-02402ff490c1.zip | 1076294222 | pmm476mNJOP4SB/kI83WaWNhb5QpBjXv3w7DFvEkuhY= |
| C2M2 | hubmap-test-submission.zip | https://cfde-drc.s3.amazonaws.com/HuBMAP/C2M2/2024-04-26/hubmap-test-submission.zip | 1076294222 | pmm476mNJOP4SB/kI83WaWNhb5QpBjXv3w7DFvEkuhY= |
| KG Assertions | GTEx.zip | https://cfde-drc.s3.amazonaws.com/GTEx/KG Assertions/2024-04-29/GTEx.zip | 149456464 | icIIQG/ikXF55yVGBU8fzfObHFBYbdRo3Vguy0CiDko= |
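
The `sha256checksum` values are 44 characters long and end in `=`, which is what a base64-encoded 32-byte SHA-256 digest looks like. Assuming that is what they are (the DRC has not confirmed the encoding here), a downloaded file could be checked against its row roughly like this. The helper names are hypothetical.

```python
import base64
import hashlib
from pathlib import Path


def sha256_base64(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks and return the digest as base64 text."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte archives are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file against the sha256checksum column value (assumed base64)."""
    return sha256_base64(path) == expected.strip()
```

The two HuBMAP C2M2 rows share both size and checksum, so grouping rows by checksum would also flag assets that are the same bytes under different names.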

vincerubinetti commented 1 month ago

In #19 I have code to download and unzip all of the listed DRC "dcc" and "file" assets (the "code" assets don't really have a concrete thing to download). Unzipped, here is the file breakdown:

  (2) .1 files
  (2) .2 files
  (2) .3 files
  (721) . files
  (350) .csv files
  (19) .zip files
  (8) .download files
  (1,422) .txt files
  (264) .json files
  (11,206) .tsv files
  (47) .gmt files
  (19) .tsv~ files
  (4) .pdf files
  (2) .docx files
  (2) .swp files
  (8) .sh files
  (2) .obo files
  (2) .gz files
  (6) .DS_Store files
  (2) .3-cfde-submission_74d840d264792e9d7ab38fcb0deeba0eddbbebdb files
  (166) .numbers files
  (2) .json_orig files
  (2) .tsv_error files
  (2) .tsv_orig files
  (12) .orig files
  (2) .uri files
  (8) .new files
  14,284 files
  124.4 GB
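
For reference, a breakdown like the one above can be produced with a short directory walk. This is a minimal sketch and not the code from #19; `extension_breakdown` and `format_breakdown` are hypothetical names. Note that `pathlib` treats dotfiles such as `.DS_Store` as having no suffix, so its buckets will not match the list above exactly.

```python
from collections import Counter
from pathlib import Path


def extension_breakdown(root: Path) -> Counter:
    """Count files under root, grouped by their final extension."""
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under a "(none)" label.
            counts[path.suffix or "(none)"] += 1
    return counts


def format_breakdown(counts: Counter) -> str:
    """Render counts as "(n) .ext files" lines, largest first, plus a total."""
    lines = [f"({n:,}) {ext} files" for ext, n in counts.most_common()]
    lines.append(f"{sum(counts.values()):,} files")
    return "\n".join(lines)
```

Calling `print(format_breakdown(extension_breakdown(Path("raw/temp"))))` would print the tally for the download directory, assuming that is where the unzipped files live.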

Also, the "dcc" and "file" lists mostly overlap. There are about 7k entries in each, and only ~600 appear in one but not the other. The rest are exact matches.
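
The overlap check can be expressed with plain set operations. This sketch assumes entries are matched on an identifying string such as a link or file path; the thread does not say which key was actually used, and `compare_entries` is a hypothetical name.

```python
def compare_entries(dcc_entries, file_entries):
    """Split two collections of entry keys into shared and one-sided groups."""
    dcc, files = set(dcc_entries), set(file_entries)
    return {
        "both": dcc & files,        # exact matches present in both lists
        "dcc_only": dcc - files,    # present only in the "dcc" list
        "file_only": files - dcc,   # present only in the "file" list
    }
```

The combined size of `dcc_only` and `file_only` is the "in one but not the other" count.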

The file contents are things like node/edge lists, compounds, samples, genes, and all sorts of other stuff. Could you give me some guidance on what high-level information I should compile from this that would be useful for program evaluation? I'm not sure what would be useful here.

Also, the data is naturally too big to store in this repo for provenance. I have the ingest process download all of it to /raw/temp, which is excluded via .gitignore.

seandavi commented 1 month ago

Thanks, @vincerubinetti, for the details. I do not foresee a reason for downloading all the data. I need to investigate the contents of the DRC download link files to see what we can and should be tracking.

vincerubinetti commented 1 month ago

I have it all downloaded on my laptop, which took a long time, so if you want to take a look at anything in particular or collect some high-level info, let me know.

Personally, looking through all of it, I don't see anything there that would necessarily be useful for the purposes of evaluating the effectiveness of certain projects. In fact, I didn't see anything to link a particular asset to a particular project. But maybe a few of them have it and I just missed it.

Perhaps if we could associate assets with projects, things like "last updated" and "number of genes/nodes/edges" could be useful? Though even that feels like measuring programmer effectiveness by lines of code written (not accurate).
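
The dcc assets table does carry a `dcc_id` column, so a per-DCC roll-up (not per-project, which is the link that seems to be missing) is at least possible. A minimal sketch, assuming rows have already been reduced to `(dcc_id, lastmodified)` pairs; `summarize_by_dcc` is a hypothetical helper:

```python
from datetime import datetime


def summarize_by_dcc(rows):
    """Map each dcc_id to (asset count, most recent lastmodified)."""
    summary = {}
    for dcc_id, modified in rows:
        # Start from (0, modified) the first time a dcc_id is seen.
        count, latest = summary.get(dcc_id, (0, modified))
        summary[dcc_id] = (count + 1, max(latest, modified))
    return summary
```

This only reports activity per DCC. It says nothing about the quality or impact of the assets.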

Nonetheless, #20 adds the infrastructure for things like downloading many large data files at once, unzipping them, etc., which we may want to do later anyway.
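
For readers who have not seen #20, the general shape of a download-and-unzip step might look like the sketch below. It is not the code from that PR, and the function names are hypothetical. One detail grounded in the tables above: several S3 links contain literal spaces ("KG Assertions", "tkg 2.0.zip"), so the path has to be percent-encoded before a request is made.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode spaces and other unsafe characters in the URL path."""
    parts = urlsplit(url)
    # Keep "/" and "%" as-is so already-encoded paths are not double-encoded.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


def download(url: str, dest: Path) -> Path:
    """Stream a remote file to disk without loading it all into memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(encode_url(url)) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def unzip(archive: Path, target: Path) -> list[Path]:
    """Extract a zip archive into target and return the extracted paths."""
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
        return [target / name for name in bundle.namelist()]
```

Template links such as `https://drugcentral.org/?q={drug.label}` are not concrete files and would need to be skipped before calling `download`.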