opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Ingesting CRISPR screen data from BioGRID Open Repository of CRISPR Screens (ORCS) #2920

Closed DSuveges closed 1 year ago

DSuveges commented 1 year ago

BioGRID distributes data from ~2k CRISPR screens under the MIT license. As of 2023.04.02, the most recent available release is 1.1.13, which was released in October 2022. Files can be downloaded from here. The compressed archive contains two file types:

Study metadata file:

Screen data:

Conclusion

Once there is a solidified plan and a data model for which screens can feed into which datasets, it will be very easy to map the study metadata to the studies. Extracting the significant hits is also trivial, given the boolean column indicating hits.

The most difficult part is to interpret studies.
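A minimal sketch of the mechanical part described above: keep only rows flagged as hits and attach the study metadata by screen identifier. The column names `SCREEN_ID`, `OFFICIAL_SYMBOL` and `HIT`, the `YES`/`NO` encoding, and all rows are assumptions made for illustration; they should be checked against the actual release files.

```python
import csv
import io

# Toy stand-ins for the two file types in the archive.
# Column names and values are assumptions, not copied from the release.
STUDY_METADATA = """SCREEN_ID\tSCREEN_TYPE\tEXPERIMENTAL_SETUP
1\tNegative Selection\tTimecourse
2\tPositive Selection\tDrug Exposure
"""

SCREEN_DATA = """SCREEN_ID\tOFFICIAL_SYMBOL\tHIT
1\tGENE_A\tYES
1\tGENE_B\tNO
2\tGENE_A\tNO
2\tGENE_C\tYES
"""


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def significant_hits(studies, screen_rows, hit_column="HIT", hit_value="YES"):
    """Keep rows flagged as hits and join them with their study metadata."""
    studies_by_id = {study["SCREEN_ID"]: study for study in studies}
    hits = []
    for row in screen_rows:
        if row[hit_column] != hit_value:
            continue  # not a significant hit
        study = studies_by_id.get(row["SCREEN_ID"])
        if study is None:
            continue  # screen without metadata: skipped in this sketch
        hits.append({**study, "OFFICIAL_SYMBOL": row["OFFICIAL_SYMBOL"]})
    return hits


hits = significant_hits(read_tsv(STUDY_METADATA), read_tsv(SCREEN_DATA))
for hit in hits:
    print(hit["SCREEN_ID"], hit["OFFICIAL_SYMBOL"], hit["SCREEN_TYPE"])
```

The join itself is simple; deciding which screen types and setups map to which evidence is the part this sketch does not address.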

1. Scoping phase

The scoping phase is done in collaboration with @buniello.

DSuveges commented 1 year ago

Some stats on the studies:

Screen types:

+-------------------------------+-----+
|SCREEN_TYPE                    |count|
+-------------------------------+-----+
|Negative Selection             |942  |
|Positive and Negative Selection|233  |
|Positive Selection             |183  |
|Phenotype Screen               |124  |
+-------------------------------+-----+

Experimental setup:

+-----------------------------------------+-----+
|EXPERIMENTAL_SETUP                       |count|
+-----------------------------------------+-----+
|Timecourse                               |1044 |
|Drug Exposure                            |271  |
|Virus Exposure                           |84   |
|Toxin Exposure                           |17   |
|Cytokine exposure                        |12   |
|Ligand Exposure                          |11   |
|Bacteria Exposure                        |9    |
|Other                                    |9    |
|NK cell exposure                         |8    |
|Implantation to Mouse Model              |5    |
|Radiation Exposure                       |4    |
|T cell exposure                          |3    |
|Oxygen Exposure                          |2    |
|Cytokine depletion                       |1    |
|Transferrin receptor (TFRC/CD71) exposure|1    |
|SARS-CoV-2 Spike-RBD exposure            |1    |
+-----------------------------------------+-----+

Setup vs screen type:

+-----------------------------------------+-------------------------------+-----+
|EXPERIMENTAL_SETUP                       |SCREEN_TYPE                    |count|
+-----------------------------------------+-------------------------------+-----+
|Timecourse                               |Negative Selection             |905  |
|Drug Exposure                            |Positive and Negative Selection|132  |
|Drug Exposure                            |Positive Selection             |103  |
|Timecourse                               |Phenotype Screen               |84   |
|Timecourse                               |Positive and Negative Selection|50   |
|Virus Exposure                           |Positive Selection             |48   |
|Drug Exposure                            |Negative Selection             |27   |
|Virus Exposure                           |Positive and Negative Selection|26   |
+-----------------------------------------+-------------------------------+-----+
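Breakdowns like the ones above could be reproduced by grouping the study metadata on one or more columns. The sketch below uses only the Python standard library and a handful of toy rows; it is not the code that generated these tables, and the real counts are the ones shown above.

```python
from collections import Counter

# Toy rows for illustration only; they do not reflect the real distribution.
studies = [
    {"SCREEN_TYPE": "Negative Selection", "EXPERIMENTAL_SETUP": "Timecourse"},
    {"SCREEN_TYPE": "Negative Selection", "EXPERIMENTAL_SETUP": "Timecourse"},
    {"SCREEN_TYPE": "Positive Selection", "EXPERIMENTAL_SETUP": "Drug Exposure"},
    {"SCREEN_TYPE": "Positive Selection", "EXPERIMENTAL_SETUP": "Virus Exposure"},
]


def count_by(rows, *columns):
    """Count rows per combination of the given columns, largest first."""
    combos = (tuple(row[column] for column in columns) for row in rows)
    return Counter(combos).most_common()


by_type = count_by(studies, "SCREEN_TYPE")
by_setup_and_type = count_by(studies, "EXPERIMENTAL_SETUP", "SCREEN_TYPE")

for combo, count in by_setup_and_type:
    print(" | ".join(combo), count)
```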

Publications

Although the screens come from 240+ publications, most of the ~2k studies come from a handful of papers:

+--------+-----+
|pubmedId|count|
+--------+-----+
|29083409|340  |
|30971826|325  |
|29526696|45   |
|33539788|36   |
|27260156|33   |
|32649862|31   |
|32990596|21   |
|30995489|18   |
|35559673|17   |
|34049503|16   |
+--------+-----+

When focusing on the top 5 publications (['29083409', '30971826', '29526696', '33539788', '27260156']), which account for roughly half of all screens, the split of the experimental design is simpler:

+-------------------------------+------------------+-----+
|SCREEN_TYPE                    |EXPERIMENTAL_SETUP|count|
+-------------------------------+------------------+-----+
|Positive and Negative Selection|Drug Exposure     |36   |
|Negative Selection             |Timecourse        |743  |
+-------------------------------+------------------+-----+
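The share of screens covered by the top five publications can be recomputed from the tables in this comment. The sketch below copies the counts from the publication and screen-type tables; it assumes the screen-type table covers every screen in the metadata file.

```python
# Counts copied from the publication table above (top 10 PubMed IDs only).
publication_counts = {
    "29083409": 340, "30971826": 325, "29526696": 45, "33539788": 36,
    "27260156": 33, "32649862": 31, "32990596": 21, "30995489": 18,
    "35559673": 17, "34049503": 16,
}

# Counts copied from the screen-type table above.
screen_type_counts = {
    "Negative Selection": 942,
    "Positive and Negative Selection": 233,
    "Positive Selection": 183,
    "Phenotype Screen": 124,
}

total_screens = sum(screen_type_counts.values())

# Five publications with the most screens.
top5 = sorted(publication_counts, key=publication_counts.get, reverse=True)[:5]
top5_screens = sum(publication_counts[pmid] for pmid in top5)
share = top5_screens / total_screens

print(top5)
print(f"{top5_screens} of {total_screens} screens ({share:.1%})")
# prints: 779 of 1482 screens (52.6%)
```

The 779 screens match the sum of the two rows in the table above (36 + 743).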

DSuveges commented 1 year ago

@buniello has done the EFO mappings (spreadsheet) for cell proliferation screens, based on the disease cell line used, e.g. Acute Myeloid Leukemia Cell Line -> EFO_0000222 (acute myeloid leukemia).

These mappings will then be collected into a single table (cell line vs EFO) that can be joined with the study table during evidence generation.
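A rough sketch of how such a cell line to EFO lookup might be joined with the study table. Only the acute myeloid leukemia mapping comes from this thread; the column names `SCREEN_ID`, `CELL_TYPE` and `EFO_ID`, the function name, and the second study row are hypothetical.

```python
# Cell line -> EFO lookup. Only this one entry is taken from the thread.
CELL_LINE_TO_EFO = {
    "Acute Myeloid Leukemia Cell Line": "EFO_0000222",
}

# Hypothetical study rows; the column names are assumptions.
studies = [
    {"SCREEN_ID": "1", "CELL_TYPE": "Acute Myeloid Leukemia Cell Line"},
    {"SCREEN_ID": "2", "CELL_TYPE": "Hypothetical Unmapped Cell Line"},
]


def attach_efo(study_rows, mapping, cell_line_column="CELL_TYPE"):
    """Split studies into those with an EFO mapping and those without."""
    mapped, unmapped = [], []
    for study in study_rows:
        efo_id = mapping.get(study[cell_line_column])
        if efo_id is None:
            unmapped.append(study)  # kept aside for manual curation
        else:
            mapped.append({**study, "EFO_ID": efo_id})
    return mapped, unmapped


mapped, unmapped = attach_efo(studies, CELL_LINE_TO_EFO)
print(mapped)
print(unmapped)
```

Returning the unmapped studies separately, instead of dropping them silently, makes it easier to see how much of the dataset the curated mappings actually cover.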

DSuveges commented 1 year ago

The heterogeneity of this dataset makes it very complicated to ingest, so the value gained is not proportional to the effort required.