opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Update PPP evidence set with the most recent encore release #3188

Closed DSuveges closed 7 months ago

DSuveges commented 8 months ago

Based on the discussion with Inigo, the most recent (Freeze4) encore dataset is fine to use for generating disease-target evidence. Colo 1, 2 and 3 are fine, and Breast 1 is fine. Breast 2 and 3 have systemic issues, so they will be ignored for now.

DSuveges commented 8 months ago

So far, the following files are good candidates for ingestion:

COLO1
-rw-r----- 1 dsuveges dsuveges 14M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 153M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO2
-rw-r----- 1 dsuveges dsuveges 18M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 176M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/COLO2_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO3
-rw-r----- 1 dsuveges dsuveges 19M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 179M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/COLO3_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
BRCA1
-rw-r----- 1 dsuveges dsuveges 9.3M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 103M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt

The BLISS files are several times larger than the lfc files. The first data row, shown one column per line, looks like this:

Gene_Pair                             ABCB1~AKT1
Note                                  LibraryCombinations
MyNote                                LibraryCombinations
Gene1                                 ABCB1
Gene2                                 AKT1
SIDM00118_CPID1137_gene1              0.09880821294868122
SIDM00118_CPID1137_gene2              0.31474548736806957
SIDM00118_CPID1137_observed           -0.3652968663906671
SIDM00118_CPID1137_observed_expected  -0.20454966921301757
SIDM00118_CPID1137_pval               0.47368816237899203
SIDM00118_CPID1137_zscore             -0.7164910978741945
SIDM00118_CPID1140_gene1              0.0225949072451517
SIDM00118_CPID1140_gene2              0.4180128892392855
SIDM00118_CPID1140_observed           0.0066079615242936275
SIDM00118_CPID1140_observed_expected  -0.1015222059259426
SIDM00118_CPID1140_pval               0.5826911584237727
SIDM00118_CPID1140_zscore             0.5494580316117462
SIDM00118_CPID1143_gene1              0.010999000562159203
SIDM00118_CPID1143_gene2              0.4064111447365547
SIDM00118_CPID1143_observed           -0.1683627331654147
SIDM00118_CPID1143_observed_expected  -0.1673361806656628
SIDM00118_CPID1143_pval               0.9638148826476801
SIDM00118_CPID1143_zscore             0.04536687633389127

It seems:

DSuveges commented 8 months ago

File extraction from the data dump:

# List of datasets to be included:
LIBRARIES=(COLO1 COLO2 COLO3 BRCA1)

# Analysis id:
ANALISIS="4GUIDES"

# Testing if all files are in the release:
for LIBRARY in "${LIBRARIES[@]}"; do
    echo "$LIBRARY"
    # lfc:
    ls -lah "${HOME}/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt"
    # BLISS:
    ls -lah "${HOME}/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/${LIBRARY}_FINAL_EXACT_SCALED_ZSCORE_annotated.txt"
done

output:

COLO1
-rw-r----- 1 dsuveges dsuveges 14M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 153M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO2
-rw-r----- 1 dsuveges dsuveges 18M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 176M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/COLO2_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO3
-rw-r----- 1 dsuveges dsuveges 19M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 179M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/COLO3_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
BRCA1
-rw-r----- 1 dsuveges dsuveges 9.3M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 103M Jun 15  2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt

Investigating the BLISS dataset (COLO1):

TEST_FILE=/home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
# Print the header next to the first data row, one column per line:
paste <(head -n1 "${TEST_FILE}" | tr "\t" "\n") \
    <(head -n2 "${TEST_FILE}" | tail -n+2 | tr "\t" "\n") \
    | column -t | head -n15

Output:

Gene_Pair                             ABCB1~AKT1
Note                                  LibraryCombinations
MyNote                                LibraryCombinations
Gene1                                 ABCB1
Gene2                                 AKT1
SIDM00118_CPID1137_gene1              0.09880821294868122
SIDM00118_CPID1137_gene2              0.31474548736806957
SIDM00118_CPID1137_observed           -0.3652968663906671
SIDM00118_CPID1137_observed_expected  -0.20454966921301757
SIDM00118_CPID1137_pval               0.47368816237899203
SIDM00118_CPID1137_zscore             -0.7164910978741945
SIDM00118_CPID1140_gene1              0.0225949072451517
SIDM00118_CPID1140_gene2              0.4180128892392855
SIDM00118_CPID1140_observed           0.0066079615242936275
SIDM00118_CPID1140_observed_expected  -0.1015222059259426

Investigating the BLISS dataset (BRCA1):

TEST_FILE=/home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_RUNMERGED_EXACT_SCALED_ZSCORE_annotated.txt
# Print the header next to the first data row, one column per line:
paste <(head -n1 "${TEST_FILE}" | tr "\t" "\n") \
    <(head -n2 "${TEST_FILE}" | tail -n+2 | tr "\t" "\n") \
    | column -t | head -n15

Output:

Gene_Pair                             ABL1~AKT1
Note                                  LibraryCombinations
MyNote                                LibraryCombinations
Gene1                                 ABL1
Gene2                                 AKT1
SIDM00122_CPID1781_gene1              0.47530418331866797
SIDM00122_CPID1781_gene2              0.46557820920933
SIDM00122_CPID1781_observed           0.09374369062892664
SIDM00122_CPID1781_observed_expected  0.32529735197077925
SIDM00122_CPID1781_pval               0.5244222694751168
SIDM00122_CPID1781_zscore             -0.636543451236575
SIDM00122_CPID1784_gene1              -0.2501479829969767
SIDM00122_CPID1784_gene2              0.3591541514872992
SIDM00122_CPID1784_observed           -0.1873451267967813
SIDM00122_CPID1784_observed_expected  -0.26087780463489024
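
In the rows shown above, the BLISS files are wide: after five annotation columns (Gene_Pair, Note, MyNote, Gene1, Gene2), each sample prefix of the form SIDM…_CPID… contributes six columns (gene1, gene2, observed, observed_expected, pval, zscore). The following is a minimal sketch for summarising a header on that basis. It assumes every per-sample column follows the naming seen in the output; the function name `summarise_bliss_header` is hypothetical and not part of the release.

```shell
# Summarise the header of a BLISS file: number of annotation columns and
# number of columns per per-sample metric.
# Assumption: per-sample columns are named SIDM<digits>_CPID<digits>_<metric>.
summarise_bliss_header() {
    head -n1 "$1" | tr '\t' '\n' | awk '
        /^SIDM[0-9]+_CPID[0-9]+_/ {
            # Strip the sample prefix, keep only the metric name.
            metric = $0
            sub(/^SIDM[0-9]+_CPID[0-9]+_/, "", metric)
            count[metric]++
            next
        }
        # Everything else is treated as an annotation column.
        { annotation++ }
        END {
            printf "annotation_columns\t%d\n", annotation
            for (m in count) printf "%s\t%d\n", m, count[m]
        }' | LC_ALL=C sort
}

# Usage (path as set in the commands above):
# summarise_bliss_header "${TEST_FILE}"
```

Each metric count equals the number of sample prefixes in the file, so the output gives a quick check that no sample is missing one of its columns.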

Extracting data files:

# List of datasets to be included:
LIBRARIES=(COLO1 COLO2 COLO3 BRCA1)

# Analysis id:
ANALISIS="4GUIDES"

# Root of the unpacked release; target paths mirror the layout below it:
RELEASE_DIR=${HOME}/temp/ENCORE_RELEASE_4

for LIBRARY in "${LIBRARIES[@]}"; do
    echo "$LIBRARY"

    # Copying lfc file:
    lfc_source_path=${RELEASE_DIR}/ANALYSIS/GENERAL_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/AVERAGE
    # Strip the release root so the copy lands under ${HOME}/temp/ANALYSIS/...
    lfc_target_path=${HOME}/temp/${lfc_source_path#"${RELEASE_DIR}"/}

    # Create target path:
    mkdir -p "${lfc_target_path}"

    # Copy file:
    cp "${lfc_source_path}/FINAL.gene.stats.annotated.txt" "${lfc_target_path}/"

    # Copying bliss data files:
    bliss_source_path=${RELEASE_DIR}/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT
    bliss_target_path=${HOME}/temp/${bliss_source_path#"${RELEASE_DIR}"/}

    # Create target path:
    mkdir -p "${bliss_target_path}"

    # Copy file:
    cp "${bliss_source_path}/${LIBRARY}_FINAL_EXACT_SCALED_ZSCORE_annotated.txt" "${bliss_target_path}/"
done

The files were moved to the evidence bucket: gs://otar013-ppp/encore/input/2024.01.11-Freeze4

DSuveges commented 8 months ago

From Inigo:

Concerning the files, we are using the 2-guide analysis. Apparently there was a long discussion concerning this point before my arrival, and this is the data we use for final deliveries. I'm not so sure about the path you're using for the log FC. There are many redundancies in this system and maybe we are falling into one, but this is the path I'm using:
ENCORE_RELEASE_4/ANALYSIS/MERGED_COUNTS_SCALED/2GUIDES/ENCORE_{library}/PROCESSED_COUNTS/
There, for sgRNA level: {library}_FINAL_EXACT_logFC_sgRNA_scaled.txt, and for gene level: {library}_FINAL_EXACT_logFC_Gene_scaled.txt
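
The path template above can be expanded and checked per library before any copying. This is a rough sketch under two assumptions that the thread does not confirm: that the release is unpacked under `${HOME}/temp/ENCORE_RELEASE_4` as in the earlier commands, and that the same four libraries apply to the 2GUIDES analysis. The helper names `logfc_path` and `check_logfc_files` are hypothetical.

```shell
# Root of the unpacked release (assumption: same location as in the earlier commands).
RELEASE_DIR="${RELEASE_DIR:-${HOME}/temp/ENCORE_RELEASE_4}"

# Build the logFC path for one library and one level ("sgRNA" or "Gene"),
# following the template quoted from Inigo.
logfc_path() {
    echo "${RELEASE_DIR}/ANALYSIS/MERGED_COUNTS_SCALED/2GUIDES/ENCORE_${1}/PROCESSED_COUNTS/${1}_FINAL_EXACT_logFC_${2}_scaled.txt"
}

# Report each expected file as OK or MISSING; return non-zero if any is missing.
check_logfc_files() {
    status=0
    for library in "$@"; do
        for level in sgRNA Gene; do
            path="$(logfc_path "${library}" "${level}")"
            if [ -f "${path}" ]; then
                echo "OK       ${path}"
            else
                echo "MISSING  ${path}"
                status=1
            fi
        done
    done
    return "${status}"
}

# Usage (library list assumed to match the 4GUIDES selection):
# check_logfc_files COLO1 COLO2 COLO3 BRCA1
```

Returning a non-zero status on any missing file lets the check gate a later copy step, for example `check_logfc_files COLO1 && cp ...`.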

Updating files accordingly:

DSuveges commented 7 months ago

Freeze4 is not planned to be released for 24.03; instead, a new release is expected to be ready for the June Platform release.