DSuveges closed this issue 7 months ago
So far, the following files are good candidates for ingestion:
COLO1
-rw-r----- 1 dsuveges dsuveges 14M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 153M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO2
-rw-r----- 1 dsuveges dsuveges 18M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 176M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/COLO2_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO3
-rw-r----- 1 dsuveges dsuveges 19M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 179M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/COLO3_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
BRCA1
-rw-r----- 1 dsuveges dsuveges 9.3M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 103M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
The BLISS files are roughly ten times larger than the lfc files. The header and the first data row of a BLISS file, transposed, look like this:
Gene_Pair ABCB1~AKT1
Note LibraryCombinations
MyNote LibraryCombinations
Gene1 ABCB1
Gene2 AKT1
SIDM00118_CPID1137_gene1 0.09880821294868122
SIDM00118_CPID1137_gene2 0.31474548736806957
SIDM00118_CPID1137_observed -0.3652968663906671
SIDM00118_CPID1137_observed_expected -0.20454966921301757
SIDM00118_CPID1137_pval 0.47368816237899203
SIDM00118_CPID1137_zscore -0.7164910978741945
SIDM00118_CPID1140_gene1 0.0225949072451517
SIDM00118_CPID1140_gene2 0.4180128892392855
SIDM00118_CPID1140_observed 0.0066079615242936275
SIDM00118_CPID1140_observed_expected -0.1015222059259426
SIDM00118_CPID1140_pval 0.5826911584237727
SIDM00118_CPID1140_zscore 0.5494580316117462
SIDM00118_CPID1143_gene1 0.010999000562159203
SIDM00118_CPID1143_gene2 0.4064111447365547
SIDM00118_CPID1143_observed -0.1683627331654147
SIDM00118_CPID1143_observed_expected -0.1673361806656628
SIDM00118_CPID1143_pval 0.9638148826476801
SIDM00118_CPID1143_zscore 0.04536687633389127
It seems:
File extraction from the data dump:
# List of datasets to be included:
LIBRARIES=(COLO1 COLO2 COLO3 BRCA1)
# Analysis id:
ANALISIS="4GUIDES"
# Testing if all files are in the release:
for LIBRARY in "${LIBRARIES[@]}"; do
    echo "${LIBRARY}"
    # lfc:
    ls -lah "${HOME}/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt"
    # BLISS:
    ls -lah "${HOME}/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/${LIBRARY}_FINAL_EXACT_SCALED_ZSCORE_annotated.txt"
done
Output:
COLO1
-rw-r----- 1 dsuveges dsuveges 14M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 153M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO2
-rw-r----- 1 dsuveges dsuveges 18M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 176M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO2/SCALED_EXACT/COLO2_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
COLO3
-rw-r----- 1 dsuveges dsuveges 19M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 179M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_COLO3/SCALED_EXACT/COLO3_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
BRCA1
-rw-r----- 1 dsuveges dsuveges 9.3M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/GENERAL_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt
-rw-r----- 1 dsuveges dsuveges 103M Jun 15 2023 /home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt
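The `ls` loop above shows the files when they exist, but a missing file is easy to overlook among the listing lines. A minimal sketch of a stricter check, assuming the same directory layout as above; the function name `check_release_files` and its arguments are hypothetical and not part of the original script:

```shell
# check_release_files RELEASE_DIR ANALYSIS LIBRARY...
# Hypothetical helper: reports every expected lfc/BLISS file that is missing
# and returns a non-zero status if at least one is absent.
check_release_files() {
    local release_dir="$1" analysis="$2"
    shift 2
    local missing=0 library file
    for library in "$@"; do
        for file in \
            "${release_dir}/ANALYSIS/GENERAL_STATS/${analysis}/ENCORE_${library}/SCALED_EXACT/AVERAGE/FINAL.gene.stats.annotated.txt" \
            "${release_dir}/ANALYSIS/BLISS_STATS/${analysis}/ENCORE_${library}/SCALED_EXACT/${library}_FINAL_EXACT_SCALED_ZSCORE_annotated.txt"; do
            if [ ! -f "${file}" ]; then
                echo "MISSING: ${file}" >&2
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only when nothing is missing.
    [ "${missing}" -eq 0 ]
}
```

It could be called as `check_release_files "${HOME}/temp/ENCORE_RELEASE_4" 4GUIDES COLO1 COLO2 COLO3 BRCA1`, so that a later copy step only runs when the check succeeds.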
# Show the header and the first data row of the COLO1 BLISS file side by side:
TEST_FILE="/home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_COLO1/SCALED_EXACT/COLO1_FINAL_EXACT_SCALED_ZSCORE_annotated.txt"
paste <(head -n1 "${TEST_FILE}" | tr "\t" "\n") \
      <(head -n2 "${TEST_FILE}" | tail -n+2 | tr "\t" "\n") \
    | column -t | head -n15
Output:
Gene_Pair ABCB1~AKT1
Note LibraryCombinations
MyNote LibraryCombinations
Gene1 ABCB1
Gene2 AKT1
SIDM00118_CPID1137_gene1 0.09880821294868122
SIDM00118_CPID1137_gene2 0.31474548736806957
SIDM00118_CPID1137_observed -0.3652968663906671
SIDM00118_CPID1137_observed_expected -0.20454966921301757
SIDM00118_CPID1137_pval 0.47368816237899203
SIDM00118_CPID1137_zscore -0.7164910978741945
SIDM00118_CPID1140_gene1 0.0225949072451517
SIDM00118_CPID1140_gene2 0.4180128892392855
SIDM00118_CPID1140_observed 0.0066079615242936275
SIDM00118_CPID1140_observed_expected -0.1015222059259426
# Same view for a BRCA1 BLISS file:
TEST_FILE="/home/dsuveges/temp/ENCORE_RELEASE_4/ANALYSIS/BLISS_STATS/4GUIDES/ENCORE_BRCA1/SCALED_EXACT/BRCA1_RUNMERGED_EXACT_SCALED_ZSCORE_annotated.txt"
paste <(head -n1 "${TEST_FILE}" | tr "\t" "\n") \
      <(head -n2 "${TEST_FILE}" | tail -n+2 | tr "\t" "\n") \
    | column -t | head -n15
Output:
Gene_Pair ABL1~AKT1
Note LibraryCombinations
MyNote LibraryCombinations
Gene1 ABL1
Gene2 AKT1
SIDM00122_CPID1781_gene1 0.47530418331866797
SIDM00122_CPID1781_gene2 0.46557820920933
SIDM00122_CPID1781_observed 0.09374369062892664
SIDM00122_CPID1781_observed_expected 0.32529735197077925
SIDM00122_CPID1781_pval 0.5244222694751168
SIDM00122_CPID1781_zscore -0.636543451236575
SIDM00122_CPID1784_gene1 -0.2501479829969767
SIDM00122_CPID1784_gene2 0.3591541514872992
SIDM00122_CPID1784_observed -0.1873451267967813
SIDM00122_CPID1784_observed_expected -0.26087780463489024
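In the transposed views above, the per-screen columns share a `SIDM<digits>_CPID<digits>` prefix followed by one of six suffixes (`gene1`, `gene2`, `observed`, `observed_expected`, `pval`, `zscore`), which would explain why the BLISS files are so wide. A minimal sketch for listing the distinct prefixes in a file, assuming every per-screen column follows the pattern seen in these first rows; `list_screens` is a hypothetical helper name:

```shell
# list_screens FILE
# Hypothetical helper: prints the unique "SIDM..._CPID..." prefixes found in
# the tab-separated header line of FILE, one per line, sorted.
list_screens() {
    head -n1 "$1" \
        | tr '\t' '\n' \
        | sed -n -E 's/^(SIDM[0-9]+_CPID[0-9]+)_(gene1|gene2|observed|observed_expected|pval|zscore)$/\1/p' \
        | sort -u
}
```

Running `list_screens "${TEST_FILE}" | wc -l` would then give the number of prefixes, and multiplying by six gives the expected number of per-screen columns under this assumption.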
# List of datasets to be included:
LIBRARIES=(COLO1 COLO2 COLO3 BRCA1)
# Analysis id:
ANALISIS="4GUIDES"
# Root of the data dump; target paths mirror its layout under ${HOME}/temp:
RELEASE_DIR="${HOME}/temp/ENCORE_RELEASE_4"
for LIBRARY in "${LIBRARIES[@]}"; do
    echo "${LIBRARY}"
    # Copying lfc file:
    lfc_source_path="${RELEASE_DIR}/ANALYSIS/GENERAL_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT/AVERAGE"
    lfc_target_path="${HOME}/temp/${lfc_source_path#"${RELEASE_DIR}/"}"
    # Create target path:
    mkdir -p "${lfc_target_path}"
    # Copy file:
    cp "${lfc_source_path}/FINAL.gene.stats.annotated.txt" "${lfc_target_path}/"
    # Copying bliss data files:
    bliss_source_path="${RELEASE_DIR}/ANALYSIS/BLISS_STATS/${ANALISIS}/ENCORE_${LIBRARY}/SCALED_EXACT"
    bliss_target_path="${HOME}/temp/${bliss_source_path#"${RELEASE_DIR}/"}"
    # Create target path:
    mkdir -p "${bliss_target_path}"
    # Copy file:
    cp "${bliss_source_path}/${LIBRARY}_FINAL_EXACT_SCALED_ZSCORE_annotated.txt" "${bliss_target_path}/"
done
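Since the BLISS files are over 100 MB each, it may be worth confirming that each copy is complete before using it further. A minimal sketch using `cmp`; the function name `verify_copy` is hypothetical and not part of the original workflow:

```shell
# verify_copy SOURCE TARGET
# Hypothetical helper: succeeds only if TARGET exists and is byte-identical
# to SOURCE; otherwise prints a message to stderr and returns 1.
verify_copy() {
    if [ ! -f "$2" ]; then
        echo "NOT COPIED: $2" >&2
        return 1
    fi
    if ! cmp -s "$1" "$2"; then
        echo "DIFFERS: $2" >&2
        return 1
    fi
}
```

Inside the loop above it could be called right after each `cp`, for example `verify_copy "${lfc_source_path}/FINAL.gene.stats.annotated.txt" "${lfc_target_path}/FINAL.gene.stats.annotated.txt"`.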
The files were moved to the evidence bucket: gs://otar013-ppp/encore/input/2024.01.11-Freeze4
From Inigo:
Concerning the files, we are using the 2-guide analysis. Apparently there was a long discussion concerning this point before my arrival, and this is the data we use for final deliveries. I'm not so sure about the path you're using for the log FC. There are many redundancies in this system and maybe we are falling into one, but this is the path I'm using:
ENCORE_RELEASE_4/ANALYSIS/MERGED_COUNTS_SCALED/2GUIDES/ENCORE_{library}/PROCESSED_COUNTS/
There, for sgRNA: {library}_FINAL_EXACT_logFC_sgRNA_scaled.txt and for gene level: {library}_FINAL_EXACT_logFC_Gene_scaled.txt
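The path template and file names given above can be expanded per library. A minimal sketch, assuming the `{library}` placeholder takes the same values as the `LIBRARIES` array used earlier (COLO1, COLO2, COLO3, BRCA1); the function name `logfc_paths` is hypothetical:

```shell
# logfc_paths RELEASE_DIR LIBRARY
# Hypothetical helper: prints the sgRNA-level and gene-level logFC file paths
# for one library, following the 2GUIDES path template quoted above.
logfc_paths() {
    local dir="$1/ANALYSIS/MERGED_COUNTS_SCALED/2GUIDES/ENCORE_$2/PROCESSED_COUNTS"
    printf '%s\n' \
        "${dir}/$2_FINAL_EXACT_logFC_sgRNA_scaled.txt" \
        "${dir}/$2_FINAL_EXACT_logFC_Gene_scaled.txt"
}
```

For example, `logfc_paths "${HOME}/temp/ENCORE_RELEASE_4" COLO1 | xargs ls -lah` would list both files for COLO1 if they are present in the dump.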
Updating files accordingly:
Freeze4 is not planned to be released for 24.03; instead, a new release is expected to be ready for the June Platform release.
Based on the discussion with Inigo, the most recent (Freeze4) ENCORE dataset is fine to use for generating disease-target evidence. Colo1, Colo2 and Colo3 are fine, and Breast1 is fine. Breast2 and Breast3 have systemic issues, so they will be ignored for now.