Closed by MichaelaEBI 5 years ago
But this is exactly what they do at the Expression Atlas, right? Why do we need to process their results further if they consider this to be how the results should be reported?
Another thing to consider is that we only receive the probes that pass the p-value and log fold change cut-offs. So if we summarised the probes that we get, we would be summarising only that filtered subset, not the data for all available probes. Next to-do item: assess how many genes have multiple probes in each comparison per study.
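For the to-do item above, a minimal sketch of the count, assuming the evidence has already been flattened into `(study, comparison, gene, probe)` tuples. The function name, the record layout and the IDs in the example are all hypothetical and not part of the Atlas submission format.

```python
from collections import defaultdict


def count_multi_probe_genes(records):
    """Count, per (study, comparison), the genes reported with more than one probe.

    `records` is an iterable of (study, comparison, gene, probe) tuples.
    This layout is an assumption made for illustration only.
    """
    # Collect the distinct probes seen for each gene within a comparison.
    probes = defaultdict(set)
    for study, comparison, gene, probe in records:
        probes[(study, comparison, gene)].add(probe)

    # Tally the genes that have at least two distinct probes.
    counts = defaultdict(int)
    for (study, comparison, _gene), probe_ids in probes.items():
        counts[(study, comparison)] += len(probe_ids) > 1
    return dict(counts)


# Hypothetical example data.
records = [
    ("S1", "c1", "G1", "p1"),
    ("S1", "c1", "G1", "p2"),
    ("S1", "c1", "G2", "p3"),
    ("S1", "c2", "G1", "p1"),
]
multi_probe_counts = count_multi_probe_genes(records)
```

Note that this only sees probes that passed the cut-offs, so it counts multi-probe genes among the submitted evidence rather than on the full array.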
The Expression Atlas team is working on aggregating probes per gene. We will hear from them when they have implemented a solution.
According to Pablo from the Atlas team, microarray probe aggregation has been incorporated into their code, so there should be only one evidence string per gene per comparison for both microarray and RNA-seq data in the next submission. The solution they have chosen is to select, for each gene, the probe with the highest average intensity among all of that gene's probes in a given experiment.
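To make the selection rule concrete, here is a rough sketch of "keep the probe with the highest average intensity per gene". This is not the Atlas code: the function name, the two input mappings and the example values are assumptions for illustration, and ties are broken by keeping the first probe encountered, which the Atlas team may handle differently.

```python
from statistics import mean


def pick_probe_per_gene(probe_to_gene, intensities):
    """Return {gene: probe}, keeping the probe with the highest mean intensity.

    probe_to_gene: {probe_id: gene_id} for one experiment (assumed layout).
    intensities:   {probe_id: [intensity per assay]} for the same experiment.
    """
    best = {}  # gene -> (probe, mean intensity)
    for probe, gene in probe_to_gene.items():
        avg = mean(intensities[probe])
        # Replace the current choice only if this probe is strictly better.
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe, avg)
    return {gene: probe for gene, (probe, _avg) in best.items()}


# Hypothetical example data.
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}
intensities = {"p1": [5.0, 6.0], "p2": [8.0, 7.0], "p3": [3.0, 3.0]}
selected = pick_probe_per_gene(probe_to_gene, intensities)
```

With a rule like this, each gene contributes a single probe per experiment, so every comparison in that experiment yields at most one evidence string per gene.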
Atlas have implemented microarray probe aggregation for the 19.06 release, but 14 contrasts across 7 experiments still have multiple probes per gene:
Experiment ID | Contrast name |
---|---|
E-GEOD-19279 | 'pancreatic ductal adenocarcinoma liver metastasis' vs 'normal' in 'liver' |
E-GEOD-19279 | 'primary pancreatic ductal adenocarcinoma' vs 'normal' in 'pancreas' |
E-GEOD-22529 | 'chronic lymphosyte leukemia' vs 'normal' on 'Affymetrix HG-U133B array' |
E-GEOD-29598 | 'formalin-fixed,parafin-embedded xenograft; messageAmp Premier methodology' vs 'fresh-frozen xenograft; normal' |
E-GEOD-29598 | 'fresh-frozen xenograft; messageAmp Premier methodology' vs 'fresh-frozen xenograft; normal' |
E-GEOD-29598 | 'infected with adenovirus expressing GFP; ; messageAmp Premier methodology' vs 'infected with adenovirus expressing GFP; ; normal' |
E-GEOD-29598 | 'infected with adenovirus expressing MYC; ; messageAmp Premier methodology' vs 'infected with adenovirus expressing MYC; ; normal' |
E-GEOD-29598 | 'infected with adenovirus expressing RAS; ; messageAmp Premier methodology' vs 'infected with adenovirus expressing RAS; ; normal' |
E-GEOD-31138 | 'invasive ductal carcinoma' vs 'normal' |
E-GEOD-32175 | 'lung cancer' vs 'normal' |
E-GEOD-44408 | 'lymph node metastasis from infiltrating ductal breast carcinoma' vs 'normal lymph node' |
E-GEOD-44408 | 'primary node-negative infiltrating ductal breast carcinoma' vs 'normal lymph node' |
E-GEOD-44408 | 'primary node-positive infiltrating ductal breast carcinoma' vs 'normal lymph node' |
E-GEOD-8514 | 'aldosterone-producing adenoma' vs 'normal' |
Atlas are investigating this, and it should be fixed in their next release.
Microarray probe aggregation has been completed for 19.09, and the issue mentioned above has been resolved.
This is an example where multiple probes per gene are shown as separate pieces of evidence for the association between SFTPC & lung disease. Should multiple probes be summarised and treated as a single evidence string?
Expression Atlas currently submits data from 369 studies to Open Targets. In 298 of these studies, at least one gene has more than one probe. Of those, 247 studies have genes with more than 2 probes, 208 with more than 3 probes, 167 with more than 4 probes, and 122 with more than 5 probes.