markziemann / dee2

Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
http://dee2.io
GNU General Public License v3.0

QC_summary vs QC_SUMMARY #78

Closed uilnauyis closed 4 years ago

uilnauyis commented 4 years ago

Hi! Thank you for the latest update of the package.

I have tried the updated package. I noticed that the 'colData' of the SummarizedExperiment object returned by DEE2 contains both a 'QC_summary' and a 'QC_SUMMARY' column, and that the values in these two columns differ. Could you explain the difference between them?

In my project, I use the quality control result to filter out failed samples. I used to rely on 'QC_summary', but since the update, for some samples the value of 'QC_summary' is WARNING(....) while the value of 'QC_SUMMARY' is 'FAILED(....)'. Which one should I use for quality control?

Many thanks!
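
The filtering step described above can be sketched in base R as follows. This is only an illustration: the accessions and QC strings are made up and the exact format of the real DEE2 QC values is an assumption, so adjust the pattern to whatever the column actually contains.

```r
# Hypothetical sample metadata; accessions and QC strings are illustrative,
# not real DEE2 output.
coldata <- data.frame(
  SRR_accession = c("SRR0000001", "SRR0000002", "SRR0000003"),
  QC_summary = c("PASS", "WARN(3,4)", "FAIL(1)"),
  stringsAsFactors = FALSE
)

# Extract the leading status word (everything before an optional parenthesis).
status <- sub("\\(.*$", "", coldata$QC_summary)

# Keep samples whose status does not start with "FAIL".
keep <- !grepl("^FAIL", status)
passed <- coldata$SRR_accession[keep]
print(passed)
```

Matching on the prefix "FAIL" covers both 'FAIL' and 'FAILED' spellings, which matters when it is unclear which spelling a given column uses.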

markziemann commented 4 years ago

Hello @uilnauyis thanks for the question.

It looks like the QC_summary has been duplicated. The column "QC_SUMMARY" might be an error. Do you have an example where the two columns are different?
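
One way to collect such examples is to compare the two columns row by row and keep the rows that disagree. The sketch below uses a small made-up data frame in place of the real colData, so the sample names and QC strings are assumptions.

```r
# Toy stand-in for the colData table; names and values are hypothetical.
coldata <- data.frame(
  QC_summary = c("PASS", "WARN(4)", "WARN(4)"),
  QC_SUMMARY = c("PASS", "WARN(4)", "FAIL(1)"),
  row.names = c("sampleA", "sampleB", "sampleC"),
  stringsAsFactors = FALSE
)

# Logical vector marking the rows where the two columns disagree.
differs <- coldata$QC_summary != coldata$QC_SUMMARY

# Keep only the disagreeing rows, preserving the data frame structure.
discrepant <- coldata[differs, , drop = FALSE]
print(discrepant)
```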

uilnauyis commented 4 years ago

Hi, sorry for my late reply.

I find the following samples with different values: "SRR1783836" "SRR1783837" "SRR1783838" "SRR1999221" "SRR2153338" "SRR2153409" "SRR2153289"

As far as I can tell, 'QC_SUMMARY' is more consistent with the previous version of dee2, but that is only my observation. For example, sample 'SRR2153289' has no read counts for any gene, so it should clearly be a 'failed' sample. Its 'QC_SUMMARY' is 'FAIL', but its 'QC_summary' is 'WARNING'.

Thank you

markziemann commented 4 years ago

Hi @uilnauyis I have tried to check this but to me there doesn't appear to be a problem. For example the comparisons below indicate there is no discrepancy between QC_summary and QC_SUMMARY. Do you have a reproducible example?

library("getDEE2")

mdat <- getDEE2::getDEE2Metadata(species = "hsapiens")

# first test SRP062203
md <- mdat[grep("SRP062203",mdat$SRP_accession),]

x <- getDEE2(species = "hsapiens", SRRvec = md$SRR_accession, legacy = TRUE, metadata = mdat)

x$QcMx[30,] == x$MetadataFull$QC_summary

xse <- getDEE2::se(x)

xse@colData@listData$QC_summary == xse@colData@listData$QC_SUMMARY

# second test SRP053034
md <- mdat[grep("SRP053034",mdat$SRP_accession),]

x <- getDEE2(species = "hsapiens", SRRvec = md$SRR_accession, legacy = TRUE, metadata = mdat)

x$QcMx[30,] == x$MetadataFull$QC_summary

xse <- getDEE2::se(x)

xse@colData@listData$QC_summary == xse@colData@listData$QC_SUMMARY

uilnauyis commented 4 years ago

Sorry again for replying to you late.

After further comparison between the previous version and the current one, I noticed the following discrepancy, which can be reproduced with the code below:

## I am taking the samples I listed in my previous comment for convenience.
library("getDEE2")
library("SummarizedExperiment")  # provides assay()

## First download with the legacy library.
xLegacy <- getDEE2(species = "hsapiens", SRRvec = c("SRR1783836", "SRR1783837", "SRR1783838", "SRR1999221", "SRR2153338", "SRR2153409", "SRR2153289"), legacy = TRUE)

## Then download with the new version of the library.
x <- getDEE2(species = "hsapiens", SRRvec = c("SRR1783836", "SRR1783837", "SRR1783838", "SRR1999221", "SRR2153338", "SRR2153409", "SRR2153289"))

## This prints FALSE, but it should be TRUE.
sum(xLegacy$GeneCounts[, c('SRR2153289')]) == sum(assay(x)[, c('SRR2153289')])

However, if the dee2 data is downloaded one sample at a time by SRR accession, the result is consistent:

## download data of 'SRR2153289' only with the legacy library.
xLegacy <- getDEE2(species = "hsapiens", SRRvec = c("SRR2153289"), legacy = TRUE)

## download data of 'SRR2153289' only with the new version of the library.
x <- getDEE2(species = "hsapiens", SRRvec = c( "SRR2153289"))

## This prints TRUE, as expected.
sum(xLegacy$GeneCounts[, c('SRR2153289')]) == sum(assay(x)[, c('SRR2153289')])

Thank you for looking into the problem.
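
The single-sample check above can be extended to all samples at once by comparing per-sample totals matched on column name. The sketch below uses two small toy matrices as stand-ins for `xLegacy$GeneCounts` and `assay(x)`; the gene and sample names are hypothetical.

```r
# Toy count matrix standing in for the legacy result (genes x samples).
legacy <- matrix(c(5, 1, 0, 0, 7, 2), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Same values, but with the last two column labels swapped relative to the data.
current <- legacy
colnames(current) <- c("s1", "s3", "s2")

# Align both matrices on the shared sample names before comparing totals.
shared <- intersect(colnames(legacy), colnames(current))
totals_match <- colSums(legacy[, shared]) == colSums(current[, shared])

# Names of the samples whose totals disagree between the two results.
mismatched <- names(totals_match)[!totals_match]
print(mismatched)
```

Indexing both matrices by name rather than by position means the comparison reports a sample only when the data attached to that label really differ.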

markziemann commented 4 years ago

Thanks for this. Based on the result below, it is likely the problem is with the se() function reordering the columns.

> colSums(assay(x))
SRR1783836 SRR1783837 SRR1783838 SRR1999221 **SRR2153289** **SRR2153338** **SRR2153409** 
  30838326   31521751   32761933   91207949    2630896    3354491    3286177 
> colSums(xLegacy$GeneCounts)
SRR1783836 SRR1783837 SRR1783838 SRR1999221 **SRR2153338** **SRR2153409** **SRR2153289** 
  30838326   31521751   32761933   91207949    2630896    3354491    3286177 

For now, you can get reliable results by sorting the SRR accessions on your side before querying.

vec <- c("SRR1783836", "SRR1783837", "SRR1783838", "SRR1999221", "SRR2153338", "SRR2153409", "SRR2153289")
veco <- vec[order(vec)]
xo <- getDEE2(species = "hsapiens", SRRvec = veco)
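
To illustrate why sorting the query helps, here is a minimal base R sketch of the suspected failure mode, using the three per-sample totals from the output above. It assumes the labels were sorted while the data stayed in query order; that is a guess consistent with the output, not a reading of the se() source.

```r
# Per-sample totals in the order the accessions were requested
# (values taken from the legacy output in this thread).
totals <- c(SRR2153338 = 2630896, SRR2153409 = 3354491, SRR2153289 = 3286177)

# Suspected failure mode (an assumption): labels are sorted, data stay in place.
mislabelled <- totals
names(mislabelled) <- sort(names(totals))

# Safer reordering: index by name so each value moves with its label.
reordered <- totals[sort(names(totals))]

print(mislabelled["SRR2153289"])
print(reordered["SRR2153289"])
```

If the query is already sorted, sorting the labels is a no-op, so labels and data cannot drift apart, which is why the workaround is reliable.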

markziemann commented 4 years ago

I have just pushed a simple fix that reorders the SRR vector which looks to have solved the issue. Let me know if this needs a more thorough fix.

markziemann commented 3 years ago

Hi,

The DEE2 team has developed a new feature that lets users request that an SRA transcriptome project be processed on demand. You will receive an email when it's ready.

Give it a try at http://dee2.io/request.html and let us know if you have any suggestions on the feature.

Thanks, Mark Z
