leekgroup / recount-website

Code for the Recount project
https://jhubiostatistics.shinyapps.io/recount/
MIT License

<NA> values in rse_tx.RData #20

Closed: hwartmann closed this issue 5 years ago

hwartmann commented 5 years ago

Hey

I've noticed that some transcripts contain NAs in the assay count table, e.g. ENST00000622420.1 in DRP001055, where all four samples are NA. In the GTEx data there are 4.2 million NAs in total, e.g. for transcript ENST00000604479.5, but there only a subset of the samples is affected.

Could you please verify that for me and let me know how to interpret this? I've been struggling with this for a few days now.

Thank you

lcolladotor commented 5 years ago

Hi @hwartmann,

This is basically the same as the second question in https://github.com/leekgroup/recount/issues/18 that Jack Fu @JMF47 will answer.

Best, Leo

JMF47 commented 5 years ago

Hi @hwartmann, I have responded in the other thread. The brief recap is that when read lengths differ between samples, our ability to estimate transcript abundances differs as well.

hwartmann commented 5 years ago

Thank you for getting back to me @JMF47

So what is your suggestion for dealing with these transcripts? Can I set the NAs to zero, or should I drop any transcript containing an NA?

JMF47 commented 5 years ago

What is your objective? I would recommend against setting NAs to 0. Whether or not you drop a transcript that contains any NAs depends on what you would like to do with the data.
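To make the two options concrete, here is a minimal sketch on a toy table. It assumes that an NA means the abundance could not be estimated for that sample, not that the transcript was absent. The transcript names, the values, and the helper functions are all invented for illustration and are not part of recount.

```python
import math

# Toy transcript-by-sample count table; names and values are invented.
# float("nan") stands in for the NA entries discussed in this thread.
NA = float("nan")
counts = {
    "tx_all_na":   [NA, NA, NA, NA],
    "tx_some_na":  [12.0, NA, 30.0, NA],
    "tx_complete": [5.0, 7.0, 0.0, 9.0],
}

def na_count(row):
    """Number of NA entries in one transcript's row."""
    return sum(1 for v in row if math.isnan(v))

def drop_incomplete(table):
    """Keep only transcripts with no NA in any sample."""
    return {tx: row for tx, row in table.items() if na_count(row) == 0}

def mean_ignoring_na(row):
    """Mean over the samples that have an estimate; NaN if none do."""
    seen = [v for v in row if not math.isnan(v)]
    return sum(seen) / len(seen) if seen else NA

def mean_zero_filled(row):
    """Mean after replacing NA with 0, shown only to expose the distortion."""
    return sum(0.0 if math.isnan(v) else v for v in row) / len(row)

na_per_tx = {tx: na_count(row) for tx, row in counts.items()}
complete = drop_incomplete(counts)
```

For `tx_some_na`, the NA-aware mean is 21.0, while zero-filling halves it to 10.5, because two samples that were never estimated are treated as unexpressed. Dropping incomplete transcripts avoids that distortion but discards `tx_some_na` entirely, which is why the choice depends on the analysis. In R, the same checks would apply to the assay matrix of the transcript-level RSE object.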

hwartmann commented 5 years ago

Will I run into the same issue if I work with recount2 gene or exon counts?

JMF47 commented 5 years ago

I do not believe so, but @lcolladotor can chime in on the gene and exon count front.

hwartmann commented 5 years ago

OK, thanks. But in any case, we do not really understand how what you described can result in NAs. Could you elaborate a bit more, or point me to a source that would explain this to us?

JMF47 commented 5 years ago

See https://www.biorxiv.org/content/biorxiv/early/2018/01/12/247346.full.pdf (in particular, the estimation of the feature matrix, which calculates the expected number of counts falling into each exon/junction feature for a random read of a given read length).

lcolladotor commented 5 years ago

There are no NAs in the counts for the gene/exon RSE objects. The counting method for those is different from the one used for the transcript objects. See https://f1000research.com/articles/6-1558/v1 for the gene/exon ones.