kharchenkolab / dropEst

Pipeline for initial analysis of droplet-based single-cell RNA-seq data
GNU General Public License v3.0
86 stars 43 forks source link

DropEst does not accept cellbarcodes from 10x. #12

Closed bishwaG closed 6 years ago

bishwaG commented 6 years ago

I have got cellbarcode whitelist from 10X Genomics. I tried to link it via configuration xml file and tried running dropEst, but it did not accept the input. Cell barcodes look like following.

AAACCTGAGAAACCAT AAACCTGAGAAACCGC AAACCTGAGAAACCTA AAACCTGAGAAACGAG AAACCTGAGAAACGCC . . .

I got following warning. WARNING: barcodes line has wrong format: 'AAACCTGAGAAACCAT' WARNING: barcodes line has wrong format: 'AAACCTGAGAAACCGC' WARNING: barcodes line has wrong format: 'AAACCTGAGAAACCTA' WARNING: barcodes line has wrong format: 'AAACCTGAGAAACGAG' WARNING: barcodes line has wrong format: 'AAACCTGAGAAACGCC'

I used configs/10x.xml file and changed my_fiile.txt>myfile.

VPetukhov commented 6 years ago

As I remember, 10x barcodes has similar construction process as InDrop: both left and right parts of the barcode have its own whitelist. And the whole whitelist is created by randomly combining all possible values for left and right parts. For instance, for InDrop we have 384 possible values for the each part. Using this information we can significantly reduce search space, as we have to compare each barcode with 2 * 384 sequences instead of 384^2. Thus, it's required to store indrop-like barcodes in the two-column format, one column for each part.

It would be great if you can attach the barcodes file. In this case I just add it to the other lists in the appropriate format.

bishwaG commented 6 years ago

The file containing cellbarcodes can be downloaded from https://www.dropbox.com/s/rp8gyr7c9x070ep/737K-august-2016.txt?dl=0 I did not get what you meant by "left and right parts of the barcode".

VPetukhov commented 6 years ago

Thank you!
For inDrop v3, for instance, left part is the first 8 nucleotides of a barcode (e.g. GTTTGTTT), and right part is the last 8 nucleotides of a barcode (e.g. AAACAAAC). Combining them we obtain the full barcode (e.g. GTTTGTTTAAACAAAC).

ahy1221 commented 6 years ago

Hi, I also meet the same question about 10X cell barcode. There is only one part of 10X cell barcode. It is a 16bp sequence as indicated by 10x userguide To avoid the warning , I change the value to "const" I am wondering that whether this is the right way to use dropEst . @VPetukhov @bishwaG

VPetukhov commented 6 years ago

@ahy1221, @bishwaG Sorry, I have quite a lot of work this month and didn't fix this issue. Using "const" type is a valid solution, but in this case merge procedure can take much longer.

ahy1221 commented 6 years ago

Thank you very much for the quick replying.
After running with "const" type using 10x cell barcode list I got an error

Barcode 'CAAGAAATCATCATTC-1' has wrong length (16 expected)

It seems that it adds the extra "-1" to the original barcode and then complaining the wrong CB length. How can I fixed it ? @VPetukhov

VPetukhov commented 6 years ago

Actually, if you already used CellRanger, you can just skip the merge step (don't use -m option). While dropEst uses more complicated algorithm, in most cases CellRanger performs equally well. Otherwise you need to change the field Estimation/BamTags/cb in the config file. Default value is CB which is the tag for corrected barcode in 10x protocol. You need to set it to CR, which is the tag for raw cell barcode.

ahy1221 commented 6 years ago

Hi @VPetukhov , it works after changing BamTags as 'CR'. Since you mention that CellRanger performs equally well, I am interested in having a comparison between dropEst and cellRanger. I found that dropEst always gives more good cells than cellRanger (cellRanger ~700 cells V.S dropEst ~1000 cells). And the top UMIs from dropEst are quite low complexity. I am wondering that should I trust the automatic quality control from dropEst or do some manual filtering by myself. If so, what other manual filters you are suggesting for that data ? Below is the QC report of my data generated from dropEst, feel free to give any comments or advice. image image image image image image image image image

VPetukhov commented 6 years ago

@ahy1221 thank you for providing this report, your quality plot (Score vs Cell Rank) looks really interesting.

I am wondering that should I trust the automatic quality control from dropEst or do some manual filtering by myself. If so, what other manual filters you are suggesting for that data?

Score plot is an aggregation over several QC indicators, and in complicated cases (as yours) it's better to perform further analysis by hands.

I'd strongly recommend you to look at the plot of mitochondrial fraction. I mean to do it each time you get new dataset. We can't include it in the report because chromosome name, used for mitochondrial genes, varies between annotations. You can make this plot using the following code, replacing mit_chrom_name and path_to_dataset with the corresponding values (code is a part of low-quality-cells vignette):

holder <- readRDS(path_to_dataset)
mit_chromosome_fraction <- GetChromosomeFraction(holder$reads_per_chr_per_cells$Exon[colnames(holder$cm),], mit_chrom_name)

FractionSmoothScatter(mit_chromosome_fraction, plot.threshold=T, main='Chromosome')

I found that dropEst always gives more good cells than cellRanger (cellRanger ~700 cells V.S dropEst ~1000 cells).

Filtration algorithm, published by CellRanger is quite simple. DropEst heuristic for estimation of the number of cells isn't complex as well. Both of them can be mistaken sometimes. I prefer to use "Cell rank vs #UMIs * #Cells" plot to validate their results. In your case it looks consistent with dropEst estimation, so I think you can trust to this number.

And the top UMIs from dropEst are quite low complexity.

It's OK for long UMIs to have low number of molecules. But it's hard to be sure if it's alright in an individual case without dropest log. Could you, please, attach it?

Please, don't hesitate to write if you have some further questions.

ahy1221 commented 6 years ago

@VPetukhov Thank you very much for the quick replying. If I understand right, dropEst QC first use umi counts per cell to identify high quality cells and then calculate a cell score based on several QC indicators except the mitochondrial fraction. If so, should I trust label of high quality or the cell score ? What you suggest is that I need to recalculate the cell score by adding mt fraction. Here I showed the MT fraction of cells image

It seems most of cells are alive because the density are enriched in the bottom left. Then I calculate the cell scores following with default parameter and manual MT filter ( < 0.2) image image

The cell scores looks normal this time. This makes me a little confused because I can't reproduce the cell score plot generated from the dropReport.Rsc even with default parameter. Another question is that what is meaning of the metric #UMIs * #Cells and do such multiplication ? In my imagination, for a cell we can calculate #UMIs , but how can a cell got #Cells for calculation ?

Regarding top UMIs , what I mean is that is that the sequence of UMIs looks like a ployA/T tail (ATTTTTTTT or something ). Since the UMIs should be random sequence even with A/T preference I would expect some normal random sequences like ATCGAAAACCTTCTCC etc. This makes me worried about is that a true UMI barcode ? Attached the dropEst log file for that. Another thing I noticed from the log is that it has an ERROR saying that "ERROR: can't find chromosome, id: -1". The GTF I used is Ensemble Release 88. Thank you very much for the help !

dropEst.log

VPetukhov commented 6 years ago

If I understand right, dropEst QC first use umi counts per cell to identify high quality cells and then calculate a cell score based on several QC indicators except the mitochondrial fraction.

You're right about the order. Though, you can estimate a score using mitochondrial fraction by providing either list of mitochondrial genes or name of the chromosome to ScorePipelineCells (see low-quality-cells vignette).

If so, should I trust label of high quality or the cell score ?

Do you mean labels, estimated by EstimateCellsQuality? It's just another representation of thresholds, estimated by EstimateCellsNumber. Perhaps, I should make this function private.

Then I calculate the cell scores following with default parameter and manual MT filter ( < 0.2)

Did you filter both holder$cm and holder$cm_raw? Actually, the proper way to filter mit. fraction is to call

scores <- ScorePipelineCells(holder, mit.chromosome.name=chr_name, filter.high.mit.fraction=T)

(again, see low-quality-cells vignette).

This makes me a little confused because I can't reproduce the cell score plot generated from the dropReport.Rsc even with default parameter.

Do you mean you can't at all? dropReport.Rsc just call

holder <- readRDS('cell.counts.rds')
scores <- ScorePipelineCells(holder)

Did you try it? If it doesn't produce the result above, try to run d <- readRDS('cell.counts.rds') and next to run dropEst/scripts/report.Rmd line-by-line.

Another question is that what is meaning of the metric #UMIs * #Cells and do such multiplication ? In my imagination, for a cell we can calculate #UMIs , but how can a cell got #Cells for calculation ?

We expect that there would be a lot of cells with large number of UMIs, and even more cells with few UMIs (around expectation of background #UMIs/cell). But we expect that there would be only few cells with medium number of UMIs (kind of intersection of the right tail of real cells and the left tail of background cells). You can just plot #Cells vs log(#UMIs), but it looks ugly without any normalization. So we used #Cells * #UMIs (by kind recommendation of Dr. Allon Klein).

Regarding top UMIs , what I mean is that is that the sequence of UMIs looks like a ployA/T tail (ATTTTTTTT or something ). Since the UMIs should be random sequence even with A/T preference I would expect some normal random sequences like ATCGAAAACCTTCTCC etc.

We addressed this question in the paper (see Results - "Uneven UMI frequency distribution distorts molecular count estimation").

This makes me worried about is that a true UMI barcode ?

Shortly, don't worry. It's a common problem of at least droplet-based scRNA-seq.

Attached the dropEst log file for that. Another thing I noticed from the log is that it has an ERROR saying that "ERROR: can't find chromosome, id: -1". The GTF I used is Ensemble Release 88.

I suppose that it's a problem of CellRanger, I got it several times. BamTools library fails to read one line of their bam. You can ignore it.

dropEst.log

Nothing suspicious there. Speaking about reads per UMI problem: you have 82716693 * 0.62 = 51284350 reads and 4^10 = 1048576 UMIs. So, expectation of reads per UMI is 51284350 / 1048576 = 48.9 reads. So it's alright that the most extreme UMIs have 67-156 reads.

ahy1221 commented 6 years ago

Thanks you very much! After going through the vignette, paper and some inside source code in these two days, I summarized the cell-level quality control of dropestr as below:

image

Hope this help. Another question is about saturation info. Either in the paper Supplementary Note 1 or my own data are already saturated (as opposite to the mouse ES data). However, from the curve of UMIs V.S Sequenced Genes ( > 1 UMI ), it seems that the slope is quite straight ( not flat), indicating that if I can sequence more UMIs I can get more genes. Note that I performed UMI correction after QC. image

image Does this also mean I should sequence more ? From the your experience, what is the inter-relationship among reads, UMIs and sequenced genes ? Would they increase together ?

VPetukhov commented 6 years ago

Thank you a lot for your diagram! I'll add it in the documentation if you don't mind.

Could you please give some explanation or source code for the last figure (Genes vs UMIs)? Is it just number of UMIs per gene across all genes?

ahy1221 commented 6 years ago

Feel free to use any diagram I posted here. Regarding the figure Genes vs UMIs, it is just number of UMIs per gene across all genes for all the cells. I just import the filtered(after QC) UMI matrix into SingleCellExperiment Class and calculate total UMI counts and total genes sequenced ( > 1 UMI ) , code shows as below:

library(scran) 
library(scater)
library(SingleCellExperiment)
library(ggplot2)
cm <- readRDS("dropEst.filtered.matrix.rds")
sce <- SingleCellExperiment(assays= list(counts = cm))
sce <- scater::calculateQCMetrics(sce)
CellData <- as.data.frame(colData(sce))
ggplot(CellData, aes(x = total_counts, y = total_features)) +
geom_point()

I think my question about saturation is about how to estimate how deep is enough for droplet-based sequence. Specifically, how many reads is enough to sequence a cell to get a reasonable number of genes ? The reasonable number here also indicate that a gene should get enough UMIs for abundance estimation in the depth setting.

Thanks for your help again and Merry Christmas !

Yao He