Barcode extraction and clone counting

morris-lab / CellTagR

This repository contains the CellTag data analysis R package to support clone calling and lineage reconstruction.

19 stars 9 forks source link

Barcode extraction and clone counting #21

Open crist156 opened 1 year ago

crist156 commented 1 year ago

I’m finding the CellTag system very easy to use (thank you!), but have run into hopefully a few small but solvable problems.

Pertinent information – I am using the CellTag V2 library and performed bulk sequencing on my cells to confirm that the barcode complexity of my cells matches that of the starting library. Having done this, I’m not concerned that my issues below are related to reduced barcode number after transduction. I performed single cell RNAseq on three samples – a day 0, day 7 and day 30 timepoint and am currently working on extracting/analyzing the CellTags. My cells had been transplanted in mice, so in order to reisolate them, I sorted for the GFP-expressing cells (thus also still retaining our CellTags).

Issue 1 - When I execute the barcode extraction for my Day 30 sample, I lose ~75% of the cells sequenced. This is in contrast to my Day 0 and Day 7 samples, where the presence of CellTags does filter out some cells, but nowhere near the majority. I'm confused why there would be so much loss in this specific sample when the number of cells sequenced are similar between Day 0 and Day 30, as are the file sizes (rough proxy for the reads) when I've executed the mapping steps just prior to extraction. I'm also not as worried about experimental loss of barcode because these cells were sorted off of the GFP transgene where the CellTag exists, so they should in theory all have at least one CellTag. Any thoughts on why the extraction might be faulty? Or how I might be able to check/troubleshoot this?

Issue 2 - I seem to have some sequence being counted in almost all of the cells. Any idea where this is coming from and how to remove it? I'm guessing this is an error on my end (mapping error? whitelist error?) since it would be incredibly unlikely that one single barcode exists in all of the cells. Have you ever seen something like this before?

Issue 3 - The number of clonal families that I find in my samples is quite low, but I also notice that these families can't exist as single cells (always have 2 or more cells). I think it's possible that there do exist single clones within my population of cells and at present I'm filtering these out. Is there any way to edit the code to allow for this possibility?

Thank you so much for your time and help! Happy to clarify any of the above. Any and all suggestions are incredibly welcome!

Thanks, Sarah

jindalk commented 1 year ago

Hi @crist156, thanks for using CellTagR!

Issue 1 - This cell loss is rather odd. Are you able to track down which step of the CellTag processing pipeline you are losing significant number of cells at? It might also be worth plotting GFP expression for each of the three samples to check expression levels of the reporter (you would have had to align your data to a custom reference containing GFP as a separate chromosome). More details are in the read me section: Single-Cell CellTag Extraction and Quantification

Issue 2 - We have sometimes seen certain celltags show up in a large fraction of cells in our data as well. However, it is also possible that this might just be a sequence mapping to the genome and is being erroneously identified as a celltag. You can extract this highly abundant celltag and BLAST it against your reference to check for that possibility. We've developed a new version of the CellTag processing pipeline where we explicitly check for such erroneous mapping events and filter them out, please feel free to give it a try: https://github.com/morris-lab/newCloneCalling Side note - it is unusual that you only have 40 unique tags (X-axis) across 1000s of cells. Is this expected/also prevalent in your other samples?

Issue 3 - I am not sure I fully understand this. Wouldn't clonal families/clones containing an individual cell just be any cell with > 0 celltags? Would a clone containing a single cell be helpful for your "clonal" analysis?

Kunal

crist156 commented 1 year ago

Hi Kunal,

Thanks for the response and the suggestions!

Issue 1 - I'll take a look at GFP expression across samples like you suggested, that could at least be a good validation if nothing else. I lose most of the cells at this initial extraction step. For me, this is when I run day30 <- CellTagExtraction(day30, celltag.version = "v2") where day30 is my CellTag object. Could this be an issue with my custom reference genome, do you think? Or is it more likely that the CellTags just aren't there?

Here's a snapshot of the parsed results post-extraction for day 0 ("good") and day 30 (the problem). Like I mentioned before, day 0 and day 30 roughly had the same number of GFP+ cells sequenced, so I wasn't expecting this big of a difference.

Issue 2 - Cool, thanks! I'll give the newCloneCalling a try.

Good question about the cell index being super low. Can you actually clarify if this index refers to the CellTag or the barcodes added at sequencing? I don't always see it as low as this, but you're right, its much lower than the 1000s of cells that exist in the data. It's weird, though, because there's definitely more than 40 unique tags for both CellTags and Cell.BC in the parsed results. Maybe the data collapsed weird? I didn't get any odd errors when running this though.

Issue 3- I realize why you're confused, sorry. I'll try to reword. Right now, I only see groups of 2+ cells defined as my clones (see example below, line 4), but I'm also interested in capturing clone IDs with a frequency of 1 (single cells with a unique ID) since they might represent rare populations that do not exist in large quantities like other clonal groups. Is there a way for me to include the cells with a frequency of 1? I'm hoping to track the dominant (high frequency) and non-dominant (low frequency) over time, as I predict this will fluctuate. Does that make sense?

> day7@clone.size.info[["v2"]]
    Clone.ID Frequency
1          1        35
2          2        79
3          3       119
4          4         2

Thanks for your help! Sarah

jindalk commented 1 year ago

Hi Sarah, I see that Day 30 has far fewer CellTag reads (~80k), is the sequencing depth for these cells much lower than day 0? If not then it is likely these cells contain fewer CellTag transcripts. I am assuming you used the same reference genome for both samples?

Apologies for the confusing axis labels. The X-axis on "CellTag Occurrence Frequency across all Cells" plot refers to the CellTags. Depending upon which stage of the processing pipeline these metric plots were created at, the total tags can be fewer than your original bam parse result. Regardless, 40 would still be quite low, unless you expect significant clonal expansion over time in these samples?

Re issue 3 - I think I now understand what you meant. In order to analyse clones that span multiple time points, you can process bam files for all time points together. CellTagR allows processing a list of bam files (detailed in section 2: Create a CellTag Object). This should allow you to create a joint cell x celltag matrix for all 3 time points. This is also supported by default in the new clone calling workflow I referred to in my last post.

Kunal

crist156 commented 1 year ago

Hi Kunal,

The sequencing depth for Day 30 was similar to Day 0 (similar number of total reads, but since Day 0 had a bit more cells, the depth for Day 30 was a bit more, theoretically).

Yup, same reference genome. So you're thinking the most likely explanation is just fewer CellTag transcripts? Still kinda weird...

Thanks for the explanation of the axis labels - that clarification helps a lot. I was predicting a pretty big bias in clone expansion over time, so not totally surprisingly to see a low number, but the starting number for Day 0 isn't super large either. Is there any way that I've overestimated the barcode complexity in my cells? I thought they were fine based on the bulk sequencing for whitelisting (saw almost all of the expected barcodes represented in the bulk population), but I have no other explanation for the low cell index in the metric plots. Could this be sequencing bias?

Thanks for the suggestion of merging the data via a joint matrix. I'll see how that goes!

Thanks for the input, Sarah

jindalk commented 1 year ago

Hi Sarah, I think a comparison of normalized GFP expression levels across time points would be the best way to assess this. Additionally, comparing Metric plots across days 0 and 30 before binarization would be informative. I suspect you are observing a drop in CellTag expression on day 30 due to transgene silencing. Since CellTags are randomly inserted into the genome, they can be silenced if the surrounding region undergoes changes in accessibility/chromatin state etc as a result of whatever process the cells are undergoing.

Regarding barcode complexity, did you perform bulk RNA or DNA-seq? Was it performed at the same time as Day 0?

Kunal

crist156 commented 1 year ago

Hi Kunal,

Ok, thanks for the input. I'll definitely look at the GFP expression across these timepoints to see if that accounts for these differences.

The bulk sequencing came from the same original cell stock, but not the exact one from this experiment (multiple frozen vials of cells where the bulk came from one and this study from another). I believe it was RNA sequencing. I adapted the steps from the Biddy et al protocol to PCR-amplify out the barcodes from my transduced cells (+ cDNA amplification) and ran on the Illumina MiSeq.

I was also wondering if I could also ask a little bit about the Jaccard analysis step. The protocol says that the default (and likely best) correlation coefficient is 0.7. Would it be ok to change this to 0.75 or 0.8? I was having a hard time understanding the literature surrounding these numbers and was hoping you might have a simpler explanation for it? From what I understand, less than 0.6 would mean that the clusters are unstable, between 0.6-0.75 "measures a pattern", and greater than 0.85 means that the clusters are highly stable. Would it be reasonable to play with the coefficient anywhere between 0.6 to 0.8? Any insight into this would be super helpful.

Thank you so much!

Take care, Sarah

jindalk commented 1 year ago

Hi Sarah, I am not sure why you see a drop in CellTag complexity between your cell stock and day 0. Is it a significant drop? Could there be any cell selection occurring prior to day 0 in your experiment? It is hard for me to say without looking at the actual data.

As for Jaccard threshold, it is loosely based on supp fig 1j of the Biddy et al paper. Jaccard similarity measures the similarity of CellTag signatures between 2 cells (more info). The clone calling pipeline builds a cell-cell Jaccard similarity matrix, binarizes it at the 0.7 threshold and then build a cell-cell graph from the binarized matrix. It then identifies clusters on this graph to identify clones. The clustering algorithm is deterministic (as opposed to Louvain clustering that is commonly used with single-cell data) so the clusters (clones) should be stable irrespective of the cutoff. The cutoff defines the stringency for identifying clones. A high cutoff would mean only cells with highly similar signature would be considered clonally related. I talk a bit more about why we don't just look at perfect celltag matches here: #17

As far as I know, the cluster stability argument is relevant to the 'resolution' parameter in louvain/leiden clustering, so you can definitely use a higher/lower cutoff if you'd like. In my experience, a lower cutoff (like 0.5-0.6) is ok if you are highly confident that unrelated cells have distinct celltag signatures (i.e. complexity of celltag library >> number of cells transduced or avg celltags/cell >> 1).

Best, Kunal