TALON on single-cell data

TinyTasy commented 1 year ago

Dear TALON team,

Thank you so much for your interesting and helpful package! As I am very new to the field of bioinformatics in general, I am not quite sure whether this is the right place to ask.

Nevertheless, I have a simple (and maybe somewhat dense) question: Is it possible to use TALON on long-read single-cell data prepared with 10x Genomics? For my Master's thesis, this would be very helpful.

Any help is greatly appreciated!

Best wishes, Tasy

callumparr commented 1 year ago

I assume you have long-read data from single cells? If so, it seems it would be OK to treat the data as pseudo-bulk, ignoring for the moment which read belongs to which cell. You generate the transcriptome, then use the new transcriptome in the single-cell pipeline you already have, remapping the demultiplexed reads to the expanded annotation.

The tricky bit is how to filter the database to get a confident set of novel transcript models. The reason is that you usually use sample replicates to filter for reproducibility, and it is not obvious what the equivalent would be in a single-cell dataset. You could first bin the reads into pseudo-individual datasets based on the cluster each read's cell belongs to. This works off the assumption that cells in a cluster are similar enough for your purpose. If you have more than one 10x run, you can then compare these pseudo-datasets across runs, as you would replicates, to ascertain reproducibility.

Failing that, set a higher-than-normal minimum read coverage for a model to qualify as robust.

If you have a parallel short-read set, even if that is just bulk RNA-seq from the same pool of cells, you could use something like StringTie to combine the single-cell long-read and short-read data as a hybrid method for creating an annotation from the single-cell dataset.

I am just a random user who uses TALON and does a little single-cell RNA-seq work, so it is best to wait for a more qualified response. I am only offering some ideas.

callumparr commented 1 year ago

Or use https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02525-6

TinyTasy commented 1 year ago

Dear callumparr,

Thank you for your ideas! Actually, they sound really reasonable to me - I do still need some time to process them, as I am very new to this field. I am also grateful for the paper that you sent me; I will read through it. Luckily I will also have parallel short-read samples, so this might indeed be very helpful to me.

As you see, I sadly can't write you a full-fledged answer, but this is good food for thought.

Thank you again! Tasy

fairliereese commented 1 year ago

Hi there! Actually TALON does work with single-cell data. All you need to do is make sure you have the cell barcodes for each read listed in the "CB" tag in the BAM file you're using and use the --cb option when running TALON. Using this strategy, TALON will split your reads based on the barcodes found in the file rather than based on the dataset names provided in the config file. You can check out the details about running TALON in single-cell mode in the README.
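Before running TALON with --cb, it can help to confirm that every alignment really carries a "CB" tag. The following is a minimal sketch of such a check, not part of TALON: it works on SAM text lines (for example, the output of `samtools view`), and the function names `cell_barcode` and `count_missing_barcodes` are hypothetical helpers introduced only for illustration.

```python
def cell_barcode(sam_line):
    """Return the CB tag value of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column and look like TAG:TYPE:VALUE.
    for tag in fields[11:]:
        if tag.startswith("CB:Z:"):
            return tag[len("CB:Z:"):]
    return None


def count_missing_barcodes(sam_lines):
    """Count alignment lines (header lines excluded) that lack a CB tag."""
    missing = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        if cell_barcode(line) is None:
            missing += 1
    return missing


# Toy SAM records, made up for this example.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tCB:Z:AAACCC",
    "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]
print(count_missing_barcodes(example))
```

On real data the same functions could be fed the lines of `samtools view your.bam` read from standard input. A non-zero count suggests the barcodes still need to be added to the file.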

callumparr commented 1 year ago

Oh so we can get a transcript x barcode matrix abundance file at the end?

How do you perform the filtering? If one barcode is one dataset, then the number of supporting reads for a transcript would be low.

fairliereese commented 1 year ago

For datasets with fewer cells, you can run the abundance table generation scripts with no problem. However, as you scale your datasets up, the abundance utility becomes prohibitively slow because of the all-too-common problem of large, sparse matrices in single-cell data. As an alternative, I have written scripts myself to generate scanpy AnnData objects from the read_annot file, which I am happy to share if you want.
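To illustrate why a sparse representation helps here, below is a minimal sketch (not the script mentioned above) that counts reads per cell and transcript from a read_annot-style TSV and keeps only the non-zero entries. It assumes the file has `dataset` and `annot_transcript_id` columns and that, in single-cell mode, `dataset` holds the cell barcode; `sparse_counts` is a hypothetical helper name, and the toy table is made up.

```python
import csv
import io
from collections import Counter


def sparse_counts(read_annot_file, cell_col="dataset",
                  feature_col="annot_transcript_id"):
    """Return (cells, features, triplets) with one triplet per non-zero count.

    Each triplet is (cell_index, feature_index, read_count).
    """
    counts = Counter()
    for row in csv.DictReader(read_annot_file, delimiter="\t"):
        counts[(row[cell_col], row[feature_col])] += 1

    cells = sorted({cell for cell, _ in counts})
    features = sorted({feature for _, feature in counts})
    cell_idx = {cell: i for i, cell in enumerate(cells)}
    feat_idx = {feature: j for j, feature in enumerate(features)}
    triplets = sorted(
        (cell_idx[cell], feat_idx[feature], n)
        for (cell, feature), n in counts.items()
    )
    return cells, features, triplets


# Toy read_annot-like table with only the columns this sketch needs.
demo = io.StringIO(
    "read_name\tdataset\tannot_transcript_id\n"
    "r1\tcellA\ttx1\n"
    "r2\tcellA\ttx1\n"
    "r3\tcellB\ttx2\n"
)
cells, features, triplets = sparse_counts(demo)
print(cells, features, triplets)
```

Only cell and transcript pairs that were actually observed are stored, so memory grows with the number of non-zero entries rather than with cells times transcripts. The triplets could then be passed to a sparse matrix type such as `scipy.sparse.coo_matrix` and wrapped in an AnnData object.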

When I have performed filtering on sc long-read data, I typically don't use the TALON filter in its default mode. There are a few strategies I've used in the past.

1. Filter based on the number of cells each novel transcript is seen in (employed in our LR-Split-seq paper). This strategy is easy enough to use in the current TALON schema by choosing the filter parameters --minDatasets <# of desired cells> --minCount=1.
2. Arguably the better, though clearly nontrivial, approach is to generate a matching bulk long-read transcriptome on the same samples. You can then use this GTF transcriptome to initialize the TALON database that you run the single-cell long reads through, and create a custom pass list that consists only of the transcript IDs that passed filtering in the bulk. Again, I can expand on this if need be.
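The logic of the first strategy can be sketched as follows. This is only an illustration of the idea, not the TALON filter itself: it assumes a read_annot-style TSV with `dataset` (holding the cell barcode), `annot_transcript_id` and `transcript_novelty` columns, and that known transcripts are labeled "Known". The helper name `novel_transcripts_in_min_cells` and the toy data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def novel_transcripts_in_min_cells(read_annot_file, min_cells,
                                   known_label="Known"):
    """Return novel transcript IDs observed in at least `min_cells` cells."""
    cells_per_transcript = defaultdict(set)
    for row in csv.DictReader(read_annot_file, delimiter="\t"):
        if row["transcript_novelty"] == known_label:
            continue  # only novel transcripts are being filtered here
        cells_per_transcript[row["annot_transcript_id"]].add(row["dataset"])
    # A single read per cell is enough, mirroring a minimum count of 1.
    return sorted(
        transcript
        for transcript, cells in cells_per_transcript.items()
        if len(cells) >= min_cells
    )


# Toy table: novelT1 is seen in two cells, novelT2 in only one.
demo_annot = io.StringIO(
    "read_name\tdataset\tannot_transcript_id\ttranscript_novelty\n"
    "r1\tcellA\tnovelT1\tNIC\n"
    "r2\tcellB\tnovelT1\tNIC\n"
    "r3\tcellA\tnovelT2\tNNC\n"
    "r4\tcellB\tknownT\tKnown\n"
)
passing = novel_transcripts_in_min_cells(demo_annot, min_cells=2)
print(passing)
```

Requiring support from several distinct cells stands in for the replicate-based reproducibility check used with bulk data, since any single cell contributes very few reads per transcript.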

catsargent commented 1 year ago

Hi @fairliereese,

Firstly, thanks for your work in developing LR-splitpipe and TALON! I also want to use TALON on single-cell data. I have been following the methods in your LR-Split-seq paper to process short-read and long-read Split-seq scRNA samples. I have processed the short-read data with the Parse Biosciences Split-seq pipeline. I have also run LR-splitseq on the LR data. However, I have an additional complication in that there are 7 human samples + 1 chimp sample. I now need to run minimap2, but I am unsure how to handle having cells from chimp as well as human. I would be grateful for any suggestions on how to deal with that.

Also, I would definitely be interested in the script for generating an AnnData object from the read_annot file!

Many thanks!

fairliereese commented 1 year ago

I'll get you the AnnData script at some point. As for your other question: are the human and chimp samples multiplexed in the same fastqs and Split-seq experiment? If so, one way to deal with this is to find the barcodes that correspond to each species and subset your fastq output from LR-Splitpipe based on that. This would be an extra step to add to your pipeline before mapping, since you'll likely want to map each read to the correct species.
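The barcode-based subsetting could look roughly like the sketch below. It assumes four-line FASTQ records and, purely for illustration, that the barcode is the last colon-separated field of the read header; the actual header layout of your LR-Splitpipe output may differ, so `barcode_from_header` is a hypothetical helper that would need adapting. The barcode-to-species table is also made up.

```python
def barcode_from_header(header):
    """HYPOTHETICAL header layout '@<read_id>:<barcode>'; adapt to your files."""
    return header.rsplit(":", 1)[-1]


def split_fastq_by_species(fastq_lines, barcode_to_species,
                           get_barcode=barcode_from_header):
    """Group four-line FASTQ records by the species assigned to their barcode."""
    by_species = {}
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:  # header, sequence, '+', qualities
            barcode = get_barcode(record[0])
            species = barcode_to_species.get(barcode, "unassigned")
            by_species.setdefault(species, []).append("\n".join(record))
            record = []
    return by_species


# Toy input: two barcodes with a known species and one unknown barcode.
toy_fastq = [
    "@read1:AAAA", "ACGT", "+", "IIII",
    "@read2:CCCC", "TTGA", "+", "IIII",
    "@read3:GGGG", "GGCA", "+", "IIII",
]
toy_species = {"AAAA": "human", "CCCC": "chimp"}
groups = split_fastq_by_species(toy_fastq, toy_species)
print({species: len(records) for species, records in groups.items()})
```

Each group can then be written to its own FASTQ file and mapped with minimap2 against the matching reference genome. Reads with unrecognized barcodes are kept under "unassigned" rather than silently dropped.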

yxsee commented 1 year ago

Hi @fairliereese, I'm interested in the AnnData script as well! It would be great if the script could be added as a utility to TALON!

fairliereese commented 1 year ago

Hi everyone! I'm very excited to let y'all know that I've added a new TALON utility to make gene or transcript-level AnnData objects from a TALON database. The tool should work very similarly to the abundance file script. Please try it out and let me know if it gives you any problems!

https://github.com/mortazavilab/TALON#talon_adata

mortazavilab / TALON

TALON on single-cell data #115