single-cell-genetics / cellSNP

Pileup biallelic SNPs from single-cell and bulk RNA-seq data
Apache License 2.0
74 stars 11 forks source link

Running cellSNP on transcriptomic BAM from salmon alevin #14

Open lmweber opened 4 years ago

lmweber commented 4 years ago

Hi, thanks for your previous responses a while back.

I am now trying to use cellSNP/Vireo with a SAM/BAM output file from salmon alevin instead of Cell Ranger, and wanted to see if this would be feasible somehow.

The main issue seems to be that salmon alevin maps to the transcriptome and creates a transcriptomic BAM, while Cell Ranger maps to the genome and creates a genomic BAM. Since cellSNP expects a genomic BAM, it then gives me errors due to not finding chromosome references in the transcriptomic BAM.

I just checked with the salmon alevin authors on Slack this morning to find out more about this - I believe the chromosome reference is normally stored in the RNAME field in the genomic BAM, but in the transcriptomic BAM this field contains the transcript name. The genomic/transcriptomic coordinates are also different.

Do you know if there would be a way to run cellSNP directly on a transcriptomic BAM, or convert this internally somehow? Alternatively, the salmon alevin authors pointed me towards some tools to directly convert transcriptomic to genomic BAM, which could be a way around this - but I thought I would check here first.

The main reason this has come up is that we tend to have various issues with Cell Ranger (slow runtime, memory problems, multi-mapping reads), so we would prefer to use salmon alevin in our pipelines, and were wondering if this has already been considered. Thanks!

lmweber commented 4 years ago

Just following up on this - I have been recommended this tool sam-xlate, which can convert a BAM from genomic to transcriptomic coordinates: https://github.com/mozack/ubu/wiki

However it isn't clear to me yet whether it can also go the other way - transcriptomic to genomic coordinates, which is what would be needed to run cellSNP on the salmon alevin output. Will try it out though and see if this works.

huangyh09 commented 4 years ago

Hi, a quick answer is that if you can convert the genome based VCF into transcriptome based VCF, cellSNP may work straightforward (in mode 1). This means you need to change to chromosome id into transcript id and SNP position on the transcript. Also, keep an eye on the CB and UMI barcodes, as I'm not sure if salmon works similarly as STAR in cellranger.

lmweber commented 4 years ago

Oh cool, thanks for this idea, I'll look into this. I am already manipulating the vcf file to keep only variants in 3' UTR regions (this greatly speeds up runtime, for only a small drop in performance - using some code originally from Pete Hickey), so it may be possible to add this somehow.

Yes, the cell barcodes and UMI barcodes were also an issue. We asked the salmon maintainers (Avi Srivastava) about this, and Avi has kindly added an update in the latest development version of salmon, which includes the barcodes (CB and UR fields). (This discussion was in the Bioconductor Slack, #singlecell-queries channel, if you are interested in joining it.)

lmweber commented 4 years ago

Thanks for this idea. Unfortunately, I don't think it worked in the end. I used Ensembl Variant Effect Predictor (VEP) tool to get the transcriptome ID and position for each variant, and then rebuilt the vcf file to have transcript ID in column #CHROM, and transcriptome position in POS. But cellSNP was still giving me partial errors like the following, and the output files are empty, so I think it is not able to match the transcripts in the BAM file with this setup.

22.00% positions processed.
24.00% positions processed.
Can't find references chrENST00000558905.1 in samFile
Warning: chrom is None
26.01% positions processed.
28.01% positions processed.

So unfortunately it seems it may not be possible to use cellSNP with alevin for now. I have tried both of the following options:

If you have any other ideas, please let me know, since it would be really useful for us if we could somehow use cellSNP / Vireo together with alevin. For now, I think I will go back to using Cell Ranger instead of alevin in our pipeline, since this does work (but Cell Ranger is much slower and has issues with multi-mapping reads, which is why we prefer alevin).

Thanks again for your help.