sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Transgene Alignment #260

Closed bwfait closed 3 years ago

bwfait commented 3 years ago

Hello,

Not filing this as a bug because, as far as I know, I am the source of the error here!

First, I'd like to thank you -- zUMIs has been working perfectly for my experiments, and I'm really grateful for how well-constructed and maintained it is!

At the moment, I'm attempting to align a transgene (in this case, an sfGFP fusion protein in the Smart-Seq3 protocol). The README suggested that by providing the extra FASTA, I would receive separate descriptive statistics for the mapping of this extra sequence; however, I see none. I receive a .gtf file for the transgene, and the transgene shows up in the final annotation GTF. However, it is not present in the gene names, reads-per-gene, or expression data, and I can't find any separate stats for it.

Because the transgene would have to be a pretty deep internal read and this is just a small QC run of about 1 million reads, it's entirely possible I'm not seeing my transgene because no reads map to it. However, it also seems possible I added the transgene fasta incorrectly. Do you have any idea how I would know which is the case here?

Here's the FASTA I feed to zUMIs: sfGFP.txt

Thanks for any insight you might have!
Ben

cziegenhain commented 3 years ago

Hi Ben,

How are you passing the FASTA file to zUMIs? Via the reference: additional_files: parameter?

External fasta files like this should show up as the category "User" in the output stats & plots!

bwfait commented 3 years ago

Yeah -- that's where I give the program the FASTA (YAML attached).

Nothing that I can see on the stats and plots. Does that just mean it's not expressed? Or would it still show up on the plot if no reads mapped to it?

Thanks for your help on this!
Ben

ss3YAML.txt

cziegenhain commented 3 years ago

OK, good to know! Since you passed it correctly and you see the transgene in the final zUMIs GTF file, its absence from all of the output really means there were 0 reads mapped to it over the whole dataset. Sorry that this isn't reported more explicitly!

If you want to be absolutely sure, you can also confirm that using:

samtools idxstats *.filtered.Aligned.GeneTagged.UBcorrected.sorted.bam
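As a rough sketch of how that check could be scripted: samtools idxstats prints one tab-separated line per reference sequence (name, length, mapped count, unmapped count), so the transgene's mapped count can be read off directly. The contig name "sfGFP" and the BAM path below are assumptions for illustration; the real contig name is whatever header the extra FASTA uses, and the BAM should be coordinate-sorted and indexed.

```python
import subprocess


def parse_idxstats(text):
    """Map each reference name to its mapped-read count.

    Expects the tab-separated output of `samtools idxstats`:
    name, sequence length, mapped count, unmapped count.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        counts[name] = int(mapped)
    return counts


def mapped_reads(bam_path, contig):
    """Run `samtools idxstats` on one BAM and return the count for `contig`.

    Returns 0 if the contig is not listed at all.
    """
    result = subprocess.run(
        ["samtools", "idxstats", bam_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_idxstats(result.stdout).get(contig, 0)


# Hypothetical usage (path and contig name are placeholders):
# n = mapped_reads("myrun.filtered.Aligned.GeneTagged.UBcorrected.sorted.bam", "sfGFP")
# print("reads mapped to transgene:", n)
```

A count of 0 on a contig that is listed would support the "no reads" explanation, whereas a contig missing from the output entirely would suggest the FASTA never made it into the reference.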

Best, Christoph

cziegenhain commented 3 years ago

Another comment: I just noticed that you seem to have single-end 300 bp reads. Personal curiosity: which machine was this sequenced on? If the quality drops toward the end of the reads (check with FastQC), the standard STAR settings for mapping within zUMIs could be too strict. You should consider adjusting STAR's mismatch controls by passing them via additional_STAR_params!
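For a quick look at whether quality falls off along 300 bp reads, here is a minimal sketch that computes the mean Phred score at each read position. It is only a rough stand-in for the per-base quality plot FastQC produces, and it assumes standard four-line FASTQ records with Phred+33 encoding; the function names and file path are made up for illustration.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def mean_quality_by_position(lines, phred_offset=33):
    """Return the mean Phred quality at each read position.

    `lines` is any iterable of FASTQ lines. Every fourth line
    (the quality string) is used; reads may differ in length.
    """
    sums, counts = [], []
    for i, line in enumerate(lines):
        if i % 4 != 3:  # keep only the quality line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(sums):  # first read reaching this position
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical usage (file name is a placeholder):
# with open_fastq("reads.fastq.gz") as fh:
#     means = mean_quality_by_position(fh)
# print(means[:10], means[-10:])
```

If the means drop sharply in the last part of the read, STAR options such as --outFilterMismatchNmax or --outFilterMismatchNoverLmax are the kind of setting that could be passed through additional_STAR_params; which values are appropriate is not stated in this thread.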

bwfait commented 3 years ago

Thanks! I'm doing a bigger run now, and I can play around with it to see -- for this run it was just a QC check of ~1 million reads, and the fusion gene itself isn't too abundant, so it makes sense to me given the way I structured the library + the Smart-Seq3 protocol that it wouldn't show up.

Thanks for your help!