liulab-dfci / MAESTRO

Single-cell Transcriptome and Regulome Analysis Pipeline
GNU General Public License v3.0

Is MAESTRO compatible with 10X data derived from nuclear RNA? #98

Closed: Dazcam closed this issue 3 years ago

Dazcam commented 3 years ago

Hello,

I'm currently installing the MAESTRO prerequisites and, after reading the paper, I'd like to ask if MAESTRO is compatible with 10X data derived from nuclear RNA, particularly if I'm looking to integrate single-modal snRNA- and snATAC-seq data?

And more specifically, could the use of a pre-mRNA reference and GTF files for alignment, as opposed to standard reference/annotation files, impact a MAESTRO analysis at all?

Until now I have been using Cell Ranger 4 for my analysis, which recommends using a pre-mRNA reference and GTF file for nuclear RNA. I had started creating STARsolo-compatible versions of these files for my MAESTRO analysis and wondered whether this is the best course of action, particularly as 10X have recently released Cell Ranger v5, which includes a new function for dealing with intronic reads without the need for a pre-mRNA reference, and STARsolo also provides a similar function.

Regardless, it would be useful to hear if you have any recommendations or points of interest that I should consider when running MAESTRO using single-nuclear data.

Many Thanks,

Darren

crazyhottommy commented 3 years ago

Hi, MAESTRO uses STARsolo for scRNA-seq quantification. For single-nuclei data, you can add --soloFeatures GeneFull by manually editing the Snakefile after you have initiated it; the relevant line is https://github.com/liulab-dfci/MAESTRO/blob/master/MAESTRO/Snakemake/scRNA/Snakefile#L48

In the future, we should expose that as a parameter in the config.yaml file.

Thanks!

Dazcam commented 3 years ago

Many thanks for responding. I will add that option to the Snakefile today and see if it runs to completion. The pipeline hit the skids after the scrna_rseqc_genecov rule. Although that rule completed without error, the logs reported the following warning:

Cannot get coverage signal from 14510_PFC_RNAAligned.sortedByCoord.out.sample.bam ! Skip

    Sample  Skewness
@ 2021-01-09 00:14:17: Running R script ...

This is likely due to a mismatch between the BED and BAM files. It caused the pipeline to choke during the scrna_rseqc_plot rule, as the RNAGenebodyCoveragePlot could not be generated.

Error in `[.data.frame`(gene_cov, , 2) : undefined columns selected
Calls: RNAGenebodyCoveragePlot -> [ -> [.data.frame
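One way to test the suspected BED/BAM mismatch is to compare the contig names used by the two files. The sketch below is only an illustration: it assumes the mismatch is in chromosome naming (for example `chr1` versus `1`), which the thread does not confirm, and the helper names are hypothetical rather than part of MAESTRO or RSeQC. It works on plain text, so the SAM header would first need to be exported with something like `samtools view -H`.

```python
def bed_contigs(bed_text):
    """Return the set of contig names in the first column of BED-formatted text."""
    names = set()
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines and the optional header lines that BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def sam_header_contigs(header_text):
    """Return the set of contig names found in the @SQ lines of a SAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def shared_contigs(bed_text, header_text):
    """Contig names present in both files (hypothetical helper)."""
    return bed_contigs(bed_text) & sam_header_contigs(header_text)


# Toy inputs: the BED uses "chr1" while the BAM header uses "1".
bed = "chr1\t100\t200\tgeneA\t0\t+\n"
header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:248956422\n"

# An empty result means the two files share no contig names.
print(shared_contigs(bed, header))
```

If the shared set is empty or much smaller than expected, the BED annotation and the alignment reference probably come from differently named assemblies.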

I also had a buffer size issue. I assume this is due to my samples being sequenced extremely deeply?

EXITING because of fatal error: buffer size for SJ output is too small
Solution: increase input parameter --limitOutSJcollapsed

I managed to solve it by adding the following line to the shell command of the scrna_map rule:

--limitOutSJcollapsed 5000000 

Source here. May be worth adding this somewhere in config or docs?
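As a rough illustration of how the two STAR options mentioned so far (`--soloFeatures GeneFull` and `--limitOutSJcollapsed`) could be driven from configuration rather than hand-edited into the Snakefile, here is a minimal Python sketch. The config keys `single_nuclei` and `limit_out_sj_collapsed` and the function `star_extra_args` are assumptions made up for this example; they are not MAESTRO's actual config.yaml fields.

```python
import shlex


def star_extra_args(config):
    """Build extra STAR arguments from a (hypothetical) config mapping."""
    args = []
    # GeneFull counts reads across the whole gene body, introns included,
    # which is what single-nuclei data usually needs.
    if config.get("single_nuclei", False):
        args += ["--soloFeatures", "GeneFull"]
    # Only override STAR's default junction buffer when a value is given.
    limit = config.get("limit_out_sj_collapsed")
    if limit is not None:
        args += ["--limitOutSJcollapsed", str(int(limit))]
    return args


# Hypothetical config values mirroring the settings used in this thread.
config = {"single_nuclei": True, "limit_out_sj_collapsed": 5000000}
print(shlex.join(star_extra_args(config)))
# --soloFeatures GeneFull --limitOutSJcollapsed 5000000
```

The resulting string could then be appended to the STAR call in the mapping rule, leaving the defaults untouched when neither key is set.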

Are you planning on adding ssclusteval to the pipeline?

Dazcam commented 3 years ago

UPDATE: 13th Jan 2021

When running with the --soloFeatures GeneFull parameter, some of the output directories are renamed so that they no longer match the paths specified in the Snakefile.

Instead of: Result/STAR/%sSolo.out/Gene/raw/matrix.mtx

They are stored in Result/STAR/%sSolo.out/GeneFull/raw/matrix.mtx

I think this only affects the scrna_map and scrna_qc rules.

Error message:

MissingOutputException in line 21 of /scratch/c.c1477909/maestro_analysis/14510_PFC_RNAv2/Snakefile:
Job Missing files after 5 seconds:
Result/STAR/14510_PFC_RNASolo.out/Gene/raw/matrix.mtx
Result/STAR/14510_PFC_RNASolo.out/Gene/raw/features.tsv
Result/STAR/14510_PFC_RNASolo.out/Gene/raw/barcodes.tsv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0

Removing output files of failed job scrna_map since they might be corrupted:
Result/STAR/14510_PFC_RNAAligned.sortedByCoord.out.bam, Result/STAR/14510_PFC_RNAAligned.sortedByCoord.out.bam.bai
Shutting down, this might take some time.

I have modified the Snakefile and am now running MAESTRO again.
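The path change described above can be sketched as a small helper that derives the expected STARsolo output files from the chosen feature type, so that the rule outputs follow `Gene` or `GeneFull` automatically. This is a hedged illustration only: `solo_raw_outputs` is a hypothetical function, not the code used in the MAESTRO Snakefile, and the directory layout is taken from the paths quoted in this comment.

```python
from pathlib import PurePosixPath


def solo_raw_outputs(sample, feature="Gene"):
    """Expected raw count files for a sample, given the --soloFeatures value."""
    raw_dir = PurePosixPath("Result/STAR") / f"{sample}Solo.out" / feature / "raw"
    # The three files listed in the MissingOutputException message.
    return [str(raw_dir / name) for name in ("matrix.mtx", "features.tsv", "barcodes.tsv")]


# With GeneFull, the files land under .../GeneFull/raw/ instead of .../Gene/raw/.
for path in solo_raw_outputs("14510_PFC_RNA", feature="GeneFull"):
    print(path)
```

Deriving the paths from one `feature` value keeps the mapping and QC rules consistent with whatever is passed to STAR.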

crazyhottommy commented 3 years ago

Thanks for reporting. We will keep this in mind and include it in our next release!

crazyhottommy commented 3 years ago

Hi, we just made a new release, MAESTRO 1.5.1, which supports single-nuclei data. Can you please give it a try? Thanks!

Dazcam commented 3 years ago

Thanks for the update. Unfortunately, I had to abandon MAESTRO because of the issues I was having around the time I posted. I now have a well-developed pipeline of my own for my single-nuclei data, but I will keep an eye on MAESTRO's development and may consider using it in the future.

crazyhottommy commented 3 years ago

Thanks for the feedback!

njohnso6 commented 1 year ago

I got the same error when running the multiome pipeline with the newest version, 1.5.4 (only available on the macs3 fork):

EXITING because of fatal error: buffer size for SJ output is too small
Solution: increase input parameter --limitOutSJcollapsed

I have yet to try the solution previously proposed. Will let you know.