mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License
12 stars 4 forks source link

Problem using Find_edit_site.pl for single cell data #7

Closed XRCHAM closed 1 year ago

XRCHAM commented 1 year ago

Hi there,

I wonder how to use find_edit_site.pl for the matrix generated from scRNA-seq data. I could not find cell_ID information anymore after using find_edit_site.pl to the matrix files generated by "parseBAM.pl --mode SingleCell --Cell_ID_pattern 10X", and also the control coverages are all 10, and control editing rates are all 0 as below:

image

My command: perl Find_edit_site.pl --annotationFile gencode.vM23.refFlat --EditedMatrix Exp.cov5.rdup.output.matrix.gz --controlMatrix Ctrl.cov5.rdup.output.matrix.gz --minEdit 5 --maxEdit 90 --editFoldThreshold 1.5 --MinEditSites 3 --cpu 4 --outfile Exp.vs.Ctrl.rdup.output.bed --fallback Mus_musculus.GRCm38.dna.primary_assembly.fa

Thanks!

mflamand commented 1 year ago

Hi,

I believe that you will also need to use "-SingleCell" option when using Find_RNA_edit_sites.pl to keep track of cell ID. Have you looked in the matrix files if these are there?

The 10 coverage and 0 mutation usually means that there was no coverage at this position in the control matrix. When using the fallback option, it will report the minimum coverage filter value default is 10). This can be changed with: -ControlMinCoverage ( and -EditedMinCoverage for dart sample). However in this case "genome" should be reported instead of "control" in the fourth column. This may be a bug I will need to look at.

Can you confirm if you have coverage that position in the matrix file? You can do so with

tabix file.matrix chr1:24613255 | head

Let m

XRCHAM commented 1 year ago

Thanks for your reply. It looks like this: (base) ..$ tabix output.matrix.gz chr1:24613255 | head chr1 24613255 0 10 0 0 0 10 AAACGAATCTGCCTGT chr1 24613255 0 10 0 0 0 10 AACGTCAAGTAAACGT chr1 24613255 0 10 0 0 0 10 AAGCCATGTACCGGAA chr1 24613255 0 10 0 0 0 10 AAGTTCGTCGAGATGG chr1 24613255 0 10 0 0 0 10 AATCACGGTGGCAGAT chr1 24613255 0 10 0 0 0 10 AATCGACTCAGCAATC chr1 24613255 0 10 0 0 0 10 AATGGCTTCGTCCTCA chr1 24613255 0 10 0 0 0 10 ACAGGGAAGCGCATCC chr1 24613255 0 10 0 0 0 10 ACATGCATCTCGGGAC chr1 24613255 0 10 0 0 0 10 ACCCTCAGTCTTTCTA

mflamand commented 1 year ago

Thank you for your reply. I think I know what is the issue here.

For analysis of single cell data, we typically use a "pseudobulk" matrix as control. This control file needs to have a single line per genomic position and would not have the cell IDs in the control matrix. By using this pseudobulk sample , we compare editing in each DART cell to the average editing across all control cells and not in any individual control cell. In the Mol cell study, I believe this was generated by combining the bam files from all control cells prior to generating the matrix,

For more detail, I will direct you here : https://pubmed.ncbi.nlm.nih.gov/36042888/, as well as this github page: https://github.com/tegowski/scDARTHEKcells which contain more detail on the single-cell analysis. (which was not done by me).

This pseudo-bulk reference sample can also be generated by collapsing your existing control matrix using the collapse_matrix.pl. which i just added to the Code/scripts directory.

Please let me know if that makes sense.

XRCHAM commented 1 year ago

Thanks for the update. I will try to follow your directions.