mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License

Are all naturally occurring C-to-U mutations (SNPs) excluded in Bullseye? #10

Closed: ghost closed this issue 1 year ago

ghost commented 1 year ago

Hi Mathieu, I want to use Bullseye for single-cell DART-seq to identify C-to-U sites. I would like to know whether Bullseye already includes SNP removal.

Your description of Bullseye mentions that PCR duplicates were removed from the BAM files using Samtools (1.11) fixmate and markdup with the -r option.

What about known mutations in the human genome (dbSNP 150) and endogenous C-to-U editing sites identified by sequencing wild-type HEK293T cells? Are these sites removed when running the Bullseye parseBAM.pl and Find_edit_site.pl scripts?

If SNPs are not removed by Bullseye, then I need to remove them from the single-cell RNA-seq BAM file myself.

Many thanks, Zongmin Liu

mflamand commented 1 year ago

Hi Zongmin,

When using Find_edit_site.pl, SNPs can be removed with the --filterBed snp.bed option, by providing a BED file with the positions of all SNPs from dbSNP. If you want to exclude more sites, you can add another --filterBed endogenous_c2u.bed option. If you want to see which sites were excluded, you can add the --printFilteredSites option, which prints a file "filename.excluded_sites.bed" in the same directory containing the excluded sites.

Alternatively, you can do this after running Find_edit_site.pl using bedtools intersect.
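For readers who want to see what this post-hoc filtering amounts to, here is a minimal Python sketch of the idea behind `bedtools intersect -v` (keep only sites that overlap nothing in the exclusion file). It is an illustration under assumptions, not part of Bullseye: the function names are hypothetical, and it assumes the Find_edit_site.pl output is tab-separated with chromosome, 0-based start, and end in the first three columns. Strand is ignored, as in the default bedtools behavior.

```python
def load_excluded_positions(bed_lines):
    """Collect (chrom, 1-based position) pairs covered by BED intervals.

    BED intervals are 0-based and half-open, so an interval
    (start, end) covers 1-based positions start+1 .. end.
    """
    excluded = set()
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        for pos in range(int(start) + 1, int(end) + 1):
            excluded.add((chrom, pos))
    return excluded


def drop_excluded_sites(site_lines, excluded):
    """Return site lines that do not overlap any excluded position.

    Assumes (hypothetically) that each site line starts with
    chrom, start, end, like a standard BED record.
    """
    kept = []
    for line in site_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        overlaps = any(
            (chrom, pos) in excluded
            for pos in range(int(start) + 1, int(end) + 1)
        )
        if not overlaps:
            kept.append(line.rstrip("\n"))
    return kept
```

Expanding every interval into single positions is fine for SNP-sized records but would use a lot of memory for long regions; for real data, bedtools itself is the more practical choice.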

I am not sure exactly what was done for the single cell analysis as I am not the one who has done it.

However, since the analysis compares DART data to a control sample, endogenous/background levels of C-to-U editing are already accounted for. This assumes that endogenous editing is the same in the sequenced populations, which should be the case if the control cells come from the same population as the DART cells.

Please let me know if you have any other questions

ghost commented 1 year ago

Thanks, Mathieu. Your suggestions are very helpful. I noticed that your Find_edit_site.pl handles SNPs in this section:

    # Here we are reading an optional bed files which contains known SNPs or regions to be excluded
    # several file can be provided, all regions will be added to the same hash table, strand is not considered for filtering these regions
    my $excluded_sites={};
    if (@bedfiles){
        say "sites in files: @bedfiles will not be considered in analysis";# if $verbose;
        foreach my $files (@bedfiles){
            my $bed_stem = check_chr($files,0);
            if (! $skip_chr_check and $bed_stem =~ /^chr/i and ! $annotation_stem =~ /^chr/i){
                say "Error, annotation file and bed files do not have the same chromosome annotation. Cannot match UCSC or Ensembl style.";
                exit();
            }

            open(my $fh, "<",  $files);
            while(<$fh>){
                next if $_=~/^[\#\n]/;
                my($chr,$start,$end,$strand) = (split("\t",$_))[0,1,2,5];
                foreach my $pos ($start+1..$end){
                    $excluded_sites->{$chr}->{$pos}=1;
                }
            }
        }
    }
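To make the behavior of the quoted block easier to follow, here is a rough Python rendering of the same idea. It is a sketch for illustration only, not Bullseye code: the names `build_excluded_sites` and `is_excluded` are made up, and the chromosome-naming check is left out. It shows the two points that matter for filtering: several BED sources are merged into one lookup, and 0-based half-open BED intervals are converted to 1-based positions.

```python
from collections import defaultdict


def build_excluded_sites(bed_sources):
    """Merge several BED sources into one chrom -> {positions} lookup.

    Each source is an iterable of text lines. Only chromosome, start,
    and end are used; strand and any further columns play no role.
    """
    excluded = defaultdict(set)
    for lines in bed_sources:
        for line in lines:
            # Same skip rule as the quoted Perl: comment or empty line.
            if line.startswith("#") or line.startswith("\n"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            # BED start is 0-based, so the first covered base is start + 1.
            excluded[chrom].update(range(start + 1, end + 1))
    return excluded


def is_excluded(excluded, chrom, pos):
    """True if the 1-based position on chrom is in any excluded region."""
    return pos in excluded.get(chrom, ())
```

Because the lookup is keyed only by chromosome and position, an extra column such as a cell barcode in the BED file would simply be ignored by this kind of filter.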

For single-cell SNP sites, I think the BED file should contain $chr, $start, $end, $strand, and $barcode. I will use the second method you recommended: running bedtools intersect after Find_edit_site.pl.