mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License
12 stars 4 forks source link

common_sites.min3rep.bed file is empty #13

Open lulumagic7 opened 1 year ago

lulumagic7 commented 1 year ago

Hi Mflamand,

I'm running the bulk DART data, I have 2 replicates for YTH-MUT, 3 replicates for YTH. Firstly, I got bam files using STAR, then I ran parseBAM.pl to get Dmut2.matrix.gz, Dmut3.matrix.gz, Dyth1.matrix.gz, Dyth2.matrix.gz, Dyth3.matrix.gz: perl /usr/local/bullseye/1.0.0/Code/parseBAM.pl --input /data/epifluidlab/lulu/hek293t/3_star/output/Dmut2Aligned.sortedByCoord.out.bam --output /data/epifluidlab/lulu/hek293t/3_star/parseBam/Dmut2.matrix --cpu 4 --minCoverage 10 –removeDuplicates

Then I ran Find_edit_site.pl on 3 YTH files against the 2 YTH-MUT files to get Dyth1_Dmut2.bed, Dyth2_Dmut2.bed, Dyth3_Dmut2.bed, Dyth1_Dmut3.bed, Dyth2_Dmut3.bed, Dyth3_Dmut3.bed: perl /usr/local/bullseye/1.0.0/Code/Find_edit_site.pl --annotationFile /data/epifluidlab/lulu/hek293t/4_ref/annotation.refFlat --EditedMatrix Dyth3.matrix.gz --controlMatrix Dmut3.matrix.gz --minEdit 5 --maxEdit 90 --editFoldThreshold 1.5 --MinEditSites 3 --cpu 4 --outfile Dyth3_Dmut3.bed --fallback /data/epifluidlab/lulu/gencode/GRCh38.primary_assembly.genome.fa --verbose

Then I ran summarize_sites.pl to get dythmut2.bed and dythmut3.bed, and then I merged the replicates files commonsites.bed: perl /usr/local/bullseye/1.0.0/Code/summarize_sites.pl --MinRep 3 --mut 3 dythmut2.bed dythmut3.bed > commonsites.bed. But my commonsites.bed is empty. Do you know how to fix it? Thanks so much!

Best,

Lu

mflamand commented 1 year ago

Hi Lu,

It is difficult to say without seeing the date. However, I think I can see the problem. In your last provided line : perl /usr/local/bullseye/1.0.0/Code/summarize_sites.pl --MinRep 3 --mut 3 dythmut2.bed dythmut3.bed > commonsites.bed you are trying to find common sites in 2 files (dythmut2.bed and dythmut3.bed) but are using the --MinRep 3 option. This would only find sites found in at least 3 replicates, but you are only providing 2 (from each bed files). So you would not expect any site int output file.

I would first merge sites found in each DART sample with

summarize_sites.pl --MinRep 2 --repOnly --mut 3 Dyth1_Dmut2.bed Dyth1_Dmut2.bed > Dyth1.bed summarize_sites.pl --MinRep 2 --repOnly --mut 3 Dyth2_Dmut2.bed Dyth2_Dmut2.bed > Dyth2.bed summarize_sites.pl --MinRep 2 --repOnly --mut 3 Dyth3_Dmut2.bed Dyth3_Dmut2.bed > Dyth3.bed

Then you can find the common sites across all replicates:

summarize_sites.pl --MinRep 3 --mut 3 Dyth1.bed Dyth2.bed Dyth3.bed > commonsites.bed

The same results can also be obtained using a combination of bedtools intersect, merge and unix sort. if the script is not working for you somehow.

Please let me know if that fixes the problem.

Best,

lulumagic7 commented 1 year ago

Hi Mflamand,

Thanks so much for your help! it works now! I just got the filtered file, out.RAC.bed.

image

it's a bed6 file, I'm wondering if you could share the scripts to do the downstream analysis, such as MetaplotR script and Homer script?

Thanks so much!

Best,

Lu

mflamand commented 1 year ago

Hi Lu,

I am glad it is now working!

I am don't really have premade scripts for MetaplotR and Homer.

for MetaplotR, I follow the instructions here https://github.com/olarerin/metaPlotR. with some custom plotting in R afterwards.

if you have already generated the prerequisite files from the genome fasta and genepred, you can then run:

perl annotate_bed_file.pl --bed m6a.sorted.bed --bed2 mm10_annot.sorted.bed > annot_m6a.sorted.bed

perl rel_and_abs_dist_calc.pl --bed annot_m6a.sorted.bed --regions region_sizes.txt > m6a.dist.measures.txt

the output file can then be read in R for plotting:

here is an example code plotting the results :

#load ggplot and tidy verse
library(tidyverse)

#read in file from metaplotR

m6a.dist<-read.delim("metagene/All.min2.dist.measures.txt", header = T) %>%
  mutate(position = case_when(rel_location <=1 ~ "5UTR",
                              rel_location >1 & rel_location < 2 ~ "cds",
                              rel_location >= 2 ~ "3UTR")) %>%
  mutate(., len = utr5_size + cds_size + utr3_size) %>%
  group_by(., gene_name ) %>%
  arrange(.,-len) %>%
  distinct(.,gene_name, coord, .keep_all = T) %>%
  arrange(.,chr, coord) %>%
  ungroup() 

# get scaling factors for UTR lengths 

utr5.SF <- median(m6a.dist$utr5_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)
utr3.SF <- median(m6a.dist$utr3_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)

# rescale position in 5'UTRs and 3'UTRs to account for differences in length with CDS

m6a.dist<-m6a.dist %>% 
  mutate( scaled_location = case_when(
                          rel_location < 1 ~ scales::rescale(rel_location, to = c(1-utr5.SF, 1), from = c(0,1)),
                          rel_location >= 1 & rel_location <= 2 ~ rel_location,
                          rel_location > 2 ~ scales::rescale(rel_location, to = c(2, 2+utr3.SF), from = c(2,3)),
                          )) 

 #use ggplot to generate density plot

ggplot(m6a.dist,aes(x=scaled_location))+
 geom_density()+
   geom_vline(xintercept = 1:2, col = "red", size=0.2) + # add lines at 5'-CDS and CDS-3'UTR junctions
   labs(y="Relative density", x=paste0("m\U2076","A sites relative position"))+   # add custom label
   scale_x_continuous(breaks = c(1,2,3), labels = c("Start","Stop","pA"))+
   scale_y_continuous(expand = expansion(0,0))+
   theme_classic()

For homer, you should use findMotifsGenome.pl. A good starting point could be:

findMotifsGenome.pl out.RAC.bed mm10 homer_results -len 6,8 -size 10 -rna

this will look for motifs of length 6 and 8 in a region (size) of 10nt around each site in the out.RAC.bed file. This means that the motif is not necessarily centered on the edit site. For this you could run with the same motif length (-len 6) and RNA size (-size 6). You may want to run homer without first filtering for RAC sites, as you are sure to enrich for the RAC motif in this case.

I think homer uses the ncbi notation for assembly names (eg. 1, not chr1) so you may need to change the bed file to this format to reflect this before you run it. You can do this with:

perl -pne 's/^chr//;' out.RAC.bed > out.RAC.ncbi.bed 

Let me know if you have another question

lulumagic7 commented 1 year ago

Hey Mflamand,

Thanks so much for your so detailed information!

I tried both the metaPlotR and Homer, and still have something want to discuss with you.

For the metaPlotR, I used hg38 reference genome, but I got some error because there was no chr1_KN196472v1_fix.fa in the hg38 chroms files (I already contact the authors, but haven't got the response). so I only can use the hg19 instead?

In addition, I tried the package Guitar (https://rdrr.io/bioc/Guitar/man/GuitarPlot.html) using the out.RAC.bed as input:

p1 <- GuitarPlot(txGTF = "/data/epifluidlab/lulu/m6A_analysis/exompeak2/gencode.v42.primary_assembly.annotation.gtf", 
                stBedFiles = "/data/epifluidlab/lulu/hek293t/3_star/parseBam/out.RAC.bed", 
                headOrtail = FALSE,
                enableCI = FALSE, 
                mapFilterTranscript = TRUE, 
                pltTxType = c("mrna"),
                stGroupName = "yth")
ggsave(p1,filename = 'yth.png')

image

Does this plot make sense to you?

Homer uses the UCSC notations (chr1...), so I just use the out.RAC.bed or the out.bed (no RAC motif enrich) files as input. Then I successfully got the plots HomerRAC

HomernoRACenrich

In addition, I have another new question. I loaded the bam files (Dmut2Aligned.sortedByCoord.out.bam, Dmut3Aligned.sortedByCoord.out.bam, Dyth1Aligned.sortedByCoord.out.bam, Dyth2Aligned.sortedByCoord.out.bam, Dyth3Aligned.sortedByCoord.out.bam) which I got from the STAR alignment in the IGV browser, and searched for the ACTB gene, I could tell that there were some differences between yth files and mut files, and then I zoomed in, and clicked on one of the different positions, and got the percentage of the A, C, T, G, no U. my question is how should I observe the C to U using the IGV browser? ACTB

ACTB2

mflamand commented 1 year ago

Hi,

For metaplotR, hg38 can be used. The problem is with the alternate and fixed chromosome sequences. If there is a mismatch between the provided genome.fasta file and the genepred file, there will be an error. For this kind of analysis, it can be best to use the "analysis set" genome files from UCSC which excludes those sequences (which can also interfere with alignement in RNA-seq by causing multi mapping). For the genepred file, you can also add a filter in the UCSC table browser when generating the genepred file. for example, in the optional section, filter: "create" then: chrom doesn't match _

This would remove all lines with an underscore (_) in the chromosome name and would remove these from the final file. It should fix your problem.

Otherwise, the output from guitar plot looks fine to me with a nice enrichment near the stop codon.

The motifs in Homer looks OK, you could get something even better by trying more stringent size/length parameter are using a background list of shuffled sites.

For IGV, I think it looks great. A few things to keep in mind, IGV is a genome browser, it will only show DNA bases, but we can assume that for RNA-seq, a "T" means a "U". Moreover, ActB is on the reverse strand so C-to-T (or C-to-U when it comes from RNA-se) will appear as G-to-A.

By default IGV shows mutation (in blue/red or green/orange) when there is more than 20% mismatches. you can change this in : View/Preferences/Alignements and then Coverage Track Options: Coverage allele-fraction threshold: default, 0.2 is 20%, you could change to 0.05 for 5%.

Best,

lulumagic7 commented 1 year ago

Hi Morning!

Thanks so much for your detailed explanations!

I downloaded the "analysis set" genome files from UCSC and generated the filtered genepred file.

3

1

2

But still, they didn't match, so I got errors.

Do you still have any idea to fix it?

Best,

Lu

mflamand commented 1 year ago

Hi,

I don't have access to a HPC with the perl packages needed to run metaplotR right now. So I can't test if it is working.However I would make sure of the following:

The filter is doesn't match _, not *

image

You may need to remove the first line of the genepred file which contains the filter used, by default make_annot_bed.pl only removes 1 line (the header):

perl -i.bak -ne 'print $_ unless /^#/' V42.genepred

you can list all the chromosomes in the genepred file:

>cat V42.genepred | cut -f3 | uniq 
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22 

There need to be a fasta file matching this name exactly in the the genome directory. Indeed, for metaplotR, the -genomeDir option needs to point to a folder containing individual chromosomes.fasta, not a single fasta.

When I get the data from UCSC, i get:

>ls hg38.analysisSet.chroms 
chr1.fa             chr22_KI270732v1_random.fa  chrUn_KI270310v1.fa     chrUn_KI270391v1.fa     chrUn_KI270528v1.fa
chr10.fa            chr22_KI270733v1_random.fa  chrUn_KI270311v1.fa     chrUn_KI270392v1.fa     chrUn_KI270529v1.fa
chr11.fa            chr22_KI270734v1_random.fa  chrUn_KI270312v1.fa     chrUn_KI270393v1.fa     chrUn_KI270530v1.fa
chr11_KI270721v1_random.fa  chr22_KI270735v1_random.fa  chrUn_KI270315v1.fa     chrUn_KI270394v1.fa     chrUn_KI270538v1.fa
chr12.fa            chr22_KI270736v1_random.fa  chrUn_KI270316v1.fa     chrUn_KI270395v1.fa     chrUn_KI270539v1.fa
chr13.fa            chr22_KI270737v1_random.fa  chrUn_KI270317v1.fa     chrUn_KI270396v1.fa     chrUn_KI270544v1.fa
chr14.fa            chr22_KI270738v1_random.fa  chrUn_KI270320v1.fa     chrUn_KI270411v1.fa     chrUn_KI270548v1.fa
chr14_GL000009v2_random.fa  chr22_KI270739v1_random.fa  chrUn_KI270322v1.fa     chrUn_KI270412v1.fa     chrUn_KI270579v1.fa
chr14_GL000194v1_random.fa  chr2_KI270715v1_random.fa   chrUn_KI270329v1.fa     chrUn_KI270414v1.fa     chrUn_KI270580v1.fa
chr14_GL000225v1_random.fa  chr2_KI270716v1_random.fa   chrUn_KI270330v1.fa     chrUn_KI270417v1.fa     chrUn_KI270581v1.fa
chr14_KI270722v1_random.fa  chr3.fa             chrUn_KI270333v1.fa     chrUn_KI270418v1.fa     chrUn_KI270582v1.fa
chr14_KI270723v1_random.fa  chr3_GL000221v1_random.fa   chrUn_KI270334v1.fa     chrUn_KI270419v1.fa     chrUn_KI270583v1.fa
chr14_KI270724v1_random.fa  chr4.fa             chrUn_KI270335v1.fa     chrUn_KI270420v1.fa     chrUn_KI270584v1.fa
chr14_KI270725v1_random.fa  chr4_GL000008v2_random.fa   chrUn_KI270336v1.fa     chrUn_KI270422v1.fa     chrUn_KI270587v1.fa
chr14_KI270726v1_random.fa  chr5.fa             chrUn_KI270337v1.fa     chrUn_KI270423v1.fa     chrUn_KI270588v1.fa
chr15.fa            chr5_GL000208v1_random.fa   chrUn_KI270338v1.fa     chrUn_KI270424v1.fa     chrUn_KI270589v1.fa
chr15_KI270727v1_random.fa  chr6.fa             chrUn_KI270340v1.fa     chrUn_KI270425v1.fa     chrUn_KI270590v1.fa
chr16.fa            chr7.fa             chrUn_KI270362v1.fa     chrUn_KI270429v1.fa     chrUn_KI270591v1.fa
chr16_KI270728v1_random.fa  chr8.fa             chrUn_KI270363v1.fa     chrUn_KI270435v1.fa     chrUn_KI270593v1.fa
chr17.fa            chr9.fa             chrUn_KI270364v1.fa     chrUn_KI270438v1.fa     chrUn_KI270741v1.fa
chr17_GL000205v2_random.fa  chr9_KI270717v1_random.fa   chrUn_KI270366v1.fa     chrUn_KI270442v1.fa     chrUn_KI270742v1.fa
chr17_KI270729v1_random.fa  chr9_KI270718v1_random.fa   chrUn_KI270371v1.fa     chrUn_KI270448v1.fa     chrUn_KI270743v1.fa
chr17_KI270730v1_random.fa  chr9_KI270719v1_random.fa   chrUn_KI270372v1.fa     chrUn_KI270465v1.fa     chrUn_KI270744v1.fa
chr18.fa            chr9_KI270720v1_random.fa   chrUn_KI270373v1.fa     chrUn_KI270466v1.fa     chrUn_KI270745v1.fa
chr19.fa            chrEBV.fa           chrUn_KI270374v1.fa     chrUn_KI270467v1.fa     chrUn_KI270746v1.fa
chr1_KI270706v1_random.fa   chrM.fa             chrUn_KI270375v1.fa     chrUn_KI270468v1.fa     chrUn_KI270747v1.fa
chr1_KI270707v1_random.fa   chrUn_GL000195v1.fa     chrUn_KI270376v1.fa     chrUn_KI270507v1.fa     chrUn_KI270748v1.fa
chr1_KI270708v1_random.fa   chrUn_GL000213v1.fa     chrUn_KI270378v1.fa     chrUn_KI270508v1.fa     chrUn_KI270749v1.fa
chr1_KI270709v1_random.fa   chrUn_GL000214v1.fa     chrUn_KI270379v1.fa     chrUn_KI270509v1.fa     chrUn_KI270750v1.fa
chr1_KI270710v1_random.fa   chrUn_GL000216v2.fa     chrUn_KI270381v1.fa     chrUn_KI270510v1.fa     chrUn_KI270751v1.fa
chr1_KI270711v1_random.fa   chrUn_GL000218v1.fa     chrUn_KI270382v1.fa     chrUn_KI270511v1.fa     chrUn_KI270752v1.fa
chr1_KI270712v1_random.fa   chrUn_GL000219v1.fa     chrUn_KI270383v1.fa     chrUn_KI270512v1.fa     chrUn_KI270753v1.fa
chr1_KI270713v1_random.fa   chrUn_GL000220v1.fa     chrUn_KI270384v1.fa     chrUn_KI270515v1.fa     chrUn_KI270754v1.fa
chr1_KI270714v1_random.fa   chrUn_GL000224v1.fa     chrUn_KI270385v1.fa     chrUn_KI270516v1.fa     chrUn_KI270755v1.fa
chr2.fa             chrUn_GL000226v1.fa     chrUn_KI270386v1.fa     chrUn_KI270517v1.fa     chrUn_KI270756v1.fa
chr20.fa            chrUn_KI270302v1.fa     chrUn_KI270387v1.fa     chrUn_KI270518v1.fa     chrUn_KI270757v1.fa
chr21.fa            chrUn_KI270303v1.fa     chrUn_KI270388v1.fa     chrUn_KI270519v1.fa     chrX.fa
chr22.fa            chrUn_KI270304v1.fa     chrUn_KI270389v1.fa     chrUn_KI270521v1.fa     chrY.fa
chr22_KI270731v1_random.fa  chrUn_KI270305v1.fa     chrUn_KI270390v1.fa     chrUn_KI270522v1.fa     chrY_KI270740v1_random.fa

Since all the chromosomes in the genepred are also in the fasta folder, it should work.

Alternatively, you can use the modified version of the script I just added in the code/script directory: annotate_bed_file_bullseye.pl. This one should work on a both a single fasta file or on a directory containing multiple fasta files. It does work on my current system and the perl libraries used by Bullseye.

lulumagic7 commented 1 year ago

Hey,

I successfully get the plot using metaPlotR under your guidance! and then I input this result into Rstudio, and followed the script you sent to me before. (I got an error as there was no "score", so I deleted the line "mutate(dataset="non_filtered") %>% mutate(score=as.numeric(score)" ). Take a look at this image, gene distribute

I also tried the annotate_bed_file_bullseye.pl under Bullseye, but it seems doesn't work in our system. image should I revise the format to a Perl script like your other Perl scripts?

mflamand commented 1 year ago

Hi Lu, I am glad you have it working.

Indeed those R lines were for comparing different analysis set and I forgot to removed them when I copied it here.

For the perl script, it seems I did something wrong when I uploaded it last time. I have now fixed it. The new file is: make_annot_bed_bullseye.pl in the same directory. This one will also work with the unfiltered genepred file (directly from table browser without filter) so feel free to use in the future if you need to. It seems you have it working anyways.

Best,

lulumagic7 commented 1 year ago

Hey Mflamand,

I just tried the new perl script, and it works! Awesome! Thanks so much for your help and patience!

Best,

Lu

lulumagic7 commented 11 months ago

Hey Mflamand,

Could I ask you another naive question? for the distribution plot, as I posted before, image so I had the (yth) - (yth-mut) plot. is that possible I also could plot out the distribution plot only for (yth-mut)?

Thanks so much!

mflamand commented 11 months ago

Hi,

yes it is possible (and a good control).

I would suggest to identify the sites in your 4th-mut matrix against the genome. Using something like this for each yth_mut file:

perl /usr/local/bullseye/1.0.0/Code/Find_edit_site.pl --annotationFile /data/epifluidlab/lulu/hek293t/4_ref/annotation.refFlat --EditedMatrix Dmut2.matrix.gz --minEdit 5 --maxEdit 90 --MinEditSites 3 --cpu 4 --outfile Dmut2_genome.bed --genome /data/epifluidlab/lulu/gencode/GRCh38.primary_assembly.genome.fa --verbose

you can then find all sites common to both replicates:

summarize_sites.pl --MinRep 2 --mut 3 Dmut2_genome.bed Dmut3_genome.bed > common_yth_mut_sites.bed

or found in any replicate

summarize_sites.pl --MinRep 1 --mut 3 Dmut2_genome.bed Dmut3_genome.bed > unique_yth_mut_sites.bed

and then use the resulting bed file to run metaplotR as above.

lulumagic7 commented 11 months ago

Hi,

I plotted it out using the RAC filtered bed files.

ythmut

Do you think this is a good plot?

I also tried the bed files without RAC filtering, but there seems something wrong with the previous out.bed file. When I tried both Guitar and metaPlotR, I got errors. image

do you have some ideas to fix it?

Thanks a lot!

Best,

Lu

mflamand commented 11 months ago

The shift of edits towards the 3'UTR is what we have seen in the original DART-seq paper, so it looks fine. smoothing on the density plot could be applied to remove the peaks in each curve, which are unlikely to represent real signal.

As for the error, I cannot tell without seeing what is in the original bed file.

Did you run summarize_sites.pl on these files? If not, the header line could be causing this.

Otherwise it seems that bedtools intersect is throwing an error because there is a mismatch between the order of chromosomes the bed file and the hg38_annot.sorted.bed file.

if you run the following, do you get the same chromosome order?

cut -f 1 hg38_annot.sorted.bed| uniq cut -f 1 out.bed file |uniq

If not, you can sort each file with : sort -k1,1 -k2,2n

Alternatively you can run a mock RAC filtering as so:

perl -F'\t' -ane 'if ($_=~/^\#/){next;} if($F[5] =~ /\+/){ $F[1] -= 2; } else { $F[2] += 2; } print join("\t", @F)' out.bed | bedtools getfasta -bedOut -s -fi genome.bed -bed - > out.3nt.sequence.bed

perl -F'\t' -ane 'if($F[5] =~ /\+/){ $F[1] += 2; } else { $F[2] -= 2; } if( 1 ){chomp;pop @F; print join ("\t", @F ), "\n"}' out.3nt.sequence.bed > out.mock.bed

Then the files would be treated exactly the same with or without RAC filtering so it could remove the error.

Best,