mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License
12 stars 4 forks source link

Bulk Dartseq results from paper not reproducable #3

Closed pschinke closed 2 years ago

pschinke commented 2 years ago

Hello Mathieu,

I am trying to reproduce the analysis of the Bulk-Dartseq data from the bullseye paper, but I'm not getting good results (no overlapping sites between my results and the GSE180954_Bulk_YTHmut_RACsites.bed file).

This is my workflow:

Downloading files from ebi

Trimming adapters using flexbar (first read with "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC", second read with "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC") according to https://www.neb.com/faqs/2017/07/17/how-should-my-nebnext-small-rna-library-be-trimmed flexbar -r wt3_1.fastq -p wt3_2.fastq -a a1.fasta -a2 a2.fasta -ap ON

Alignment using hg38 index created with GENCODE genome and GTF:

STAR --runThreadN 20 --runMode alignReads --genomeDir starindex --readFilesIn wt3_1.fastq wt3_2.fastq --sjdbGTFfile starindex/gencode.v39.annotation.gtf

Sorting, indexing, removing duplicates

samtools sort -n -o alnWT3.name.bam alnWT3.sam

samtools fixmate -r -m alnWT3.name.bam alnWT3.fix.bam

samtools sort alnWT3.fix.bam -o alnWT3.fix.sorted.bam

samtools markdup -r alnWT3.fix.sorted.bam alnWT3.col.bam

Creating coverage matrix

perl parseBAM.pl --input alnWT3.col.bam --output WT3.matrix --verbose --cpu 12 --removeDuplicates --minCoverage 10

Finding edit sites

perl /home/icb/patrick.schinke/software/Bullseye/Code/Find_edit_site.pl --annotationFile hg38RefFlat --EditedMatrix WT1.matrix.gz --controlMatrix Mut3.matrix.gz --editType C2U --minEdit 10 --maxEdit 80 --editFoldThreshold 1.5 --EditedMinCoverage 10 --ControlMinCoverage 10 --MinEditSites 2 --outfile Rep1Mut3.txt --verbose --cpu 10

(I'm running this 3 times for 1. Wild Type 1 against Mutant 1 Mutant 2 Mutant 3, 2. Wild Type 2 against Mutant 1 Mutant 2 Mutant 3, 3. Wild Type 3 against Mutant 1 Mutant 2 Mutant 3)

Summarizing sites:

perl summarize_sites.pl --minRep 2 1.bed 2.bed 3.bed > SumReps.bed

Only keeping locus columns and calculating common sites (I get zero)

bedtools intersect -s -wa -wb -a GSE180954_Bulk_YTHmut_RACsites.bed -b SumReps.cut.bed | wc -l

Am I making a mistake or missing some important step? There likely are problems already within the alignment, because for some ground truth sites I don't have any coverage in my final bam files (especially in sites on chromosome X) Any help would be super appreciated

mflamand commented 2 years ago

Hi Patrick,

I am not the one who analyzed this dataset nor added the GEO dataset, if you have specific questions about the parameters used I would encourage you to contact the first author on the paper. Nonetheless I think I can help you for the some of questions here.

1) The GENCODE genome assembly has its assembly names as chrN. I think the bed file on GEO was obtained from an ensembl genome alignment with assembly without the 'chr'. for bedtools you should convert one of them so they have the same naming scheme. This can be done using sed,awk,perl or your editor of choice. Let me know if you need help with that.

2) The bed file GSE180954_Bulk_YTHmut_RACsites.bed is for sites that were identified only in the negative control (YTHmut) with a comparison to genome. You should not expect to see overlap with this file using bedtools intersect, as these sites should be mostly filtered out. On GEO, I believe the files WTC12_A_m6Asites.bed.gz, WTC12_B_m6Asites.bed.gz and WTC12_C_m6Asites.bed.gz are the sites identified in each replicates, in cell expressing the DART construct against the YTHmut. In the manuscript, Table S1 contains the list of sites identified in at least 2/3 replicates as an excel document. If you remove the first 2 lines and save it as a tsv document, you will have the bed file.

3) This is odd about the chrX, are you saying there is no coverage in the bam file or no coverage in the matrix file? If there is a problem I would definitely want to fix it.

pschinke commented 2 years ago

Thanks a lot for your answers!

  1. Yes, I fixed that with awk.
  2. That explains a lot haha. Now I'm using the correct file and getting much better results
  3. It's the coverage in the bam files, not the matrix. There doesn't seem to be a problem in the pipeline.