Strand of library - Githubissues

ckuenne commented 2 years ago

hi, this might be a bit of a stupid question: but how does bullseye handle strandedness of the library (first stranded / second strand)? see here: https://chipster.csc.fi/manual/library-type-summary.html

mflamand commented 2 years ago

Hi, The short answer is that it does not. I have kept the parsing of bam files to be done in a similar way that was done in the hyperTRIBE pipeline, which was the inspiration for Bullseye. I have thought of ways to this and may implement this in the future. We simply never had any issues with the way its working now. Let me explain in more detail.

We prepare our libraries using a stranded library prep method, which should result in a good ratio of reads on the correct strand. Bullseye uses samtools view to parse all reads mapped in the genome and construct a matrix of coverage for each position with the number of each nucleotide. Samtools outputs all reads in the same direction (sense), thus this parsing is not stranded. It should be straightforward to split reads based on orientation and add information on strand for each position. This would increase the matrix file size by less than twice. The "strand" decision is taken when the refFlat file is provided. For each transcript in this file, it builds a list of genomic interval for features: CDS, 3UTR, 5UTR, ncRNA, introns, etc depending on option used. and does this for both positive and negative strand. It than looks within these defined intervals how many of each nucleotide are mapped. On the negative strand it reverse complements the coverage. So we are still looking at C-to-U (C-to-T) transitions on either strand, but only when a transcript is annotated at a given position. This can be problematic if 2 RNAs on opposite strands are overlapping. In my experience I did not find instances of sites being found on 2 overlapping highly expressed RNAs. In any case the C-to-T transition can only be in one direction. For example if there is a mutation C-to-T transition on the reverse strand, on an overlapping RNA on the positive strand, we would see an a G-to-A and it would not be called. Of course in this case, the actual frequency of edits would be skewed because we did not keep track of the original strandedness of the read. But as I said these are rare cases. I nonetheless think that keeping track of strand would be an interesting feature to implement.

Does that make sense? Let me know if you have further questions.

ckuenne commented 2 years ago

thanks for taking the time to explain this in detail. as i understand it, bullseye will simply not work with reverse stranded libraries right now. which is really unfortunate, since that is of course the type i have...

but as yous said: this should not be too difficult to implement, since you just have to reverse the mutated base in case of stranded/rev bam files. or change the algorithm to call G>A instead. can you maybe give me a rough timeline when/if you will implement this? because i might have to stick with the old CTK-based pipeline otherwise.

mflamand commented 2 years ago

Maybe I misunderstood, I think that it should work with reverse stranded library as is: the strandedness information is not considered. The mapped position in the genome will be the same no matter the strandedness and is reported 5'-3' in the bam file. It will be important to know strandedness when I implement a version that keeps sense/antisense information. I will look into this at the end of next week.

ckuenne commented 2 years ago

ok, now i got it. took a moment, sorry. you basically just use the reads to find the genomic position and take everything else from the fasta and the annotation in the refFlat. so the actual read sequence or strand is not used, just chr/start/stop of the alignment. that should be fine then. i'll try it with some real data and see how it behaves compared against the old pipeline.

just out of curiosity: do you know about the jacusa tool (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1432-8)? it's for tribe, not dart-seq, so not directly applicable here. but they use the read sequence itself and also try to take into account the location of the mutation inside the read, polymorphisms, alleles, etc.

mflamand commented 2 years ago

Yes you are right, a nuance is that the sequence (from fasta) is parsed to remove soft clipped nucleotides and indels based on the cigar string. The sequence is also split when spanning introns.

I probably have read this paper in the past. I think it has good ideas that may be worth implementing in the future. I am already using filtering based on base quality in another pipeline and I can port it to Bullseye. Filtering or weighing based on homopolymetic sequences or Indels would also be interesting. For human data, I usually filter sites overlapping known SNP using the -filterBed SNP.bed when using Find_edit_sites. Unfortunately I don't know any Java so I can't adapt it directly, but I can try to use the same principles

ckuenne commented 2 years ago

good suggestion with the snp removal. will try that.

i now ran bullseye with a real dataset and get around 50% overlap with the old results. but that of course heavily depends on parameters. so far so good. :+1:

i will now close this issue.

mflamand / Bullseye

Strand of library #4