mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License
12 stars 4 forks source link

C2U sites #12

Open mmaitenat opened 1 year ago

mmaitenat commented 1 year ago

Hi there,

I have a question regarding the conversion observed in the output of Bullseye for single-cell data. It might be a silly question, and I might be missing something, but I can't seem to find the answer. In the output of Find_edit_site.pl, I observe that the chr2:121983394 position has a C2U change:

chr2    121983393   121983394   B2m|3UTR|C2U|mut=11|NA  0.6111  +   0   10  0.6111  18  C2U 639938
chr2    121983393   121983394   B2m|3UTR|C2U|mut=11|NA  0.6111  +   0   10  0.6111  18  C2U 720184
chr2    121983393   121983394   B2m|3UTR|C2U|mut=27|NA  0.6923  +   0   10  0.6923  39  C2U 487034
chr2    121983393   121983394   B2m|3UTR|C2U|mut=24|NA  0.7273  +   0   10  0.7273  33  C2U 283074
chr2    121983393   121983394   B2m|3UTR|C2U|mut=20|NA  0.7692  +   0   10  0.7692  26  C2U 158340

However, when I refer to the reference genome I used with parseBAM.pl, I see that I have a T at that position:

$samtools faidx GRCm39.genome.fa chr2:121983393-121983394
>chr2:121983393-121983394
CT

My question is: how can I be observing a C2U when I have a T at the reference position?

Thank you.

mmaitenat commented 1 year ago

Forgot to mention, but it may be relevant. The m6a list was obtained using a YTH-mut sample as the control matrix.

mflamand commented 1 year ago

Hi,

Indeed when comparing to a YTH-mut sample, Find_edit_site.pl is not aware of the sequence at the annotated genome (unless the --fallback option is used and there is not enough coverage in the control YTH-mut sample). Instead, it only checks at the sequence in the control file. So it it possible to see a C2U if a T is found at the reference position, but when there is a mutation/SNP in you cell/mouse line at that position. For example the alignement coverage in the YTH-mut sample may show that a "C" is found instead of a "T".

You can look at the coverage at this position using tabix.

tabix YTH_mut.matrix.gz chr2:121983394

this will output: "chr position #A #T #C #G #G #total" and you can see what is the coverage in your control sample.

If this is because of a known SNP, you can filter them out using the "--filterBed SNP.bed" option using a bed files of annotated SNPs.

Please let me know if that answer your question or if you have another question.

If it doesn't, it may be a bug and I will need to look into it.

Best,

mmaitenat commented 1 year ago

Hi,

Thanks for the detailed response. Indeed, I have checked and there's an indel variant at that position. However, when I look at the coverage of the position in the YTH-mut sample, I can see that there are relatively few reads that read a C. Specifically, 948,084 reads contain a T, and 3,325 reads contain a C, out of a total of 15,434 cells. Similarly, in the YTH sample, 1,416,981 reads show a T, while 4,849 reads show a C. Based on that, I don't think it can confidently be said that there is a C2U at that position, do you?

Could this Bullseye's behavior in calling variants be due to the variability that exists in the YTH-mut cells? Do you think that it would be safer to run Bullseye individually on the YTH and YTH-mut samples, and then manually discard those present in YTH-mut from the YTH sample?

Thank you.

mflamand commented 1 year ago

Hi,

Based on the information you gave me, I would say there is definitely something odd going on and I would not say that there is a C2U at this position.

Could this Bullseye's behavior in calling variants be due to the variability that exists in the YTH-mut cells?

If we look at the results you provided:

chr2    121983393   121983394   B2m|3UTR|C2U|mut=11|NA  0.6111  +   0   10  0.6111  18  C2U 639938
chr2    121983393   121983394   B2m|3UTR|C2U|mut=11|NA  0.6111  +   0   10  0.6111  18  C2U 720184
chr2    121983393   121983394   B2m|3UTR|C2U|mut=27|NA  0.6923  +   0   10  0.6923  39  C2U 487034
chr2    121983393   121983394   B2m|3UTR|C2U|mut=24|NA  0.7273  +   0   10  0.7273  33  C2U 283074
chr2    121983393   121983394   B2m|3UTR|C2U|mut=20|NA  0.7692  +   0   10  0.7692  26  C2U 158340

It seems that there is no coverage in the YTHmut at this position: column 7 is the number of mutations in control sample and column 8 is the coverage (or 10 if it is less then -ControlMinCoverage if specified). However, you say there are lots of reads at that position. This means that there is either a bug here or the control file used was not from a pseudobulk sample. For single single cell analysis, we treated the Control (YTHmut) samples as a "pseudobulk" sample: we ran parseBam.pl in "bulk" mode and not in "SingleCell" mode.

Why? By averaging the coverage and mutations in the Control samples, we remove the cell to cell variability in the YTH-mut (while also loosing the information on each individual cells).

The default behaviour in Find_edit_site.pls is to compare the lines in the DART matrix file with those in the Control matrix file. If there are the same position, it compares coverage, mutations. Then it will move to the next line on the DART matrix only. It will go over each line at this genomic position (each individual cell) until it hits a new genomic position (ie. from chr2:121983394 to chr2:121983395) and then reads the next line in the Control matrix file until it matches the new position in the DART matrix. So this means that it if there are multiple lines in the Control matrix, only the first one will be used. If the Control matrix was processed in "SingleCell" mode, the first line had low coverage (as expected in single cell cell datasets), and Find_edit_sites would assume 0 mutation and 10 coverage as seen above instead. This could explain why it called those sites in some cells. If it had used the pseudobulk, it should have seen: 948,084 (T) / ( 3,325 (C)+ 948,084 (T) ) = 99.6% editing in YTHmut and the sites would have been filtered out. (It would actually have skipped that position)

Can you confirm that a pseudobulk matrix was used? Does the tabix command above reports a single line per position, or multiple lines (one per cell) for the YTHmut?

If you did use the pseudobulk, then there is a bug and I will need to look at in more details. Please let me know.

If you used a single cell dataset instead of pseudobulk, you can use collapse_matrix.pl in /scripts to collapse the matrix to a pseudobulk, or rerun parseBAM.pl in bulk mode.

perl collapse_matrix.pl -i input.matrix.gz -o output.matrix

#the output will be compressed and tabix indexed, do not add .gz in name when running the command

Do you think that it would be safer to run Bullseye individually on the YTH and YTH-mut samples, and then manually discard those present in YTH-mut from the YTH sample?

Running the single cell analysis on both YTH and YTHmut against the genome is not a bad idea. You could also compare single cell YTH and YTHmut to pseudobulk YTHmut and to identify likely background edits. This would allow a more complete picture of editing profiles (% editing, % cells with edits) in both population. Because of the variability in single cell dataset, you would need to choose the good cutoff for removing sites.

I would agree with your statement and with being careful when calling edits. Some sites are more prone to background editing. For example, in human cells we routinely filter out sites obtained from cells expressing Apobec1 alone.

Please let me know if it is working or if you have more questions.