smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe
67 stars 27 forks source link

Issue with mlml: `error: chrom or pos mismatch in both files` #185

Closed PanZiwei closed 3 years ago

PanZiwei commented 3 years ago

To whom it may concern, I am trying to use mlml to analyze oxBS-seq and BS-seq data. But I encountered the error saying error: chrom or pos mismatch in both files. So is there a way to solved the problem since my oxBS-seq is a reduced representation oxBS-seq(RRoxBS-seq)?

(base) [c-panz@sumner-log1 APL_BSseq]$ head -5 APL.oxbs.test.mlml.bed
chr1    10542   +   CpG 1.0 1
chr1    10563   +   CpG 1.0 1
chr1    10571   +   CpG 1.0 1
chr1    10577   +   CpG 1.0 2
chr1    10579   +   CpG 1.0 2
(base) [c-panz@sumner-log1 APL_BSseq]$ head -10 APL.bs.test.mlml.bed
chr1    10484   +   CpG 1.0 1
chr1    10489   +   CpG 1.0 1
chr1    10493   +   CpG 1.0 1
chr1    10497   +   CpG 1.0 1
chr1    10525   +   CpG 1.0 1
chr1    10542   +   CpG 1.0 2
chr1    10563   +   CpG 0.6666666666666666  3
chr1    10571   +   CpG 1.0 3
chr1    10577   +   CpG 1.0 3
chr1    10579   +   CpG 1.0 3
(base) [c-panz@sumner-log1 APL_BSseq]$ $software_path/mlml -u $file_path/APL.bs.test.mlml.bed -m $file_path/APL.oxbs.test.mlml.bed -o $file_path/test.tsv
error: chrom or pos mismatch in both files

Also, if the strand is "-", will the 4th column be GpC instead of CpG?

Thank you so much for your help!

guilhermesena1 commented 3 years ago

Hello,

It looks like from the first 10 lines of your inputs, there are some CpGs that appear in one dataset but not in the other. Were these two files generated through methcounts followed by symmetric-cpgs? And was the methcounts program used for both samples on the same reference genome? We can try to pinpoint the source of the problem by looking at the number of lines generated at each step. Every output of methcounts should have the same number of lines for the same reference genome (one for each cytosine), and the same should be true for symmetric-cpgs. Also, could you confirm that the BED files were not filtered (for example, removing lines with zero CpG counts)?

Regarding the question on strand, since CpGs are palindromic, they will also show as CpGs on the negative strand, however, the symmetric-cpgs program assumes that both strands have similar methylation and sums them up, thus keeping only CpGs in the + strand.

Thank you!

PanZiwei commented 3 years ago

Thank you so much for your explanation. So you are suggesting that I should follow the tutorial to use methpipe from the very first beginning so that I can correctly use mlml? Apart from the tutorial and http://smithlabresearch.org/software/methpipe/, do you have a detailed Jupyter notebook/manual for example usage?

Does mlml accept .bismark.gov files? Right now the files are not generated by methcounts. Instead, I used bismark for alignment, methylation calling, and deduplicate steps. To be more specific, I first used bismark to generate a Bismark Coverage report in a GZIP compressed format (*.bismark.cov.gz) with 6 columns (See explanation https://support.illumina.com/help/BaseSpace_App_MethylSeq_help/Content/Vault/Informatics/Sequencing_Analysis/Apps/swSEQ_mAPP_MethylSeq_Bismark_Coverage.htm):

(base) [c-panz@sumner-log2 APL_BSseq]$ zcat APL-bs_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bismark.cov.gz | head -3
chr1    10484   10484   100 1   0
chr1    10489   10489   100 1   0
chr1    10493   10493   100 1   0

Then I extracted specific columns and add "CpG" and "+" columns to get the input for mlml. But I am a little worried about the strand issue for the CpG part.

guilhermesena1 commented 3 years ago

I would strongly suggest you use the pipeline starting from your bismark SAM output up to the mlml program, it will probably be faster than trying to manually convert between file formats. You would only have to run to-mr to convert your bismark SAM file to MR, sort and remove duplicates from the resulting MR, run methcounts, symmetric-cpgs then mlml. Each program is a single command.

Methpipe expects file formats to be the outputs of its own programs, and the only files it converts are SAM/BAM files generated by other mappers. Each methylation analysis pipeline will usually operate like this, since they often differ between what kinds of outputs they produce, it is uncommon to see "cross-talk" between pipeline file formats, at least as far as I'm aware of, so your best bet is usually to commit to a single pipeline.

In any case, whichever direction you decide to try running mlml, please let us know if you were able to run it to completion or which errors you bump into.

PanZiwei commented 3 years ago

Thank you so much for your help! I have two additional questions:

  1. Should all the 5hmC/5mC/5C levels >=0? I checked the mlml result I got from the BS-seq/oxBSseq without convergence tolerance, I found that the min value of 5hmC column is -1.11022e-16. Is it normal to have such a tiny negative value for 5hmC?
  2. Is it necessary to set up convergence tolerance? Do you have any recommendations for the cut-off? Also, do you did an extra filter step(e.g. filter out the sites with low coverage from BS-seq/oxBS-seq) before the mlml program?

Thank you so much for your help!