maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
MIT License
374 stars 85 forks source link

WGBS produceReport: Error in `read.bismark()` #938

Closed hanrong498 closed 9 months ago

hanrong498 commented 9 months ago

Hi,

I can run WGBS pipeline to the every last step of produceReport. However, I got the error from the read.bismark():

Quitting from lines 197-212 [unnamed-chunk-10] (tmp57km4vvf.WGBS_QC_report_template.Rmd)
Error in `read.bismark()`:
! Could not guess Bismark file type for these files:

Backtrace:
 1. bsseq::read.bismark(snakemake@input[["bedGraphs"]], BACKEND = "HDF5Array")

Referring to the source code, I want to ask what the input file to this function is. If it is the .bedGraph file, I think it is not supported by the read.bismark() function:

> library(bsseq)
> ?read.bismark

File formats:

     The format of each file is automatically detected using the
     internal function ‘bsseq:::.guessBismarkFileType()’.  Files ending
     in ‘.gz’, ‘.bz2’, ‘.xz’, or ‘.zip’ will be automatically
     decompressed to ‘tempdir()’.

  Supported file formats:

       Bismark's 'genome wide cytosine report' (<URL:
       https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords>)
       and 'coverage' (<URL:
       https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords>)
       formats are both supported.  If setting ‘loci = NULL’, then we
       strongly recommend using the 'genome wide cytosine report'
       output format because this includes strand information for each
       locus.  The 'coverage' output does not contain strand
       information and so the ‘strand’ of the returned BSseq object
       will be set to ‘*’ unless stranded ‘loci’ are supplied.

  Unsupported file formats:

       Neither the 'bedGraph' output format (<URL:
       https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords>)
       nor the 'bismark_methylation_extractor' output format (<URL:
       https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords>)
       are supported.  The former does not include the required counts
       of methylated and unmethylated reads hile the is an intermediate
       file containing read-level, rather than locus-level, data on
       methylation.

Thanks for your help in advance!

katsikora commented 9 months ago

Hi @hanrong498 ,

right, it looks like the function would take the bedGraph files produced by MethylDackel as input. I believe MethylDackel should produce bedGraphs that are similar to bismark 'coverage' output. I'll see if I can reproduce the issue. Could you check which bsseq version you have in your WGBS conda env?

snakePipes envInfo should return paths to envs. Then something like conda list -p $path_to_wgbs.yaml_env | grep bsseq should return the information containing bsseq version.

Best wishes,

Katarzyna

hanrong498 commented 9 months ago

Hi Katarzyna,

Thanks for the reply!

### envs/wgbs.yaml
(snakePipes) [hhu@jed upzenk]$ conda list -p /work/upzenk/miniconda3/envs/20d3d6a39823b295d3503a5332b4207b | grep bsseq
bioconductor-bsseq        1.30.0            r41hc247a5b_2    bioconda
### envs/wgbs_dss.yaml
(snakePipes) [hhu@jed upzenk]$ conda list -p /work/upzenk/miniconda3/envs/9fc99f8a3f9c3ab3f5c5cfd619ef5d15 | grep bsseq
bioconductor-bsseq        1.36.0            r43hf17093f_0    bioconda

I am not sure if the wgbs_dss.yaml is also relevant but I still listed it here. Thanks so much!

Best, Hanrong

katsikora commented 9 months ago

Hi Hanrong,

I wasn't able to reproduce your error with snakePipes 2.7.3 on freshly made conda envs. I have obtained exactly the same package version and build for bsseq in the WGBS env as you have: bioconductor-bsseq 1.30.0 r41hc247a5b_2 bioconda

I have a different version than you in the WGBS_DSS env: bioconductor-bsseq 1.18.0 r351hf484d3e_0 bioconda

But, the rule that produces the QC report is using CONDA_WGBS_ENV, which is built using wgbs.yaml, not wgbs_dss.yaml.

Also, I was able to confirm that: a<-bsseq::read.bismark("/test_results/MethylDackel/SRX202087_CpG.bedGraph", BACKEND="HDF5Array") works in 4.1.3 with `bsseq 1.30.0.

Best, Katarzyna

hanrong498 commented 9 months ago

Hi again!

Thank you very much for looking into details. I think the problem is that I am running it in a subset sample (1000 reads). The CpGCoverage is then too low, which makes the .bedGraph an empty output.

I'll close this issue for now. I will try with the whole dataset and hope it works.

Thank you so much!

Best, Hanrong