markrobinsonuzh / DAMEfinder

Finds DAMEs - Differential Allelicly MEthylated regions
MIT License
10 stars 3 forks source link

Allow non-canonical and non-human genomes #34

Closed franrodalg closed 4 years ago

franrodalg commented 4 years ago

Hi @sorjuela ! Great work!

I've been trying to use DAMEfinder with a "non-canonical" mouse MM10 genome (which includes an rDNA unit not present normally), and I was getting NULL results in each entry of the extract_bams output. I wasn't sure what could be the issue, so I copied and edited the function to debug it. I realised that it only worked if I removed

  1. the build parameter from the readVcf call (lines 18-19):

    vcf <- VariantAnnotation::readVcf(vcfFiles[samp], 
                                      param = param)

    and

  2. the if statement checking the chromosome names (lines 26-31):

      if (length(grep(t, c(as.character(seq(from = 1, to = 22, 
                                           by = 1)), "X", "Y"))) == 0) {
        if (verbose) 
          message("Bad chrom", appendLF = TRUE)
        return(NULL)
      }

What is the purpose of these bits of code? Would it be possible to add a parameter in the function call (e.g., verify=TRUE) to only optionally check the chromosome names?

Thanks!

sorjuela commented 4 years ago

Hi @franrodalg , thank you for your message!

Yeah so I remove "bad chromosomes" just for my own research. I agree, I will add an extra argument for people that don't want to do this.

Regarding the build, does it not work if you specify something like mm10? Otherwise I can also make that optional.

Best, S

franrodalg commented 4 years ago

Thanks for your prompt reply!

Yes, you're right, it works! But I'm not entirely sure it serves any purpose here since readVcf can infer it from the vcf file itself, right? From its documentation, the genome parameter:

A character or Seqinfo object.

  • character: Genome identifier as a single string or named character vector. Names of the character vector correspond to chromosome names in the file. This identifier replaces the genome information in the VCF Seqinfo (i.e., seqinfo(vcf)). When not provided, genome is taken from the VCF file header.
  • Seqinfo: When genome is provided as a Seqinfo it is propagated to the VCF output. If seqinfo information can be obtained from the file, (i.e., seqinfo(scanVcfHeader(fl)) is not empty), the output Seqinfo is a product of merging the two. If a param (i.e., ScanVcfParam) is used in the call to readVcf, the seqlevels of the param ranges must be present in genome.

If you want to keep the build parameter in extract_bams because it helps in your own case, maybe the default option can be changed to an empty string (or ask the user to send an empty string in case they don't want to or cannot specify a particular build), and then change the call to readVcf to be something like:

if (build=='') {
  vcf <- VariantAnnotation::readVcf(vcfFiles[samp], 
                                    param = param)
} else {
  vcf <- VariantAnnotation::readVcf(vcfFiles[samp], 
                                    param = param,
                                    genome = build)
}

In any case, I think it works anyway. It was just its combination with the bad chromosome check that made it fail. Removed the check, it basically doesn't affect. I think it would be good to avoid confusing users, though.

Cheers!

sorjuela commented 4 years ago

Hi, I ended up removing the chromosome filtering and the build argument (PR #35 ). Thanks again ;)

franrodalg commented 4 years ago

Thanks so much, @sorjuela!