gstricker / GenoGAM

http://bioconductor.org/packages/devel/bioc/html/GenoGAM.html
7 stars 4 forks source link

Error with overhangSize #5

Open lauzikaite opened 3 years ago

lauzikaite commented 3 years ago

Hi,

I've been trying to run GenoGAMDataSet on a bunch of BAMs, but no matter what chunkSize and overhangSize parameters I choose (inc. the defaults), it stops with an error:

Error in .makeTiles(l) : Tile size must be greater than twice the overhang size. Check your chromosome lengths. Your smallest chromosome might be smaller than 2 * overhangSize

Can you perhaps point me in the direction of my mistake?

Version GenoGAM_2.4.0

Thanks so much!

gstricker commented 3 years ago

Hi,

could you please provide me with the code that you use to prepare the parameters for GenoGAMDataSet and also your config file?

Thanks!

lauzikaite commented 3 years ago

Thanks for your reply!

I've played with the parameters more and now got through this issue. However, GenoGAMDataSet now stops with another issue:

BiocParallel::register(BiocParallel::MulticoreParam(worker = 10))
expDesign <- file.path(main_dir, "samplemap_genoGAM_test.txt")
wd_folder <- file.path(main_dir, "subsample")
design <- ~ s(x) + s(x, by = experiment)
chunkSize <- 1000
overhangSize <- 200
ggd <- tryCatch(expr = {
  GenoGAMDataSet(
  experimentDesign =  expDesign,
  directory = wd_folder,
  chunkSize = chunkSize,
  overhangSize = overhangSize,
  design = design,
  hdf5 = FALSE)},
  error = function(e){    
    message("The error message:")
    message(e)
  },
  warning = function(w){  
    message("The warning message:")
    message(w)
  })
INFO [2020-11-11 11:17:11] Creating GenoGAMDataSet
INFO [2020-11-11 11:17:36] Reading in data
INFO [2020-11-11 11:17:36] Reading in H9P51_E12s_1
INFO [2020-11-11 11:17:37] Reading in H9P51_E12s_0
INFO [2020-11-11 11:17:37] Finished reading in data
INFO [2020-11-11 11:17:39] Reading in data
INFO [2020-11-11 11:17:39] Reading in H9P51_E12s_1
INFO [2020-11-11 11:17:39] Reading in H9P51_E12s_0
INFO [2020-11-11 11:17:40] Finished reading in data
INFO [2020-11-11 11:17:40] Reading in data
INFO [2020-11-11 11:17:40] Reading in H9P51_E12s_1
INFO [2020-11-11 11:17:41] Reading in H9P51_E12s_0
INFO [2020-11-11 11:17:41] Finished reading in data
The error message:
stop worker failed:
  attempt to select less than one element in OneIndex

I also attach the config file. samplemap_genoGAM_test.txt

I tried BAMs with both numeric chromosome notation and with a "chr" prefix, as notations commonly cause issues with other software, but it didn't have an effect.

A view into one of the BAMs:

$ samtools view SLX-19511.NXlong028.HTLYTDRXX.s_1.sorted.bam | head
A00489:637:HTLYTDRXX:1:1166:5150:12320  163 1   10054   0   50M =   10177   173 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F  NH:i:9  HI:i:1  AS:i:98 nM:i:0
A00489:637:HTLYTDRXX:1:1146:25554:29653 163 1   10056   0   50M =   10176   168 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:F,::  NH:i:10 HI:i:1  AS:i:96 nM:i:0
A00489:637:HTLYTDRXX:1:1259:17607:24064 163 1   10057   0   50M =   10176   169 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,FFF:F,,F,  NH:i:10 HI:i:1  AS:i:98 nM:i:0
A00489:637:HTLYTDRXX:1:2270:12744:22623 163 1   10057   0   50M =   10066   59  ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NH:i:10 HI:i:1  AS:i:99 nM:i:0
A00489:637:HTLYTDRXX:1:1271:6479:31970  99  1   10058   0   50M =   10073   65  CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC  F:FFFFFFFFF:FFFFFFFFFFFF:FFFFF:F::FFFFF,:::FF::FFF  NH:i:10 HI:i:1  AS:i:96 nM:i:1
A00489:637:HTLYTDRXX:1:2273:3486:28917  99  1   10058   0   50M =   10163   155 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC  FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF,FF:FF,FFFF  NH:i:10 HI:i:1  AS:i:98 nM:i:0
A00489:637:HTLYTDRXX:1:1158:21423:18865 99  1   10059   0   50M =   10073   64  CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFF,FFFFF  NH:i:10 HI:i:1  AS:i:98 nM:i:0
A00489:637:HTLYTDRXX:1:2242:2211:35916  163 1   10060   255 50M =   10176   166 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCA  FFFFFFF:FFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFF:F,FFF,  NH:i:1  HI:i:1  AS:i:98 nM:i:0
A00489:637:HTLYTDRXX:1:2259:25220:29168 163 1   10063   255 48M2S   =   10351   329 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACGCAAGA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:F,F,,,F,,  NH:i:1  HI:i:1  AS:i:85 nM:i:1
A00489:637:HTLYTDRXX:1:2217:15745:23688 99  1   10064   255 50M =   10090   76  CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:::FFFF,FF,FF,F:FF  NH:i:1  HI:i:1  AS:i:94 nM:i:2

I'd appreciate any pointers as to why I'm having this issue. Thanks!

gstricker commented 3 years ago

Hi,

I have a few ideas what might be wrong.

  1. Looking at your current GenoGAMDataSet construction I notice that your chunkSize is rather small. For better performance it should be somewhere in the 10ks (e.g. 50k as in the example). If you have trouble increasing it (I believe that was the problem in the initial question) you have to check if you might have any small chromosome in your BAM file. Sometimes Mitochondria chromosomes are included, which might be smaller than the actual chunkSize or 2*overhangSize

  2. The second issue seems to be an error in parallelization, specifically a problem in selecting a job from the indexed job list. What is also strange that it seem to be reading in the same two files multiple times. Usually it should print out the chromosome names it's reading multiple times from a file. Could you put the logger in DEBUG mode to see some more messages and switch off parallelization for a moment and run just the construction of the GenoGAMDataSet again? That is before running the function put

futile.logger::flog.threshold(DEBUG) and BiocParallel::register(BiocParallel::SerialParam())

What organisms are you working on?