nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
139 stars 8 forks source link

thread 'main' panicked, "index out of bounds: the len is 10 but the index is 10" #244

Open lucy924 opened 2 months ago

lucy924 commented 2 months ago

Hi, I've had this issue cropping up in some of my dmr's but not consistently. I'm looping through each of my paired samples and each chromosome reference (human T2T), with different pileups (one using the -traditional preset and another using -cpg. The samples are of differently treated cancer cell lines that have been whole genome sequenced.

experiment names: B3, C3, B4, B5, C5, B6, C6

Problem pileup, pair, chromosome so far:

trad,C6,chr13
trad,C5,chrY
trad,C6,chrY
trad,B3,chrM
trad,C3,chrM
trad,C5,chrM
trad,B6,chrM
trad,C6,chrM
mCG,C6,chr6
mCG,B3,chr10
mCG,B4,chr10
mCG,B6,chr10
mCG,B5,chr11
mCG,B6,chr12
mCG,B5,chr14

Do you think rerunning these specific pileups will help?

Pileup code

# traditional
modkit pileup ${path2reads_bam} ${pileup_bed} \
        --ref $reference \
        --preset traditional \
        --threads 10 \
        --log-filepath $log_file

# CG pileup
modkit pileup ${path2reads_bam} ${pileup_bed} \
        --ref $reference \
        --cpg \
        --threads 10 \
        --log-filepath $log_file

DMR code

modkit dmr pair \
        -a "${control_bed}" \
        --index-a "${control_bed}.tbi" \
        -b "${test_bed}" \
        --index-b "${test_bed}.tbi" \
        -o "${dmr_result}" \
        --header \
        --ref "${reference}" \
        --base C \
        --threads 9 \
        --log-filepath "${dmr_log}" \
        -f \
        --min-valid-coverage 20 \
        --delta 0.1 

Maybe related issue I'm not sure if this next issue related but I'm also getting a different error with other pairs as well:

> sampled 46 a records and 0 b records, calculating max coverages for 95th percentile
> Error! not enough data points (got 0) to calculate percentile

This is the list of experiments with not enough datapoints

trad,B3,chrY
trad,B4,chrY
trad,B5,chrY
trad,B6,chrY
mCG,B6,chr11
mCG,B6,chr14
mCG,B3,chrY
mCG,B4,chrY
mCG,B5,chrY
mCG,C5,chrY
mCG,B6,chrY
mCG,C6,chrY
mCG,B4,chr14

I initially didn't think this was a problem, (especially as the B experiments are a female derived cell line) but then I started getting the same error on autosomal chromosomes too, in which the alignment depth might not be great but I didn't think it was this bad.

Appreciate any insights!

ArtRand commented 2 months ago

Hello @lucy924,

Could you send me the whole panic message?

lucy924 commented 2 months ago

Hi @ArtRand Here it is:

- performing DMR on methylated CpG pileups, single site
> reading reference FASTA at "/home/dejlu879/refs/hg38_chromsonly_ebv_lambda_bcg/chr15.fa"
> running single-site analysis
> using default prior, Beta(α: 0.55, β: 0.55)
> estimating max coverages from data
> sampled 12 a records and 1502 b records, calculating max coverages for 95th percentile
thread 'main' panicked at src/thresholds.rs:39:22:
index out of bounds: the len is 12 but the index is 12
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ArtRand commented 2 months ago

Hello @lucy924, thanks for sending the traceback.

I'll have a build to you with the fix asap.

ArtRand commented 2 months ago

Hello @lucy924,

What you're doing looks reasonable to me. With respect to this error:

> sampled 46 a records and 0 b records, calculating max coverages for 95th percentile
> Error! not enough data points (got 0) to calculate percentile

If for some reason bedMethyl records from the 'b' sample are systematically failing you cold see zero records being sampled and these errors will be in the log. Looks like you have --log-filepath "${dmr_log}" set, could you tell me if there are error messages in there? They would look like "batch failed during sampling.." or "failed to get coverages...". You can work-around this by passing --max-coverages option, I would pick a number that's roughly what coverage you expect. If you have less coverage than the max, that's fine. Otherwise I'm a little curious how you're getting zero records for one of the samples, is it possible that your tabix index has the contig listed, but there are in fact zero records for that contig? Maybe checking tabix -l "${test_bed}" and see if it makes sense.

lucy924 commented 2 months ago

Hi @ArtRand

I went through a bunch of the logs and none of them give any actual error messages, they just stop with the last line sampled 46 a records and 0 b records, calculating max coverages for 95th percentile. I'm not sure if I've missed an error log output somewhere? The output of tabix -l "${test_bed}" is as expected, with only one contig (I've been aligning each chromosome separately for reasons), and I checked that the bed files do actually contain records, so it's not that they are empty! I have attached a few log files if they are of any use, all of one pair of samples (B6) on chromosome 14 that had a some dmr runs work and some that didn't.

File Status Description
B6_pair.chr14.6mA_dmr.v1ss.bed.log DMR failed DMR using pileups looking for the 6mA modification
B6_pair.chr14.trad_dmr.v1ss.bed.log DMR successful DMR using pileups made with the traditional preset
B6_Control.chr14.6mA_pileup.log Fine Associated DMR found 0 records (see above)
B6_Control.chr14.mCG_pileup.log Fine Associated DMR may have found 0 records
B6_Test.chr14.mCG_pileup.log Fine Associated DMR may have found 0 records
B6_Control.chr14.trad_pileup.log Fine Associated DMR was successful (see above)

Unfortunately the mCG DMR logs were deleted somewhere along the way, but if you need them I can rerun it at some point.

debugging.zip

ArtRand commented 2 months ago

Hello @lucy924,

Thanks for sending the logs over. Do other chromosomes 6mA DMR seem to work and just chr14 is having problems? One thing I noticed is that you use --motif A 0 --motif T 0 I assume to capture the complement T? You don't actually need to do this --motif A 0 will capture (+)-strand As and (-)-strand T positions. If you short-circuit this estimation step by passing --max-coverages to the dmr pair for chr14 6mA, does it complete successfully?

Sorry for the somewhat cryptic asks, I have a theory for what's going on, but I need a little more information.

Thanks.

lucy924 commented 2 months ago

Hi @ArtRand

Thanks for the clarification on the A/T - I wasn't entirely sure from the docs so put T down just in case, figuring it wouldn't do any harm.

Other than chrY, I had the same problem with one sample, chr11, mCG, DMR, but not with 6mA DMR. Chr14 was having problems with multiple samples on mCG, and one of those samples on 6mA.

I set --max-coverages to 100 and reran all samples/all chroms, and chr14 didn't error, but it didn't work either:

On mCG

> finished, processed 0 sites successfully, 0 failed

6mA is behaving weirdly on all samples this time, and I don't think I've done anything different. Unfortunately I don't have time this week to delve into it more, I'll let you know if I figure out what happened.

I've attached example outputs for both mCG and 6mA, hopefully it helps ...

I managed to get good outputs using the below (both B and C sets):

    modkit dmr pair \
    -a "${control_bed_C3}" \
    -a "${control_bed_C5}" \
    -a "${control_bed_C6}" \
    -b "${test_bed_C3}" \
    -b "${test_bed_C5}" \
    -b "${test_bed_C6}" \
    -o "${dmr_result}" \
    --header \
    --ref "${reference}" \
    --segment "${dmr_segment}" \
    --base C \
    --base A \
    --threads 20 \
    --log-filepath "${dmr_log}" \
    -f \
    --min-valid-coverage 20 \
    --delta 0.1

And I'm running with this data for now, do you think it likely that issues happened with the same samples on chr14 and the problems are just hidden in this case?

Thanks, Lucy

example_outputs.txt

ArtRand commented 1 month ago

Hello @lucy924,

I've made a refactor to how the tabix indices are handled in dmr. Please find a build attached. Hopefully with this build the outputs from the various runs become consistent. Thanks,

A modkit_u16_x86_64.tar.gz

lucy924 commented 1 month ago

Hi @ArtRand Thanks for that, I'll test it out when I have the chance. Much appreciated!