roland-rad-lab / MoCaSeq

Analysis pipelines for cancer genome sequencing in mice.
Other
19 stars 14 forks source link

Delly filtered data #7

Closed laugon10 closed 3 years ago

laugon10 commented 3 years ago

Hi! I am having some issues with the Delly and Chromotripsis detection. I have done al the steps in the interactive mode and everything seems ok until I get to the $name.breakpoints.filtered.tab which only contains NAs. I have started already from bams, so I have not done all the steps in the pipeline. Maybe it is something related to it. I would really appreciate some advice.

Thank you very much, Laura

NikdAK commented 3 years ago

Hi Laura, I recently fixed this bug, which is related to a Delly version update resulting in changed IO formats.

Could you please try the code in this development repository and check if it works? https://github.com/roland-rad-lab/MoCaSeq/tree/human-pipeline/repository

The most important changes are Chromothripsis_Delly_Annotate-Filter.R as well as lines 1306-1396 in the main script.

Good luck

laugon10 commented 3 years ago

Hi, I have tried with this version. And now I obtained an empty filtered file. Are you sure it is compatible with the old version of mocaseq? Should I re-run something?

Thank you very much, Laura

NikdAK commented 3 years ago

It should be compatible with the old version, but I did not fully test it.

The new code should look like this:


coverage=$(sh $repository_dir/Chromothripsis_GetCoverage.sh $name)

java -jar $snpeff_dir/SnpSift.jar extractFields \
$name/results/Delly/$name.pre.vcf \
CHROM POS CHR2 POS2 END PE SR PRECISE IMPRECISE \
GEN[Tumor].DR GEN[Tumor].CN GEN[Tumor].RR GEN[Tumor].RV \
GEN[Normal].DR GEN[Normal].CN GEN[Normal].RR GEN[Normal].RV \
MAPQ CT \
> /var/pipeline/${name}/results/Delly/${name}.breakpoints.tab

Rscript $repository_dir/Chromothripsis_Delly_Annotate-Filter.R \
-n $name \
-i $name/results/Delly/$name.breakpoints.tab \
-o T \
-c $coverage \
-v 0.2 \
-d 6000

Which version of Delly and MoCaSeq are you running?

Is the unfiltered file (Delly/NAME.breakpoints.tab) also empty? If not, could you please show me the first few lines of that file?

laugon10 commented 3 years ago

Hi, Yes, I tried exactly those lines. I just corrected the coverage to 33 since I am not doing the full mocaseq pipe.

Delly2 v0.8.3. For Mocaseq I installed it with the latest -> I have not been able to find the version in any file. But since I have install it in October 2020 I guess it is v0.4.53 (master).

This is the beginning of the Delly/name.breakpoints.tab file:

CHROM   POS     CHR2    POS2    END     PE      SR      PRECISE IMPRECISE       GEN[Tumor].DR   GEN[Tumor].CN   GEN[Tumor].RR   GEN[Tumor].RV   GEN[Normal].DR  GEN[Normal].CN  GEN[Normal].RR  GEN[Normal].RV     MAPQ    CT
1       3003408                 81643743        4               false   true    22      4       0       0       28      4       0       0       1       5to3
1       3003441                 3799048 2               false   true    28      4       0       0       21      4       0       0       1       5to3
1       3153301                 153080616       2               false   true    31      6       0       0       31      7       0       0       1       3to3
1       3232541                 7985537 6               false   true    32      4       0       0       40      4       0       0       1       5to3
1       3233304                 100472635       6               false   true    39      4       0       0       44      4       0       0       1       3to3
1       3305491                 103894906       2               false   true    28      4       0       0       31      4       0       0       1       3to3
1       3344038                 182182277       5               false   true    39      24      0       0       42      25      0       0       1       5to3

Thank you very much, Laura

NikdAK commented 3 years ago

So the versions as well as the data format is fine.

I assume one of the filter cutoffs is removing all lines.
These are the main 3 filter steps (coverage and MAPQ is only applied later):

  1. control.varFreq == 0
  2. tumor.varFreq >= 0.2
  3. distances >= 6000

Could you please manually run the Chromothripsis_Delly_Annotate-Filter.R script?

You can set this

setwd("PATH_BEFORE_NAME_FOLDERS")
name <- "NAME"
input <- paste0(name, "/results/Delly/",name,".breakpoints.tab")
FilterInterChromosomalRearrangements <- T
CoverageFilter <- 33
TumorVarFreq <- 0.2
maxDist <- 6000

... and then start at line 40 and stop at line 85, showing which filter(s) is removing everything:

#inputDT[control.varFreq == 0 & tumor.varFreq >= TumorVarFreq & distances >= maxDist] # original line
inputDT[control.varFreq == 0]
inputDT[tumor.varFreq >= TumorVarFreq]
inputDT[distances >= maxDist]

Another way would be to show me all the ranges, which tells us exactly which column is failing the filter: inputDT[, sapply(.SD, range)]

laugon10 commented 3 years ago

scriptFilterDelly.txt Hi, I have run the code I am attaching (scriptFilterDelly.R).

I have print a table before the final filtering step that look like this:

 head inputDTprefilter
"donChr" "donPos" "accChr" "accPos" "PE" "SR" "PRECISE" "IMPRECISE" "tumor.DR" "tumor.DV" "tumor.RR" "tumor.RV" "control.DR" "control.DV" "control.RR" "control.RV" "MAPQ" "Type" "control.pairs.ratio" "control.SR.ratio" "tumor.pairs.ratio" "tumor.SR.ratio" "control.coverage" "tumor.coverage" "control.varFreq" "tumor.varFreq" "distances"
"1" 3003408 NA NA 4 NA FALSE TRUE 22 4 0 0 28 4 0 0 1 "5to3" 0.125 0 0.153846153846154 0 32 26 0.125 0.153846153846154 NA
"1" 3003441 NA NA 2 NA FALSE TRUE 28 4 0 0 21 4 0 0 1 "5to3" 0.16 0 0.125 0 25 32 0.16 0.125 NA
"1" 3153301 NA NA 2 NA FALSE TRUE 31 6 0 0 31 7 0 0 1 "3to3" 0.184210526315789 0 0.162162162162162 0 38 37 0.184210526315789 0.162162162162162 NA
"1" 3232541 NA NA 6 NA FALSE TRUE 32 4 0 0 40 4 0 0 1 "5to3" 0.0909090909090909 0 0.111111111111111 0 44 36 0.0909090909090909 0.111111111111111 NA
"1" 3233304 NA NA 6 NA FALSE TRUE 39 4 0 0 44 4 0 0 1 "3to3" 0.0833333333333333 0 0.0930232558139535 0 48 43 0.0833333333333333 0.0930232558139535 NA
"1" 3305491 NA NA 2 NA FALSE TRUE 28 4 0 0 31 4 0 0 1 "3to3" 0.114285714285714 0 0.125 0 35 32 0.114285714285714 0.125 NA
"1" 3344038 NA NA 5 NA FALSE TRUE 39 24 0 0 42 25 0 0 1 "5to3" 0.373134328358209 0 0.380952380952381 0 67 63 0.373134328358209 0.380952380952381 NA
"1" 3452589 NA NA 2 NA FALSE TRUE 42 4 0 0 45 4 0 0 4 "3to5" 0.0816326530612245 0 0.0869565217391304 0 49 46 0.0816326530612245 0.0869565217391304 NA
"1" 3612939 NA NA 2 NA FALSE TRUE 41 26 0 0 48 27 0 0 1 "3to3" 0.36 0 0.388059701492537 0 75 67 0.36 0.388059701492537 NA

But with the final filtering step I have obtained an empty table.

# FILTERING
if(FilterInterChromosomalRearrangements){
  print(paste("Filtering only intrachromosomal rearrangements"))
  inputDT <- inputDT[donChr==accChr]
} 

This is how my CNVs plot look like, so I am expecting something in this table.

image

Thank you very much! Laura

NikdAK commented 3 years ago

Ok so the issue is the "NA" in the accChr and accPos column. This is probably due to the lines 55-57:

inputDT[accChr == "", accPos := END]
inputDT[accChr == "", accChr := donChr]
inputDT[, END := NULL]

This will set the values for the acceptor information based on donChr and END column.

I tried it for the 7 lines you sent me and it is working without issues. Could you please check what is happening with the mentioned find and replace commands? I.e. if the values in accChr are empty ("") like we expect them to be, maybe there are some hidden whitespaces. inputDT[accChr == ""]

laugon10 commented 3 years ago

Hi, I have tried and what I think is happening is that setting the values for acceptor is no working at all. I execute for ie: inputDT[accChr == "", accPos := END] And none of the NA are change in to the right value. They keep the NA value. It seems that NAs arre not recognized since with inputDT[accChr == ""]

Te result is: Empty data.table (0 rows and 27 cols): donChr,donPos,accChr,accPos,PE,SR... But with ie: inputDT[accChr == "1"]

I get:

 donChr   donPos accChr    accPos PE SR PRECISE IMPRECISE tumor.DR
   1:     10  3251873      1  54408453  4 NA   FALSE      TRUE       41
   2:     10  3990666      1 128638550  2 NA   FALSE      TRUE       58
   3:     10  3990826      1 128638448  5 NA   FALSE      TRUE       53
   4:     10  4499209      1  88254067  0  9    TRUE     FALSE        0
   5:     10  6088966      1  84965515  0  5    TRUE     FALSE        0
  ---
1296:      Y 70166070      1 145917349  2 NA   FALSE      TRUE       19
1297:      Y 81822591      1 125892066  5 NA   FALSE      TRUE       21
1298:      Y 81822736      1 125891794  3 NA   FALSE      TRUE       13
1299:      Y 82439135      1   3145039  2 NA   FALSE      TRUE       19
1300:      Y 88589743      1 179980522  2 NA   FALSE      TRUE       21
      tumor.DV tumor.RR tumor.RV control.DR control.DV control.RR control.RV
   1:        2        0        0         48          2          0          0
   2:        2        0        0         38          2          0          0
   3:        2        0        0         36          2          0          0
   4:        2       53       37          0          2         60         50
   5:        2       20        8          0          2         25          9
  ---
1296:        0        0        0         12         -1          0          0
1297:        0        0        0         19         -1          0          0
1298:        0        0        0         10         -1          0          0
1299:        0        0        0         25         -1          0          0
1300:        0        0        0         23         -1          0          0
      MAPQ Type control.pairs.ratio control.SR.ratio tumor.pairs.ratio
   1:   41 5to5          0.04000000        0.0000000        0.04651163
   2:   59 5to5          0.05000000        0.0000000        0.03333333
   3:   60 3to3          0.05263158        0.0000000        0.03636364
   4:    0 3to3          1.00000000        0.4545455        1.00000000
   5:    0 3to5          1.00000000        0.2647059        1.00000000
  ---
1296:   42 3to5         -0.09090909        0.0000000        0.00000000
1297:   60 5to5         -0.05555556        0.0000000        0.00000000
1298:   60 3to3         -0.11111111        0.0000000        0.00000000
1299:   47 3to5         -0.04166667        0.0000000        0.00000000
1300:   27 5to5         -0.04545455        0.0000000        0.00000000
      tumor.SR.ratio control.coverage tumor.coverage control.varFreq
   1:      0.0000000               50             43      0.04000000
   2:      0.0000000               40             60      0.05000000
   3:      0.0000000               38             55      0.05263158
   4:      0.4111111              112             92      0.46428571
   5:      0.2857143               36             30      0.30555556
  ---
1296:      0.0000000               11             19     -0.09090909
1297:      0.0000000               18             21     -0.05555556
1298:      0.0000000                9             13     -0.11111111
1299:      0.0000000               24             19     -0.04166667
1300:      0.0000000               22             21     -0.04545455
      tumor.varFreq distances
   1:    0.04651163        NA
   2:    0.03333333        NA
   3:    0.03636364        NA
   4:    0.42391304        NA
   5:    0.33333333        NA
  ---
1296:    0.00000000        NA
1297:    0.00000000        NA
1298:    0.00000000        NA
1299:    0.00000000        NA
1300:    0.00000000        NA

Sorry for all the trouble. I can not undertand the problem.

Laura

NikdAK commented 3 years ago

Yes, so these are inter-chromosomal breakpoints, which will always be there like this. Those are filtered out by default (intraChromOnly).

donChr   donPos accChr    accPos
10  3251873      1  54408453

As I said, we expect the intra-chromosomal to be an empty value (""), which should be the output of SnpSift.jar extractFields. We need to find out what the actual value of CHR2 and POS2 is in your Delly/name.breakpoints.tab.

CHROM   POS     CHR2    POS2    END
1       3003408                 81643743

How about this? inputDT[is.na(accChr)]

Alternatively you can attach me the subset of the original Delly/name.breakpoints.tab file (please extracted without parsing via console and with some of the lines with empty CHR2 values as shown above).

laugon10 commented 3 years ago

Thanks, I understand that only intrachromosomal are kept. (I wonder if it possible to obtain also filtered interchromosomal?).

This inputDT is the table I get before the final step to keep only intrachromosomal arrangements. Since the inputDT after this step its empty. Just to be sure we are talking about the same data file.

I am attaching a subset obtained like this. head -1000 NAME.breakpoints.tab > subset.breakpoints.tab subset.breakpoints.txt

The output of the comand:

> inputDT[is.na(accChr)]
       donChr   donPos accChr accPos PE SR PRECISE IMPRECISE tumor.DR tumor.DV
    1:      1  3003408     NA     NA  4 NA   FALSE      TRUE       22        4
    2:      1  3003441     NA     NA  2 NA   FALSE      TRUE       28        4
    3:      1  3153301     NA     NA  2 NA   FALSE      TRUE       31        6
    4:      1  3232541     NA     NA  6 NA   FALSE      TRUE       32        4
    5:      1  3233304     NA     NA  6 NA   FALSE      TRUE       39        4
   ---
34025:      Y 91313818     NA     NA  2 NA   FALSE      TRUE        0        0
34026:      Y 91314427     NA     NA  2 NA   FALSE      TRUE        0        0
34027:      Y 91317516     NA     NA  2 NA   FALSE      TRUE        0        0
34028:      Y 91318976     NA     NA  2 NA   FALSE      TRUE        0        0
34029:      Y 91319175     NA     NA  3 NA   FALSE      TRUE        0        0
       tumor.RR tumor.RV control.DR control.DV control.RR control.RV MAPQ Type
    1:        0        0         28          4          0          0    1 5to3
    2:        0        0         21          4          0          0    1 5to3
    3:        0        0         31          7          0          0    1 3to3
    4:        0        0         40          4          0          0    1 5to3
    5:        0        0         44          4          0          0    1 3to3
   ---
34025:        0        0          0         -1          0          0   13 3to5
34026:        0        0          0         -1          0          0    1 5to3
34027:        0        0          0         -1          0          0   49 5to3
34028:        0        0          0         -1          0          0    1 5to3
34029:        0        0          0         -1          0          0    1 3to5
       control.pairs.ratio control.SR.ratio tumor.pairs.ratio tumor.SR.ratio
    1:          0.12500000                0        0.15384615              0
    2:          0.16000000                0        0.12500000              0
    3:          0.18421053                0        0.16216216              0
    4:          0.09090909                0        0.11111111              0
    5:          0.08333333                0        0.09302326              0
   ---
34025:          1.00000000                0        0.00000000              0
34026:          1.00000000                0        0.00000000              0
34027:          1.00000000                0        0.00000000              0
34028:          1.00000000                0        0.00000000              0
34029:          1.00000000                0        0.00000000              0
       control.coverage tumor.coverage control.varFreq tumor.varFreq distances
    1:               32             26      0.12500000    0.15384615        NA
    2:               25             32      0.16000000    0.12500000        NA
    3:               38             37      0.18421053    0.16216216        NA
    4:               44             36      0.09090909    0.11111111        NA
    5:               48             43      0.08333333    0.09302326        NA
   ---
34025:               -1              0      1.00000000    0.00000000        NA
34026:               -1              0      1.00000000    0.00000000        NA
34027:               -1              0      1.00000000    0.00000000        NA
34028:               -1              0      1.00000000    0.00000000        NA
34029:               -1              0      1.00000000    0.00000000        NA

Thank you very much!

NikdAK commented 3 years ago

As expected, the problem was the cell value being NA instead of "". I pushed a fix for this just now.

These are the small changes to catch both cases:


# set types to avoid warnings
inputDT[, accPos := as.integer(accPos)]
inputDT[, accChr := as.integer(accChr)]

# empty acceptor CHROM+POS is equal to changes on the same chromosome
inputDT[accChr == "" | is.na(accChr), accPos := END]
inputDT[accChr == "" | is.na(accChr), accChr := donChr]
inputDT[, END := NULL]

Now this should work as expected, but maybe you have to adjust the cutoffs for your data (control.varFreq, tumor.varFreq, distances).

laugon10 commented 3 years ago

Thank you! This worked!! I started with 43371 lines in the Name.breakpoints.tab and now I get 33704 in the filtered. Do you think is enough strict in your experience?

NikdAK commented 3 years ago

No problem :)

Sorry, I can not tell you anything about the filtering just by the number of lines. I would recommend evaluating everything by the criteria described in the publication (i.e. Figure 11).