luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

parameter setting for de novo in trio data #158

Closed Yiguan closed 3 years ago

Yiguan commented 3 years ago

Hi!

It's an excellent tool! Also thanks for providing such detailed manual. Impressive!

I have some Illunima data from fruit flies and trying to detect the de novo mutations on germline. This is the command I am testing:

octopus -R my.fa \
        -I sample_43.sort.bam   sample_32.sort.bam    sample_62.sort.bam \
        -F sample_43 \
        -M sample_32 \
        --regions 4 \
        --denovos-only \
        --thread 8 \
        -o trio_denovosOnly.vcf

There are dozens of variants reported in the trio_denovosOnly.vcf file. In fact, the chromosome 4 in fruit fly is only about 1.3Mb, and there are unlikely to have new mutations in one generation on this chromosome. These reported variants are likely to be false positives.

My question is which arguments/parameters in octopus could be used to minimise the detection of false positives even if at the cost of running time and memory.

I currently only use the default parameters and I appreciate it if you could offer some suggestion!

Thanks!

dancooke commented 3 years ago

Hi, which version of Octopus are you using?

Yiguan commented 3 years ago

Hi, which version of Octopus are you using?

I think it the latest version:

octopus version 0.6.3-beta
Target: x86_64 Linux 5.4.0-1041-azure
Compiler: GNU 9.3.0
Boost: 1_74

Thanks!

dancooke commented 3 years ago

v0.6.3b is actually very old. I'm assuming you're installing via Conda? I'm trying to get the Bioconda version updated to the latest version which has a number of improvements for de novo calling. Hopefully this will be available in the next few days. Otherwise you can try installing manually.

Yiguan commented 3 years ago

v0.6.3b is actually very old. I'm assuming you're installing via Conda? I'm trying to get the Bioconda version updated to the latest version which has a number of improvements for de novo calling. Hopefully this will be available in the next few days. Otherwise you can try installing manually.

Yes, I installed it via conda. I will try to install the latest version manually and test it again. Thanks!

Yiguan commented 3 years ago

v0.6.3b is actually very old. I'm assuming you're installing via Conda? I'm trying to get the Bioconda version updated to the latest version which has a number of improvements for de novo calling. Hopefully this will be available in the next few days. Otherwise you can try installing manually.

Hi, Following your instructions , I tried to install it via github:

$ git clone -b master https://github.com/luntergroup/octopus.git
$ octopus/scripts/install.py --dependencies --forests

but encountered an error

==> ../configure --prefix=/data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4 --libdir=/data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4/lib/gcc/10 --di
==> make
Last 15 lines from /data/home/wwang1111/.cache/Homebrew/Logs/gcc/02.make:
checking whether /usr/bin/make sets $(MAKE)... yes
checking whether /usr/bin/make supports nested variables... yes
checking for x86_64-pc-linux-gnu-gcc... /tmp/gcc-20210407-4531-8dqqpx/gcc-10.2.0/build/./gcc/xgcc -B/tmp/gcc-20210407-4531-8dqqpx/gcc-10.2.0/build/./gcc/ -B/data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4/x86_64-pc-linux-gnu/bin/ -B/data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4/x86_64-pc-linux-gnu/lib/ -isystem /data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4/x86_64-pc-linux-gnu/include -isystem /data/home/wwang1111/octopus/build/brew/Cellar/gcc/10.2.0_4/x86_64-pc-linux-gnu/sys-include   -fno-checking
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... configure: error: in `/tmp/gcc-20210407-4531-8dqqpx/gcc-10.2.0/build/x86_64-pc-linux-gnu/libgomp':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
make[2]: *** [configure-stage1-target-libgomp] Error 1
make[2]: Leaving directory `/tmp/gcc-20210407-4531-8dqqpx/gcc-10.2.0/build'
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory `/tmp/gcc-20210407-4531-8dqqpx/gcc-10.2.0/build'
make: *** [all] Error 2

READ THIS: https://docs.brew.sh/Troubleshooting

These open issues may also help:
fibjs: build with gcc https://github.com/Homebrew/linuxbrew-core/pull/22696
emscripten: use newer gcc https://github.com/Homebrew/linuxbrew-core/pull/22455
lc0: use newer GCC https://github.com/Homebrew/linuxbrew-core/pull/22550
hbase: build with system gcc https://github.com/Homebrew/linuxbrew-core/pull/22685
gdb: use system gcc, remove workaround https://github.com/Homebrew/linuxbrew-core/pull/22698
Bootstrapped gcc@5 has broken symlinks https://github.com/Homebrew/linuxbrew-core/issues/22511
gcc@10 fails because of outdated zstd on Ubuntu 16.04 https://github.com/Homebrew/linuxbrew-core/issues/22800
Traceback (most recent call last):
  File "octopus/scripts/install.py", line 515, in <module>
    main(args)
  File "octopus/scripts/install.py", line 422, in main
    ret = call(["cmake3"] + cmake_options + [".."])
  File "/usr/lib64/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'cmake3'

Is it trying to install gcc_10.2 but failed?

I actually have updated my gcc to the latest version manually.

$ gcc --version
gcc (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Yiguan commented 3 years ago

A follow up about the mutation detected using different version of octopus.

Here is the comparison on the same test data using two different octopus versions.

For version 0.6.3-beta installed via conda, the inferred de novo mutations are:

CHR  POS    REF    ALT
4       12050   A       T
4       25522   T       G
4       25907   T       C
4       147101  AAAAGGTAAATTTATTATCTGTTCCGAATCC A
4       147133  TATC    T
4       147145  GCAATAGATT      G
4       147174  T       A
4       147217  C       T
4       147222  A       T
4       637858  G       A
4       792408  T       C
4       1267507 AT      A,*
4       1280572 G       A
4       1293153 G       A
4       1305360 T       A
4       1316559 G       A

For version 0.7.2, the inferred de novo mutations are:

CHR  POS    REF    ALT
4       12050   A       T
4       25522   T       G
4       25907   T       C
4       225526  T       C
4       225532  A       T
4       637858  G       A
4       786848  A       C
4       791262  G       T
4       792408  T       C
4       1200133 C       A
4       1200140 C       A
4       1200147 C       A
4       1200495 T       G
4       1218262 A       T
4       1218272 A       T
4       1280572 G       A
4       1293153 G       A
4       1295375 G       A
4       1305360 T       A

Although there are some consensus sites, the two sets results are quite different. The newer version octopus seems to report more mutations.

dancooke commented 3 years ago

I'd suggest

I'd also recommend against using the --denovos-only option. This prevents you from re-filtering the call set and you loose haplotype information from germline variants. You can easily extract de novo calls from the default output as they are flagged with DENOVO in the INFO column:

$ bcftools view -i 'DENOVO=1' trio.vcf > trio.denovo.vcf

Finally, I'd recommend using the --bamout option to make evidence BAMs. So in summary, try

$ octopus -R my.fa \
        -I sample_43.sort.bam   sample_32.sort.bam    sample_62.sort.bam \
        -F sample_43 \
        -M sample_32 \
        --snp-heterozygosity XXX \
        --indel-heterozygosity YYY \
        --max-haplotypes 300 \
        --regions 4 \
        --forest germline.v0.7.2.forest \
        --thread 8 \
        -o trio.vcf \
        --bamout evidence

If you're still finding lots of PASS denovo calls then take a look at the evidence BAMs to see what's going on. I'm also happy to take a look at these if you'd like to post them along with the VCF.

Yiguan commented 3 years ago

Hi, I re-ran the test with the parameters you suggested as following:

octopus -R chr4.fa \
        -I sample_43_chr4.bam sample_32_chr4.bam sample_62_chr4.bam \
        -F sample_43 \
        -M sample_32 \
        --denovo-snv-prior 2.8e-9 \
        --denovo-indel-prior 1.2e-9 \
        --snp-heterozygosity 0.005 \
        --indel-heterozygosity 0.0005 \
        --max-haplotypes 300 \
        --regions 4 \
        --forest germline.v0.7.2.forest.gz \
        --thread 8 \
        -o trio_V0.7_ref6.36_s62.vcf \
        --bamout evidence

But unfortunately, it encountered an error:

...
[2021-04-09 23:16:55] <INFO>       4:1301839            96.3%          54m 45s             2m 7s
[2021-04-09 23:19:56] <INFO>       4:1313994            97.5%          57m 46s            1m 32s
[2021-04-09 23:21:15] <INFO>       4:1333424            98.9%           59m 5s               37s
[2021-04-09 23:21:16] <INFO>               -             100%           59m 6s                 -
[2021-04-09 23:21:16] <INFO> Starting Call Set Refinement (CSR) filtering
[2021-04-09 23:21:16] <INFO> Removed 4 temporary files
[2021-04-09 23:21:16] <EROR> A program error has occurred:
[2021-04-09 23:21:16] <EROR> 
[2021-04-09 23:21:16] <EROR>     Encountered an exception during calling 'std::bad_alloc'. This means
[2021-04-09 23:21:16] <EROR>     there is a bug and your results are untrustworthy.
[2021-04-09 23:21:16] <EROR> 
[2021-04-09 23:21:16] <EROR> To help resolve this error run in debug mode and send the log file to
[2021-04-09 23:21:16] <EROR> https://github.com/luntergroup/octopus/issues.
[2021-04-09 23:21:16] <INFO> ------------------------------------------------------------------------

I tested it many times and always got the same error. It's unlikely to be caused by the memory issue since we have ENOUGH available memory on our workstation.

I am not sure if it's a bug in the octopus or somewhere goes wrong in my Linux environment. I shared the data on GoogleDrive: 3 bam files and a reference genome file.

https://drive.google.com/drive/folders/1dySSlXnKJTotUNndxOkT-Bw1PErpHP9V?usp=sharing

Thanks!

dancooke commented 3 years ago

Thanks for the data. Unfortunately I can't replicate the exception on my system - the run completes successfully. What OS are you running? I can try to replicate in a Docker container.

Regarding the de novos, these do appear to be false positives and unfortunately the random forest doesn't filter them. I'm going to include more trio training data in the next release forest so that should improve. For now, there are other reported annotations that strongly point to these being false positives: MP, PP, and GQ. At least one of these is very low in all of the called DENOVO mutations. For example, using the filter

$ bcftools view -f PASS -i 'DENOVO=1&&MP>10&&PP>10&&MIN(GQ)>10' trio.vcf > trio.denovo.highconf.vcf 

results in zero de novo calls. Although I usually don't like hard filtering, these are very relaxed thresholds for these annotations, so it's unlikely to affect sensitivity much.

Yiguan commented 3 years ago

Thanks so much! I think it's probably essential to do some hard filtering for my data. Reads simulation may be helpful to determine a threshold for the annotation. I will try that. Also, looking forward your upgrading the octopus on conda.

dancooke commented 3 years ago

The exception is due to the forest files being proved in compressed form - they need to be decompressed.