luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Making random forest for 0.7.0 #136

Closed DiDeoxy closed 3 years ago

DiDeoxy commented 3 years ago

Hi, Dan,

I know there is no random forest for 0.7.0 so I am attempting to roll my own.

I have downloaded the GIAB VCF file (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/) and the alignment for NA12878 (https://github.com/genome-in-a-bottle/giab_data_indexes), as well as hg19, against which the above are aligned/called (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/).

I have made the following config for train_random_forest.py:

{
    "truths": {
        "SAMPLE1.truth": {
            "vcf": "/scratch/maxh/barley-pgda/giab/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz",
            "bed": "/scratch/maxh/barley-pgda/giab/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed"
        }
    },
    "examples": [
        {
            "reference": "/scratch/maxh/barley-pgda/giab/hg19.fa.gz",
            "reads": "/scratch/maxh/barley-pgda/giab/RMNISTHS_30xdownsample.bam",
            "truth": "SAMPLE1.truth"
        }
    ],
    "training": {
        "cross_validation_fraction": 0.2,
        "hyperparameters": [
            {
                "trees": 500,
                "min_node_size": 10
            },
            {
                "trees": 500,
                "min_node_size": 20
            },
            {
                "trees": 500,
                "max_depth": 10
            },
            {
                "trees": 500,
                "max_depth": 20
            },
            {
                "trees": 200,
                "max_depth": 10
            }
        ]
    }
}
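Presumably train_random_forest.py reads this config with a strict JSON parser (an assumption on my part), in which case trailing commas after the last element of an object or array would be rejected, so it may be worth checking the file parses before queuing a cluster job. A minimal sanity-check sketch, using placeholder paths and a trimmed-down structure rather than the real files from this thread:

```python
import json

# Placeholder config mirroring the shape expected by train_random_forest.py
# (an assumption); the paths below are hypothetical, not the real files.
config_text = """
{
    "truths": {
        "SAMPLE1.truth": {
            "vcf": "/path/to/truth.vcf.gz",
            "bed": "/path/to/confident_regions.bed"
        }
    },
    "examples": [
        {
            "reference": "/path/to/ref.fa.gz",
            "reads": "/path/to/reads.bam",
            "truth": "SAMPLE1.truth"
        }
    ],
    "training": {
        "cross_validation_fraction": 0.2,
        "hyperparameters": [
            {"trees": 500, "min_node_size": 10}
        ]
    }
}
"""

# json.loads raises JSONDecodeError on strict-JSON violations
# such as trailing commas.
config = json.loads(config_text)

# Every entry in "examples" should reference a key defined in "truths".
for example in config["examples"]:
    assert example["truth"] in config["truths"], example["truth"]

print("config OK")
```

Running the check against the actual config file (e.g. via `json.load(open(path))`) would catch both syntax errors and dangling truth references in one pass.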

Is this a reasonable setup for training the model, or should I get more samples? I copied your example training settings exactly; I have no idea whether they are appropriate. Finally, I will be running this training on a compute cluster and need to estimate resource utilization: I am currently targeting 6 cores, 24 GB of RAM, and 24 hrs of wall time. Is this a reasonable estimate?

Oh, and I will be calling barley alignments with this forest; I can't find a good known-variants data set for barley. This shouldn't degrade results too much, should it?

Cheers,

Max H.

dancooke commented 3 years ago

Hi Max,

Version 0.7.0 is now released and the bundled forests are available. I would just use the germline forest, germline.v0.7.0.forest, especially in the absence of a good truth set.

You can install the new version with forests as shown here.

Best,

Dan

DiDeoxy commented 3 years ago

Awesome, congrats on getting to the 0.7.0 release!

I got a clean install to compile, will test it out on some BAMs now!

Cheers,

Max.