projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Performance of split and normalize transformers #453

Closed: Hoeze closed this issue 4 months ago

Hoeze commented 2 years ago

Hi, is it possible that the split and normalize transformers are significantly slower than bcftools norm?

We just tried splitting a large 190GB VCF file with 800 samples from https://dataportal.answerals.org/home :

import glow

# Register Glow with the Spark session
spark = glow.register(spark)

# Read the dataset VCF; with flattenInfoFields disabled, the INFO fields
# stay nested in the 'attributes' column
df = (
    spark
    .read
    .format('vcf')
    .option('flattenInfoFields', False)
    .load(snakemake.input['vcf'])
)

# Optionally drop the INFO fields entirely
if snakemake.params['drop_INFO']:
    df = df.drop('attributes')

# Optionally disable whole-stage code generation
if not snakemake.params['whole_stage_codegen']:
    spark.conf.set("spark.sql.codegen.wholeStage", False)

# Split multiallelic sites, then normalize against the reference genome
df = glow.transform("split_multiallelics", df)
df = glow.transform("normalize_variants", df, reference_genome_path=snakemake.input['fasta'])

# Write the normalized variants to parquet, partitioned by chromosome
(
    df
    .write
    .parquet(snakemake.output['vcf_norm'], mode="overwrite", partitionBy=["contigName"])
)

Running on 64 cores and 250 GB RAM, this would have taken ~10 days, regardless of whether we disabled whole-stage codegen.

Then we tried the same with bcftools, split by chromosome:

# -m -: split multiallelic sites into biallelic records
# -f:   left-align and normalize indels against the reference FASTA
bcftools view "{input.vcf}" "{params.chrom}" | \
    bcftools norm -m - -f "{input.fasta}" --threads {threads} -O z -o {output.vcf_norm}

With 46 cores and 184 GB RAM, this took around three hours.

Where does this difference come from? Am I making a mistake somewhere, or how can I improve on this?

williambrandler commented 2 years ago

Hey @Hoeze, ten days seems excessive! I would not expect it to take this long.

How many variants and genotypes are in your VCF? Please checkpoint to parquet/delta after the initial read of the VCF. Performance will be much better if you treat ingest separately from any ETL steps (normalize / split).
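
For example, a minimal sketch of that checkpointing pattern (the paths here are placeholders; swap the Delta writes for plain parquet if you prefer):

import glow

# Register Glow with the Spark session
spark = glow.register(spark)

# 1. Ingest: read the VCF once and checkpoint it to a columnar format
(
    spark.read.format('vcf')
    .option('flattenInfoFields', False)
    .load('/data/answerals.vcf.gz')      # placeholder path
    .write.format('delta')
    .save('/data/ingested.delta')        # placeholder path
)

# 2. ETL: run the expensive transformers against the columnar copy,
#    never against the original VCF
df = spark.read.format('delta').load('/data/ingested.delta')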

Either way, these QC transformers are the most expensive steps in the Glow pipeline; see the benchmarks on 500k samples / 250k variants: https://glow.readthedocs.io/en/latest/benchmarks.html

@kianfar77 what do you think? Is there a simple way to optimize this process?

williambrandler commented 2 years ago

spoke to @kianfar77,

best performance will be achieved when each step is run independently:

  1. ingest vcf to parquet/delta
  2. separate multiallelics and indels from the biallelic SNPs into their own parquet/delta tables (using the lengths of referenceAllele and alternateAlleles)
  3. split multiallelics -> checkpoint to parquet/delta
  4. normalize indels -> checkpoint to parquet/delta
  5. append the results from each step into a single delta table, then Z-order on contigName, start, and end

The bottleneck is most likely in split_multiallelics. It relies on a nested Spark transform with a UDF, which is unavoidable and quite slow in Spark. This is an area we can look at improving. But the majority of variants are not multiallelic, so if you follow steps 1-4 for now (sketched below) you will get good results (not on par with bcftools, but not 10 days either).
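
Roughly, steps 1-4 might look like this sketch (the paths are placeholders and the allele-length test is a simplified illustration, not the exact notebook code):

from pyspark.sql import functions as f
import glow

df = spark.read.format('delta').load('/data/ingested.delta')  # step 1 output

# Step 2: separate biallelic SNPs (which need neither splitting nor
# normalizing) from multiallelics and indels, using allele lengths
is_multi = f.size('alternateAlleles') > 1
is_snp = (
    (f.length('referenceAllele') == 1)
    & (f.length(f.element_at('alternateAlleles', 1)) == 1)
)
snps = df.filter(~is_multi & is_snp)
multi = df.filter(is_multi)
indels = df.filter(~is_multi & ~is_snp)

# Step 3: run the expensive splitter only on the multiallelic sites
glow.transform('split_multiallelics', multi) \
    .write.format('delta').save('/data/split.delta')

# Step 4: normalize the indels together with the freshly split sites
to_norm = spark.read.format('delta').load('/data/split.delta').union(indels)
glow.transform('normalize_variants', to_norm,
               reference_genome_path='/data/ref.fa') \
    .write.format('delta').save('/data/normalized.delta')

# Step 5: append everything into one table, then Z-order it on
# contigName, start, and end (OPTIMIZE ... ZORDER BY on Delta)
snps.write.format('delta').mode('append').save('/data/final.delta')
spark.read.format('delta').load('/data/normalized.delta') \
    .write.format('delta').mode('append').save('/data/final.delta')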

williambrandler commented 2 years ago

@Hoeze I pulled together a notebook that breaks out each step: https://github.com/projectglow/glow/pull/455

It uses the append feature of Delta Lake, but can be adapted for parquet.

Hoeze commented 2 years ago

Thanks a lot for putting so much effort into this issue @williambrandler! The notebook is really helpful :+1:

Another idea I had was to call bcftools via glow.transform. Do you think this would cause any issues compared to the built-in transformers?

For the record, here are the stats of the VCF:

# bcftools index -s als/genomics/4_JointGenotyping/AnswerALS-866-G-v1-release5_joint-vcf-vqsr-annotated.vcf.gz
# name, length, nr. of records
chr1    248956422       3359080
chr2    242193529       3630901
chr3    198295559       3050269
chr4    190214555       2960827
chr5    181538259       2762725
chr6    170805979       2664879
chr7    159345973       2442933
chr8    145138636       2397787
chr9    138394717       1816266
chr10   133797422       2089000
chr11   135086622       2076529
chr12   133275309       2034996
chr13   114364328       1511920
chr14   107043718       1373426
chr15   101991189       1233256
chr16   90338345        1360226
chr17   83257441        1202475
chr18   80373285        1186463
chr19   58617616        954582
chr20   64444167        970779
chr21   46709983        557983
chr22   50818468        569685
chrX    156040895       1442073

That's a total of 43,649,060 records.

williambrandler commented 2 years ago

You can use bcftools with the Glow Pipe Transformer

I actually have a notebook example of how to use bcftools to filter VCFs with the pipe transformer, which you can adapt, but I have not open sourced it yet. Drop an email to glow.contributors@gmail.com and I will send it over.
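
In the meantime, a minimal sketch of piping a DataFrame through bcftools norm with the pipe transformer (the FASTA path is a placeholder; bcftools and the reference must be installed on every worker):

import glow

# Stream each partition to bcftools as VCF text and parse the output
# back into a DataFrame
normalized = glow.transform(
    'pipe',
    df,
    cmd='["bcftools", "norm", "-m", "-", "-f", "/data/ref.fa"]',
    input_formatter='vcf',
    in_vcf_header='infer',
    output_formatter='vcf'
)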

Cheers

bio-bench commented 2 years ago

> You can use bcftools with the Glow Pipe Transformer
>
> I actually have a notebook example of how to use bcftools to filter VCFs with the pipe transformer, which you can adapt, but I have not open sourced it yet. Drop an email to glow.contributors@gmail.com and I will send it over.
>
> Cheers

@williambrandler - have you tried using bcftools with the Glow pipe transformer to merge VCF files?