populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Change Clinvar Hail Table generation to per-line JSON #174

Closed · MattWellie closed this 1 year ago

MattWellie commented 1 year ago

I've experienced quite a few headaches with the generate-new-clinvar data process. Each row is currently assembled as a dict along these lines:

import hail as hl

# one dict per ClinVar decision; allele_map, allele_id, rating, and stars
# come from the surrounding loop
panelapp_data = [
    {
        'locus': hl.Locus(
            contig=allele_map[allele_id]['chrom'],
            position=allele_map[allele_id]['pos'],
        ),
        'alleles': [allele_map[allele_id]['ref'], allele_map[allele_id]['alt']],  # [str, str]
        'contig': allele_map[allele_id]['chrom'],  # str
        'position': allele_map[allele_id]['pos'],  # int
        'id': allele_id,  # int
        'rating': rating.value,  # str
        'stars': stars,  # int
        'allele_id': allele_map[allele_id]['allele'],  # int
    },
]

So far, this process takes about 40 minutes in Hail Batch, even with insanely high resource provisioning. It was the only way I could get it working without a big design change, and I ran into many errors along the way.
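
For reference, the expensive step presumably boils down to something like this minimal sketch (not the actual AIP code; hl.Table.from_pandas is the real Hail API, while the output path and variable names are illustrative):

import hail as hl
import pandas as pd

# build a DataFrame from the per-allele dicts shown above, then hand it
# to Hail; this is the translation that demanded ~80GB for ~200MB of data
df = pd.DataFrame(panelapp_data)
ht = hl.Table.from_pandas(df, key=['locus'])
ht.write('clinvar_decisions.ht', overwrite=True)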

I don't really understand why this is so difficult; it's not huge data... I got something working that I'm pretty unhappy with, given the monster resources it needs. If we honed this down to the simple process it could be, each AIP run could copy the ClinVar data from the NCBI FTP site, re-summarise the latest data, and run on that basis. At the moment, that would double the runtime and cost of AIP.

I would like to try Vlad's approach: creating one file with one JSON entry per line, and parsing it directly into a Hail Table with a given schema. This works for the core pipeline and the VEP data, which are larger in every conceivable way. This might be the one weird trick that stops Hail exploding. There may also be some repartitioning that could fix this, e.g. operating on a single partition so that the sort operations don't balloon out of control.
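
A minimal sketch of that JSON-lines approach (hl.import_table, hl.parse_json, and hl.locus are real Hail APIs; the file path, field names, and schema here are illustrative assumptions, not the actual AIP code):

import hail as hl

# the expected shape of each JSON line
row_type = hl.dtype(
    'struct{chrom: str, pos: int32, ref: str, alt: str, '
    'id: int32, rating: str, stars: int32, allele_id: int32}'
)

# read each line as a single string field (f0), then parse it as JSON
ht = hl.import_table('clinvar_decisions.jsonl', no_header=True)
parsed = hl.parse_json(ht.f0, row_type)
ht = ht.transmute(
    locus=hl.locus(parsed.chrom, parsed.pos, reference_genome='GRCh38'),
    alleles=[parsed.ref, parsed.alt],
    id=parsed.id,
    rating=parsed.rating,
    stars=parsed.stars,
    allele_id=parsed.allele_id,
)

# keying is the only sort; on a small, single-partition table it stays cheap
ht = ht.key_by('locus', 'alleles')
ht.write('clinvar_decisions.ht', overwrite=True)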

cassimons commented 1 year ago

Very cool. So is the lesson here that pandas -> ht is broken at the moment?

MattWellie commented 1 year ago

@cassimons I actually don't know - I haven't got the expertise to dig into why exactly this is happening, but even in my wildest nightmares I didn't expect a VM provision of 80GB to be required when translating 200MB of data between formats. It reminds me of the compound-het calculations way back when, which also caused absurd memory blowouts. I'll post this on Zulip to see if they have a good explanation, but without @vladsavelyev's template code to follow, I wouldn't have had a good solution for this.

My guess is that because I created the pandas DataFrame indexed on a Hail datatype field (Locus) and then converted it into a Hail Table, some expensive sorting took place. I'm lost on why that same sorting wouldn't happen when the key is set on an existing table, since the data was in the same order in both scenarios.
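
To illustrate the two paths being contrasted, a hedged sketch, reusing the hypothetical df from the earlier snippet:

import hail as hl

# path 1: key supplied during conversion - where the blow-up was observed
ht = hl.Table.from_pandas(df, key=['locus'])

# path 2: convert unkeyed, then set the key on the existing table;
# both should sort by locus, yet only path 1 exploded in memory
ht = hl.Table.from_pandas(df)
ht = ht.key_by('locus')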

MattWellie commented 1 year ago

Zulip thread: https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Expensive.20translation.20from.20Pandas.20DF.20to.20Hail.20Table