populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Change Clinvar Hail Table generation to per-line JSON #174

Closed · MattWellie closed this 1 year ago

MattWellie commented 1 year ago

I've experienced quite a few headaches with the generate-new-clinvar data process. Each row is currently assembled as a dict along these lines:

import hail as hl

# one dict per ClinVar decision; allele_map, allele_id, rating, and stars
# come from the surrounding loop
panelapp_data = [
    {
        'locus': hl.Locus(
            contig=allele_map[allele_id]['chrom'],
            position=allele_map[allele_id]['pos'],
        ),
        'alleles': [allele_map[allele_id]['ref'], allele_map[allele_id]['alt']],  # [str, str]
        'contig': allele_map[allele_id]['chrom'],  # str
        'position': allele_map[allele_id]['pos'],  # int
        'id': allele_id,  # int
        'rating': rating.value,  # str
        'stars': stars,  # int
        'allele_id': allele_map[allele_id]['allele'],  # int
    },
]

So far, this process takes about 40 minutes in Hail Batch, even with insanely high resource provisioning. It was the only way I could get it working without a big design change, and I ran into many errors along the way.
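
For reference, the expensive step presumably boils down to something like this minimal sketch (not the actual AIP code; hl.Table.from_pandas is the real Hail API, while the output path and variable names are illustrative):

import hail as hl
import pandas as pd

# build a DataFrame from the per-allele dicts shown above, then hand it
# to Hail; this is the translation that demanded ~80GB for ~200MB of data
df = pd.DataFrame(panelapp_data)
ht = hl.Table.from_pandas(df, key=['locus'])
ht.write('clinvar_decisions.ht', overwrite=True)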

I don't really understand why this is so difficult; it's not huge data... I got something working that I'm pretty unhappy with, given the monster resources it needs. If we honed this down to the simple process it could be, each AIP run could copy the ClinVar data from the NCBI FTP site, re-summarise the latest data, and run on that basis. At the moment, that would double the runtime and cost of AIP.

I would like to try Vlad's approach: creating one file with one JSON entry per line, and parsing it directly into a Hail Table with a given schema. This works for the core pipeline and the VEP data, which are larger in every conceivable way. This might be the one weird trick that stops Hail exploding. There may also be some repartitioning that could fix this, e.g. operating on a single partition so that the sort operations don't balloon out of control.
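
A minimal sketch of that JSON-lines approach (hl.import_table, hl.parse_json, and hl.locus are real Hail APIs; the file path, field names, and schema here are illustrative assumptions, not the actual AIP code):

import hail as hl

# the expected shape of each JSON line
row_type = hl.dtype(
    'struct{chrom: str, pos: int32, ref: str, alt: str, '
    'id: int32, rating: str, stars: int32, allele_id: int32}'
)

# read each line as a single string field (f0), then parse it as JSON
ht = hl.import_table('clinvar_decisions.jsonl', no_header=True)
parsed = hl.parse_json(ht.f0, row_type)
ht = ht.transmute(
    locus=hl.locus(parsed.chrom, parsed.pos, reference_genome='GRCh38'),
    alleles=[parsed.ref, parsed.alt],
    id=parsed.id,
    rating=parsed.rating,
    stars=parsed.stars,
    allele_id=parsed.allele_id,
)

# keying is the only sort; on a small, single-partition table it stays cheap
ht = ht.key_by('locus', 'alleles')
ht.write('clinvar_decisions.ht', overwrite=True)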

cassimons commented 1 year ago

Very cool. So is the lesson here that pandas -> ht is broken at the moment?

MattWellie commented 1 year ago

@cassimons I actually don't know - I haven't got the expertise to dig into why exactly this is happening, but even in my wildest nightmares I didn't expect a VM provision of 80GB to be required when translating 200MB of data between formats. It reminds me of the compound-het calculations way back when, which also caused absurd memory blowouts. I'll post this on Zulip to see if they have a good explanation, but without @vladsavelyev's template code to follow, I wouldn't have had a good solution for this.

My guess is that because I created the pandas DataFrame indexed on a Hail datatype field (Locus) and then converted it into a Hail Table, some expensive sorting took place. I'm lost on why that same sorting wouldn't happen when the key is set on an existing table, since the data was in the same order in both scenarios.
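
To illustrate the two paths being contrasted, a hedged sketch, reusing the hypothetical df from the earlier snippet:

import hail as hl

# path 1: key supplied during conversion - where the blow-up was observed
ht = hl.Table.from_pandas(df, key=['locus'])

# path 2: convert unkeyed, then set the key on the existing table;
# both should sort by locus, yet only path 1 exploded in memory
ht = hl.Table.from_pandas(df)
ht = ht.key_by('locus')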

MattWellie commented 1 year ago

Zulip thread: https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Expensive.20translation.20from.20Pandas.20DF.20to.20Hail.20Table