snakemake-workflows / dna-seq-varlociraptor

A Snakemake workflow for calling small and structural variants under any kind of scenario (tumor/normal, tumor/normal/relapse, germline, pedigree, populations) via the unified statistical model of Varlociraptor.

Failure to extend header parameters from tables/output/annotation_fields for 'vembrane table --header "{params.config[header]}" "{params.config[expr]}"' #333

Open delpropo opened 3 days ago

delpropo commented 3 days ago

In table.smk, the rule vembrane_table calls get_vembrane_config to get the parameters for the vembrane header table. This should add the annotation fields from config.yaml.

Example from config.yaml

tables:
  activate: true
  output:
    annotation_fields:
      - BIOTYPE
      - LoFtool
      - REVEL
    event_prob: true
    observations: true
    short_observations: true
  generate_excel: false

get_vembrane_config appends the fields with the following code; in this example, the fields are BIOTYPE, LoFtool, and REVEL.

    annotation_fields = get_annotation_fields_for_tables(wildcards)
    append_items(
        annotation_fields,
        rename_ann_fields,
        "ANN['{}']".format,
        lambda x: x.lower(),
    )

sort_order is created and contains "ANN['SYMBOL']", "ANN['IMPACT']", etc., but it is a limited list. It is extended with additional values, but the result is still not a complete list of all values available as vembrane ANN types.

Complete list of custom ANN types: https://github.com/vembrane/vembrane/blob/main/docs/ann_types.md

Finally, sorted_columns_dict is created to return the values used in the vembrane_table rule. This drops any entries that were added to columns_dict but are not in sort_order. It would remove BIOTYPE and LoFtool, but not REVEL, since REVEL is in sort_order.

    # sort columns, keeping only those in sort_order
    sorted_columns_dict = {k: columns_dict[k] for k in sort_order if k in columns_dict}
    join_items = ", ".join
    return {
        "expr": join_items(sorted_columns_dict.keys()),
        "header": join_items(sorted_columns_dict.values()),
    }
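A minimal reproduction of the dropping behavior described above (the names come from the issue; `sort_order` below is a shortened stand-in for the real list):

```python
# Requested via tables: output: annotation_fields in config.yaml:
columns_dict = {
    "ANN['SYMBOL']": "symbol",
    "ANN['BIOTYPE']": "biotype",   # requested, absent from sort_order
    "ANN['LoFtool']": "loftool",   # requested, absent from sort_order
    "ANN['REVEL']": "revel",       # requested, present in sort_order
}
sort_order = ["ANN['SYMBOL']", "ANN['IMPACT']", "ANN['REVEL']"]

# sort columns, keeping only those in sort_order
sorted_columns_dict = {k: columns_dict[k] for k in sort_order if k in columns_dict}

# BIOTYPE and LoFtool are silently dropped:
assert "ANN['BIOTYPE']" not in sorted_columns_dict
assert sorted_columns_dict == {"ANN['SYMBOL']": "symbol", "ANN['REVEL']": "revel"}
```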

For a variants.fdr-controlled.tsv file, the final output therefore consists only of the following columns and cannot be extended.

symbol, impact, hgvsp, hgvsc, consequence, clinical significance, gnomad genome af, exon, revel, chromosome, position, reference allele, alternative allele, protein position, protein alteration (short), canonical, mane_plus_clinical, prob: absent, prob: artifact, prob: variant, patient: allele frequency, patient: read depth, patient: short ref observations, patient: short alt observations, patient: observations, id, gene, hgvsg, feature, end position

dlaehnemann commented 1 day ago

Thanks for digging into this and for taking the time to report this thoroughly!

We recently changed how this works, and it seems we haven't gotten to a clean solution, yet. REVEL and LoFtool should automagically work (as in: be included in the datavzrd report table) when they are specified via the annotations: vep: final_calls: plugins: list in the config.yaml. For other (more standard) ANN entries like BIOTYPE, I'm not sure how one should request those in the config.yaml, as we:

  1. Are looking to probably deprecate the whole tables: entry.
  2. Want to have a stable sort order of entries.
  3. Don't want to over-cram the default report table with all possible annotations, but rather focus on something like "the most informative in most cases":tm: .

So do you have a clear idea of what you would want in this case, and what you would want in a more general case? Maybe this can help us figure out how exactly we want to eventually resolve this.

delpropo commented 1 day ago

Vembrane table creates multiple lines for each variant based on annotation information. Because the files can be so large, I read the TSV and turn it into a zarr file, then process it in sections while also doing some initial filtering and grouping. I don't generally use the config.yaml file for filtering because I have to change the criteria so often. I have had to use it to remove some chromosomal locations which have caused downstream processing issues, as described in this article about problematic regions.

There can be multiple annotations for a single variant, which means some variants have dozens of lines after vembrane is used to create the table. To get around this, I group by each individual variant for the file (subject or family) and include all columns which are unique for each variant ("chromosome", "position", "allele", max_af, etc.). This ensures that each unique variant has its own line. Any additional annotation that has multiple values is aggregated into a list.
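The per-variant aggregation described above can be sketched with pandas (the column names here are illustrative, not the exact workflow output):

```python
import pandas as pd

# Two rows for the same variant (two transcripts), one for another variant.
df = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2"],
    "position":   [100, 100, 200],
    "allele":     ["A>T", "A>T", "G>C"],
    "biotype":    ["protein_coding", "retained_intron", "protein_coding"],
})

# One row per unique variant; multi-valued annotations become lists.
agg = (
    df.groupby(["chromosome", "position", "allele"], as_index=False)
      .agg({"biotype": list})
)
assert len(agg) == 2
assert agg.loc[0, "biotype"] == ["protein_coding", "retained_intron"]
```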

I currently process data using the flow chart method below. It can require allocating a lot of memory due to the size of the tsv files, but that is about the only issue.

I had thought about some alternative methods but those would have used the vcf file. The vcf files are already formatted for vembrane and I just needed to get the data out. The vembrane table rule already did this, and I can easily reformat using config.yaml.

graph TD
    A[BCF File] -->|vembrane| B[TSV File]
    B -->|xarray| C[Zarr Folder]          
    C -->|reread| H[Process by chromosome]
    H -->|filter, groupby| D[Processed Chr 1]
    H -->|filter, groupby| E[Processed Chr ..]
    D -->|aggregate| F[Save to TSV]
    E -->|aggregate| F[Save to TSV]   
    F -->|join| G[Final TSV File]
    I[Additional TSV files] -->|join| G[Final TSV File]
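The per-chromosome branch of this flow chart can be sketched with pandas alone (the zarr/xarray step is elided; the column names and the filter are illustrative, not the real criteria):

```python
import io
import pandas as pd

# Stand-in for the vembrane TSV output.
tsv = io.StringIO(
    "chromosome\tposition\tallele\timpact\n"
    "chr1\t100\tA>T\tHIGH\n"
    "chr1\t100\tA>T\tMODERATE\n"
    "chr2\t200\tG>C\tLOW\n"
)
df = pd.read_csv(tsv, sep="\t")

# Process each chromosome separately, then join the results back together.
parts = []
for chrom, chunk in df.groupby("chromosome"):
    filtered = chunk[chunk["impact"] != "LOW"]  # example filter
    grouped = (
        filtered.groupby(["chromosome", "position", "allele"], as_index=False)
                .agg({"impact": list})
    )
    parts.append(grouped)

final = pd.concat(parts, ignore_index=True)
assert len(final) == 1
assert final.loc[0, "impact"] == ["HIGH", "MODERATE"]
```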

I think this answers your questions? I attached a copy of one of my config.yaml files for reference since I use a lot more columns than just BIOTYPE. I have also included a test file as an example of the final data exported.

config_test_file.zip

dlaehnemann commented 1 day ago

A couple of thoughts of how you might get most of this done within the existing workflow:

  1. I think you could do all of your filtering before the vembrane table creation by providing custom filters in the config.yaml here: https://github.com/snakemake-workflows/dna-seq-varlociraptor/blob/207624516e1b0433b6f3dd489c6b5aee909121d1/config/config.yaml#L125 These can use arbitrary Python expressions, basically anything you can do with a vembrane filter command. As vembrane streams through the file, this should not require a lot of RAM. It will probably not get rid of multiple ANN entries per variant (basically one per transcript), but later scripts in the workflow should take care of that. Which brings me to:
  2. Have you previously used the report generated by snakemake --report report.zip? It should contain interactive views of the variant sets that you specify via the calling: fdr-control: events: definitions in config.yaml. You can execute this command once the workflow has successfully completed; then unpack the resulting ZIP file, extract the contained directory, and open the main index.html file. The contained interactive table views (generated with datavzrd) of the variants are what we tailor the vembrane table command to in the workflow.
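Point 1 above works because vembrane filter expressions are ordinary Python evaluated against each record's fields. A toy illustration of that idea (this is not vembrane's implementation, just the evaluation concept):

```python
# Toy records mimicking per-transcript ANN entries of a VCF.
records = [
    {"ANN": {"IMPACT": "HIGH", "BIOTYPE": "protein_coding"}},
    {"ANN": {"IMPACT": "LOW", "BIOTYPE": "retained_intron"}},
]

# A vembrane-style filter expression: plain Python over the fields.
expression = "ANN['IMPACT'] == 'HIGH' and ANN['BIOTYPE'] == 'protein_coding'"

# vembrane streams records and keeps those for which the expression is true.
kept = [r for r in records if eval(expression, {}, r)]
assert len(kept) == 1
assert kept[0]["ANN"]["IMPACT"] == "HIGH"
```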

So if you haven't, maybe you can look into the in-workflow vembrane filtering and the report, and see if this meets some of your needs and if you need more things amended in that setup?

Also, if you have suggestions on making any of what I explained here clearer when deploying and configuring the workflow, please let me know. (Useful) documentation is hard...