Open delpropo opened 1 month ago
Thanks for digging into this and for taking the time to report this thoroughly!
We recently changed how this works, and it seems we haven't gotten to a clean solution,yet. REVEL
and LoFtool
should automagically work (as in: be included in the datavzrd report table) when they are specified via the annotations: vep: final_calls: plugins:
list in the config.yaml
. For other (more standard) ANN
entries like BIOTYPE
, I'm not sure how one should request those in the config.yaml
, as we are:
tables:
entry.So do you have a clear idea of what you would want in this case, and what you would want in a more general case? Maybe this can help us figure out how exactly we want to eventually resolve this.
Vembrane table creates multiple lines for each variant based on annotation information. Because the files can be so large, I read the tsv and turn it into a zarr file then process it in sections while also doing some initial filtering and grouping. I don't generally use the config.yaml file for filtering because I have to change the criteria so often. I have had to use it to remove some chromosomal locations which have caused downstream processing issues as described in this article about problematic regions.
There can be multiple annotations for a single variant which means some variants have dozens of lines after vembrane is used to create the table. To get around this, I groupby each each individual variant for the file (subject or family) and include all columns which are unique for each variant ("chromosome", "position", "allele", max_af, etc). This ensures that each unique variant has its own line. Any additional annotation that has multiple values is aggregated into a list.
I currently process data using the flow chart method below. It can require allocating a lot of memory due to the size of the tsv files, but that is about the only issue.
I had thought about some alternative methods but those would have used the vcf file. The vcf files are already formatted for vembrane and I just needed to get the data out. The vembrane table rule already did this, and I can easily reformat using config.yaml.
graph TD
A[BCF File] -->|vembrane| B[TSV File]
B -->|xarray| C[Zarr Folder]
C -->|reread| H[Process by chromosome]
H -->|filter, groupby | D[Processed Chr 1]
H -->|filter, groupby| E[Processed Chr ..]
D -->|aggregate| F[Save to TSV]
E -->|aggregate| F[Save to TSV]
F -->|join| G[Final TSV File]
I[Additional TSV files] -->|join| G[Final TSV File]
I think this answers your questions? I attached a copy of one of my config.yaml files for reference since I use a lot more columns than just BIOTYPE. I have also included a test file as an example of the final data exported.
A couple of thoughts of how you might get most of this done within the existing workflow:
vembrane table
creation, by providing custom filters in the config.yaml
here:
https://github.com/snakemake-workflows/dna-seq-varlociraptor/blob/207624516e1b0433b6f3dd489c6b5aee909121d1/config/config.yaml#L125
These can use arbitrary Python expressions, basically anything you can do with a vembrane filter
command. As vembrane does this by streaming through the file, this should not require a lot of RAM. This will probably not include getting rid of multiple ANN entries per variant (basically per transcript), but that is something that later scripts in the workflow should take care of. Which brings me to:snakemake --report report.zip
? It should contain interactive views of you variant sets that you specify via the calling: fdr-control: events:
definitions in config.yaml
. You can execute this command, once the workflow has successfully completed, and if you unpack the resulting ZIP file, you just extract the contained directory and open the main index.html
file. The contained interactive table views (generated with datavzrd
of the variants are what we tailor the vembrane table
command to in the workflow.So if you haven't, maybe you can look into the in-workflow vembrane filtering and the report, and see if this meets some of your needs and if you need more things amended in that setup?
Also, if you have suggestions on making any of what I explained here clearer when deploying and configuring the workflow, please let me know. (Useful) documentation is hard...
In table.smk, the rule vembrane_table calls get_vembrane_config to get the parameters for the vembrane header table. This should add the annotation fields from config.yaml.
Example from config.yaml
get_vembrane_config appends the fields with the following code. The fields being BIOTYPE, Loftool, and REVEL in this example.
sort_order is created and contains "ANN['SYMBOL']", "ANN['IMPACT']", etc, but is a limited list. It is extended with additional values but the list in not a complete list of all values available in vembrane ANN types.
Complete list of custome ANN types. https://github.com/vembrane/vembrane/blob/main/docs/ann_types.md
Finally, the sorted_columns_dict is created to return the values used in the vembrane_table rule. This removes any values which were added to columns_dict if they are not in sort_order. It would remove BIOTYPE and Loftool, but not REVEL as it is in sort_order.
For a variants.fdr-controlled.tsv file, the final output of the tsv files consists only of the following columns and can't be extended.
symbol impact hgvsp hgvsc consequence clinical significance gnomad genome af exon revel chromosome position reference allele alternative allele protein position protein alteration (short) canonical mane_plus_clinical prob: absent prob: artifact prob: variant patient: allele frequency patient: read depth patient: short ref observations patient: short alt observations patient: observations id gene hgvsg feature end position