googlegenomics / gcp-variant-transforms

GCP Variant Transforms
Apache License 2.0

AnnotateShards not producing **vep_output.vcf #717

Closed KevinDuringWork closed 2 years ago

KevinDuringWork commented 2 years ago

Hello GoogleGenomics,

I'm running the latest release (v0.11.0) on ~50 samples, and in Dataflow I'm seeing AnnotateShards succeed in under 30s. However, I'm not seeing any file with the _vep_output.vcf suffix produced in the annotation directory. This is confirmed in a later step, get_compression_type, where input_pattern is searching for **_vep_output.vcf and extensions is coming up empty.

Has VEP silently failed on each VCF shard ({hash}/count_20000)? Some guidance would be greatly appreciated.

        _merge_headers(known_args, pipeline_args,
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/vcf_to_bq.py", line 376, in _merge_headers
        merged_header = _add_inferred_headers(infer_headers_input_pattern, p,
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/vcf_to_bq.py", line 168, in _add_inferred_headers
        _read_variants(all_patterns,
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/vcf_to_bq.py", line 122, in _read_variants
        return pipeline_common.read_variants(
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/pipeline_common.py", line 239, in read_variants
        compression_type = get_compression_type(all_patterns)
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/pipeline_common.py", line 124, in get_compression_type
        raise ValueError(f"All input files must be in the same format. {input_patterns}, {extensions}")
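One way to confirm whether any shard produced VEP output, before the pipeline reaches get_compression_type, is to filter a listing of the annotation directory for the suffix mentioned above. The following is a minimal sketch only: the bucket name, the object names, and the helper `find_vep_outputs` are hypothetical, and the listing is assumed to have been pasted in from something like `gsutil ls -r`.

```python
from fnmatch import fnmatchcase

# Suffix taken from the issue text; "*" also matches "/" in fnmatch patterns.
VEP_SUFFIX_PATTERN = "*_vep_output.vcf"


def find_vep_outputs(object_names):
    """Return the object names that end with the VEP output suffix."""
    return [name for name in object_names if fnmatchcase(name, VEP_SUFFIX_PATTERN)]


if __name__ == "__main__":
    # Hypothetical listing of the annotation output directory.
    listing = [
        "gs://my-bucket/annotation/abc123/count_20000",
        "gs://my-bucket/annotation/abc123/count_20000_vep_output.vcf",
    ]
    matches = find_vep_outputs(listing)
    if matches:
        print("VEP outputs found:", matches)
    else:
        print("No *_vep_output.vcf files found; annotation likely produced nothing.")
```

An empty result here would be consistent with the empty extensions value in the error above.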

Command fragment is as follows:

    COMMAND="vcf_to_bq \
      --pipeline_mode LARGE \
      --allow_malformed_record \
      --allow_incompatible_records \
      --infer_annotation_types \
      --variant_merge_strategy MOVE_TO_CALLS \
      --worker_machine_type n1-standard-16 \
      --worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd \
      --disk_size_gb 500 \
      --num_workers 25 \
      --max_num_workers 100 \
      --run_annotation_pipeline \
      --annotation_output_dir ${ANNOTATION_OUTPUT} \
      --input_pattern ${INPUT_PATTERN} \
      --output_table ${OUTPUT_TABLE} \
      --job_name vcf-to-bigquery \
      --runner DataflowRunner"

KevinDuringWork commented 2 years ago

Enabled "Cloud Life Sciences API" and restarted ingestion.

[EDIT] Enabling "Cloud Life Sciences API" did not help; the same error occurred.

KevinDuringWork commented 2 years ago

Removed flag --infer_annotation_types but a new error occurred.

    ...
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/libs/processed_variant.py", line 247, in __init__
        self._annotation_processor = _AnnotationProcessor(
      File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/libs/processed_variant.py", line 430, in __init__
        raise ValueError('{} INFO not found in the header'.format(field))
    ValueError: CSQ_VT INFO not found in the header
    ...

Doesn't this suggest that 'AnnotateShards' has failed?
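For anyone hitting the same message, a quick check is to scan a VCF header for the INFO field named in the error. This is a minimal sketch, not Variant Transforms code: the helper `declared_info_ids` and the sample header lines are hypothetical, and it relies only on the standard `##INFO=<ID=...,...>` header syntax from the VCF specification.

```python
import re

# Matches the ID of a VCF meta-information line such as
# ##INFO=<ID=CSQ_VT,Number=.,Type=String,Description="...">
_INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def declared_info_ids(header_lines):
    """Collect the IDs of all INFO fields declared in the header lines."""
    ids = set()
    for line in header_lines:
        match = _INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


if __name__ == "__main__":
    # Hypothetical header of an unannotated shard: no CSQ_VT declaration.
    header = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    if "CSQ_VT" not in declared_info_ids(header):
        print("CSQ_VT is not declared; the file does not look annotated.")
```

If the field is missing from every input header, the annotation step most likely never wrote its output.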

KevinDuringWork commented 2 years ago

Adding the "Life Science Workflow Runner" role to the compute worker fixed it. Closing issue.
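For reference, the fix could be applied from the command line roughly as follows. This is a sketch under assumptions: the issue gives only the role's display name, so the role ID `roles/lifesciences.workflowsRunner` is assumed here, and the project ID, service account email, and helper `build_grant_command` are hypothetical placeholders. The script only prints the `gcloud projects add-iam-policy-binding` command; it does not run it.

```python
import shlex

# Assumed role ID for the role named in the comment above; verify it in the
# IAM console before use.
ASSUMED_ROLE_ID = "roles/lifesciences.workflowsRunner"


def build_grant_command(project_id, service_account_email, role=ASSUMED_ROLE_ID):
    """Build the gcloud argument list that binds a role to a service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account_email}",
        f"--role={role}",
    ]


if __name__ == "__main__":
    # Hypothetical project and worker service account.
    command = build_grant_command(
        "my-project",
        "123456789-compute@developer.gserviceaccount.com",
    )
    print(shlex.join(command))
```

Printing the command rather than executing it leaves room to confirm the role and member before changing the project's IAM policy.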