loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0
34 stars 2 forks source link

Bad (?) data can crash preprocessing #2339

Closed theosanderson closed 2 months ago

theosanderson commented 2 months ago

I uploaded the example data for ebola zaire (which we may be out of date) and found preprocessing stalled. When looking at the logs I found a crash, which shouldn't happen:

INFO:root:Using config: Config(organism='ebola-zaire', backend_host='http://loculus-backend-service:8079/ebola-zaire', keycloak_host='https://authentication-main.loculus.org', keycloak_user='preprocessing_pipeline', keycloak_password='fc97e3c00001d9d3a5', keycloak_token_path='realms/loculus/protocol/openid-connect/token', nextclade_dataset_name='nextstrain/ebola/zaire', nextclade_dataset_tag=None, nextclade_dataset_server='https://raw.githubusercontent.com/nextstrain/nextclade_data/ebola/data_output', config_file='/etc/config/preprocessing-config.yaml', log_level='DEBUG', genes=['NP', 'VP35', 'VP40', 'GP', 'sGP', 'ssGP', 'VP30', 'VP24', 'L'], nucleotideSequences=['main'], keep_tmp_dir=False, reference_length=197209, batch_size=100, processing_spec={'sample_collection_date': {'function': 'process_date', 'inputs': {'date': 'sample_collection_date'}, 'args': {'type': 'date'}, 'required': True}, 'display_name': {'function': 'concatenate', 'inputs': {'geo_loc_country': 'geo_loc_country', 'sample_collection_date': 'sample_collection_date'}, 'args': {'order': ['geo_loc_country', 'accession_version', 'sample_collection_date'], 'type': ['string', 'string', 'date']}}, 'ncbi_release_date': {'function': 'parse_timestamp', 'inputs': {'timestamp': 'ncbi_release_date'}, 'args': {'type': 'date'}}, 'ncbi_update_date': {'function': 'parse_timestamp', 'inputs': {'timestamp': 'ncbi_update_date'}, 'args': {'type': 'date'}}, 'geo_loc_country': {'function': 'identity', 'inputs': {'input': 'geo_loc_country'}, 'args': None, 'required': True}, 'geo_loc_admin_1': {'function': 'identity', 'inputs': {'input': 'geo_loc_admin_1'}, 'args': None}, 'geo_loc_admin_2': {'function': 'identity', 'inputs': {'input': 'geo_loc_admin_2'}, 'args': None}, 'geo_loc_city': {'function': 'identity', 'inputs': {'input': 'geo_loc_city'}, 'args': None}, 'geo_loc_site': {'function': 'identity', 'inputs': {'input': 'geo_loc_site'}, 'args': None}, 'specimen_collector_sample_id': {'function': 'identity', 'inputs': {'input': 'specimen_collector_sample_id'}, 'args': None}, 'authors': {'function': 'identity', 'inputs': {'input': 'authors'}, 'args': {'type': 'authors'}}, 'author_affiliations': {'function': 'identity', 'inputs': {'input': 'author_affiliations'}, 'args': None}, 'ncbi_submitter_country': {'function': 'identity', 'inputs': {'input': 'ncbi_submitter_country'}, 'args': None}, 'insdc_accession_base': {'function': 'identity', 'inputs': {'input': 'insdc_accession_base'}, 'args': None}, 'insdc_version': {'function': 'identity', 'inputs': {'input': 'insdc_version'}, 'args': {'type': 'int'}}, 'insdc_accession_full': {'function': 'identity', 'inputs': {'input': 'insdc_accession_full'}, 'args': None}, 'bioproject_accessions': {'function': 'identity', 'inputs': {'input': 'bioproject_accessions'}, 'args': None}, 'biosample_accession': {'function': 'identity', 'inputs': {'input': 'biosample_accession'}, 'args': None}, 'culture_id': {'function': 'identity', 'inputs': {'input': 'culture_id'}, 'args': None}, 'sample_received_date': {'function': 'process_date', 'inputs': {'date': 'sample_received_date'}, 'args': {'type': 'date'}}, 'sample_type': {'function': 'identity', 'inputs': {'input': 'sample_type'}, 'args': None}, 'purpose_of_sampling': {'function': 'identity', 'inputs': {'input': 'purpose_of_sampling'}, 'args': None}, 'presampling_activity': {'function': 'identity', 'inputs': {'input': 'presampling_activity'}, 'args': None}, 'anatomical_material': {'function': 'identity', 'inputs': {'input': 'anatomical_material'}, 'args': None}, 'anatomical_part': {'function': 'identity', 'inputs': {'input': 'anatomical_part'}, 'args': None}, 'body_product': {'function': 'identity', 'inputs': {'input': 'body_product'}, 'args': None}, 'environmental_material': {'function': 'identity', 'inputs': {'input': 'environmental_material'}, 'args': None}, 'environmental_site': {'function': 'identity', 'inputs': {'input': 'environmental_site'}, 'args': None}, 'collection_device': {'function': 'identity', 'inputs': {'input': 'collection_device'}, 'args': None}, 'collection_method': {'function': 'identity', 'inputs': {'input': 'collection_method'}, 'args': None}, 'food_product': {'function': 'identity', 'inputs': {'input': 'food_product'}, 'args': None}, 'food_product_properties': {'function': 'identity', 'inputs': {'input': 'food_product_properties'}, 'args': None}, 'specimen_processing': {'function': 'identity', 'inputs': {'input': 'specimen_processing'}, 'args': None}, 'specimen_processing_details': {'function': 'identity', 'inputs': {'input': 'specimen_processing_details'}, 'args': None}, 'experimental_specimen_role_type': {'function': 'identity', 'inputs': {'input': 'experimental_specimen_role_type'}, 'args': None}, 'host_age': {'function': 'identity', 'inputs': {'input': 'host_age'}, 'args': {'type': 'int'}}, 'host_age_bin': {'function': 'identity', 'inputs': {'input': 'host_age_bin'}, 'args': None}, 'host_gender': {'function': 'identity', 'inputs': {'input': 'host_gender'}, 'args': None}, 'host_origin_country': {'function': 'identity', 'inputs': {'input': 'host_origin_country'}, 'args': None}, 'host_disease': {'function': 'identity', 'inputs': {'input': 'host_disease'}, 'args': None}, 'signs_and_symptoms': {'function': 'identity', 'inputs': {'input': 'signs_and_symptoms'}, 'args': None}, 'host_health_state': {'function': 'identity', 'inputs': {'input': 'host_health_state'}, 'args': None}, 'host_health_outcome': {'function': 'identity', 'inputs': {'input': 'host_health_outcome'}, 'args': None}, 'travel_history': {'function': 'identity', 'inputs': {'input': 'travel_history'}, 'args': None}, 'exposure_event': {'function': 'identity', 'inputs': {'input': 'exposure_event'}, 'args': None}, 'host_role': {'function': 'identity', 'inputs': {'input': 'host_role'}, 'args': None}, 'exposure_setting': {'function': 'identity', 'inputs': {'input': 'exposure_setting'}, 'args': None}, 'exposure_details': {'function': 'identity', 'inputs': {'input': 'exposure_details'}, 'args': None}, 'previous_infection_disease': {'function': 'identity', 'inputs': {'input': 'previous_infection_disease'}, 'args': None}, 'previous_infection_organism': {'function': 'identity', 'inputs': {'input': 'previous_infection_organism'}, 'args': None}, 'host_vaccination_status': {'function': 'identity', 'inputs': {'input': 'host_vaccination_status'}, 'args': None}, 'purpose_of_sequencing': {'function': 'identity', 'inputs': {'input': 'purpose_of_sequencing'}, 'args': None}, 'sequencing_date': {'function': 'process_date', 'inputs': {'date': 'sequencing_date'}, 'args': {'type': 'date'}}, 'amplicon_pcr_primer_scheme': {'function': 'identity', 'inputs': {'input': 'amplicon_pcr_primer_scheme'}, 'args': None}, 'amplicon_size': {'function': 'identity', 'inputs': {'input': 'amplicon_size'}, 'args': None}, 'sequencing_instrument': {'function': 'identity', 'inputs': {'input': 'sequencing_instrument'}, 'args': None}, 'sequencing_protocol': {'function': 'identity', 'inputs': {'input': 'sequencing_protocol'}, 'args': None}, 'sequencing_assay_type': {'function': 'identity', 'inputs': {'input': 'sequencing_assay_type'}, 'args': None}, 'sequenced_by_organization': {'function': 'identity', 'inputs': {'input': 'sequenced_by_organization'}, 'args': None}, 'sequenced_by_contact_name': {'function': 'identity', 'inputs': {'input': 'sequenced_by_contact_name'}, 'args': None}, 'sequenced_by_contact_email': {'function': 'identity', 'inputs': {'input': 'sequenced_by_contact_email'}, 'args': None}, 'raw_sequence_data_processing_method': {'function': 'identity', 'inputs': {'input': 'raw_sequence_data_processing_method'}, 'args': None}, 'dehosting_method': {'function': 'identity', 'inputs': {'input': 'dehosting_method'}, 'args': None}, 'reference_genome_accession': {'function': 'identity', 'inputs': {'input': 'reference_genome_accession'}, 'args': None}, 'consensus_sequence_software_name': {'function': 'identity', 'inputs': {'input': 'consensus_sequence_software_name'}, 'args': None}, 'consensus_sequence_software_version': {'function': 'identity', 'inputs': {'input': 'consensus_sequence_software_version'}, 'args': None}, 'depth_of_coverage': {'function': 'identity', 'inputs': {'input': 'depth_of_coverage'}, 'args': {'type': 'int'}}, 'breadth_of_coverage': {'function': 'identity', 'inputs': {'input': 'breadth_of_coverage'}, 'args': {'type': 'int'}}, 'quality_control_method_name': {'function': 'identity', 'inputs': {'input': 'quality_control_method_name'}, 'args': None}, 'quality_control_method_version': {'function': 'identity', 'inputs': {'input': 'quality_control_method_version'}, 'args': None}, 'quality_control_determination': {'function': 'identity', 'inputs': {'input': 'quality_control_determination'}, 'args': None}, 'quality_control_issues': {'function': 'identity', 'inputs': {'input': 'quality_control_issues'}, 'args': None}, 'quality_control_details': {'function': 'identity', 'inputs': {'input': 'quality_control_details'}, 'args': None}, 'diagnostic_measurement_method': {'function': 'identity', 'inputs': {'input': 'diagnostic_measurement_method'}, 'args': None}, 'diagnostic_target_presence': {'function': 'identity', 'inputs': {'input': 'diagnostic_target_presence'}, 'args': None}, 'diagnostic_target_gene_name': {'function': 'identity', 'inputs': {'input': 'diagnostic_target_gene_name'}, 'args': None}, 'diagnostic_measurement_value': {'function': 'identity', 'inputs': {'input': 'diagnostic_measurement_value'}, 'args': None}, 'diagnostic_measurement_unit': {'function': 'identity', 'inputs': {'input': 'diagnostic_measurement_unit'}, 'args': None}, 'length': {'function': 'identity', 'inputs': {'input': 'length'}, 'args': {'type': 'int'}}, 'host_name_scientific': {'function': 'identity', 'inputs': {'input': 'host_name_scientific'}, 'args': None}, 'host_name_common': {'function': 'identity', 'inputs': {'input': 'host_name_common'}, 'args': None}, 'host_taxon_id': {'function': 'identity', 'inputs': {'input': 'host_taxon_id'}, 'args': {'type': 'int'}}, 'is_lab_host': {'function': 'identity', 'inputs': {'input': 'is_lab_host'}, 'args': None}, 'cell_line': {'function': 'identity', 'inputs': {'input': 'cell_line'}, 'args': None}, 'passage_number': {'function': 'identity', 'inputs': {'input': 'passage_number'}, 'args': {'type': 'int'}}, 'passage_method': {'function': 'identity', 'inputs': {'input': 'passage_method'}, 'args': None}, 'ncbi_protein_count': {'function': 'identity', 'inputs': {'input': 'ncbi_protein_count'}, 'args': {'type': 'int'}}, 'ncbi_sourcedb': {'function': 'identity', 'inputs': {'input': 'ncbi_sourcedb'}, 'args': None}, 'ncbi_virus_name': {'function': 'identity', 'inputs': {'input': 'ncbi_virus_name'}, 'args': None}, 'ncbi_virus_tax_id': {'function': 'identity', 'inputs': {'input': 'ncbi_virus_tax_id'}, 'args': {'type': 'int'}}, 'sra_run_accession': {'function': 'identity', 'inputs': {'input': 'sra_run_accession'}, 'args': None}, 'total_snps': {'function': 'identity', 'inputs': {'input': 'nextclade.totalSubstitutions'}, 'args': {'type': 'int'}}, 'total_inserted_nucs': {'function': 'identity', 'inputs': {'input': 'nextclade.totalInsertions'}, 'args': {'type': 'int'}}, 'total_deleted_nucs': {'function': 'identity', 'inputs': {'input': 'nextclade.totalDeletions'}, 'args': {'type': 'int'}}, 'total_ambiguous_nucs': {'function': 'identity', 'inputs': {'input': 'nextclade.totalNonACGTNs'}, 'args': {'type': 'int'}}, 'total_unknown_nucs': {'function': 'identity', 'inputs': {'input': 'nextclade.totalMissing'}, 'args': {'type': 'int'}}, 'total_frame_shifts': {'function': 'identity', 'inputs': {'input': 'nextclade.totalFrameShifts'}, 'args': {'type': 'int'}}, 'frame_shifts': {'function': 'identity', 'inputs': {'input': 'nextclade.frameShifts'}, 'args': None}, 'completeness': {'function': 'identity', 'inputs': {'input': 'nextclade.coverage'}, 'args': {'type': 'float'}}}, pipeline_version=1)
INFO:root:Downloading Nextclade dataset: ['nextclade3', 'dataset', 'get', '--name=nextstrain/ebola/zaire', '--server=https://raw.githubusercontent.com/nextstrain/nextclade_data/ebola/data_output', '--output-dir=/tmp/tmprm1pv93j']
INFO:root:Nextclade dataset downloaded successfully
DEBUG:root:Fetching unprocessed sequences
DEBUG:root:Fetching 100 unprocessed sequences from http://loculus-backend-service:8079/ebola-zaire/extract-unprocessed-data
DEBUG:root:Requesting JWT from https://authentication-main.loculus.org/realms/loculus/protocol/openid-connect/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): authentication-main.loculus.org:443
DEBUG:urllib3.connectionpool:https://authentication-main.loculus.org:443 "POST /realms/loculus/protocol/openid-connect/token HTTP/11" 200 1366
DEBUG:root:JWT fetched successfully.
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): loculus-backend-service:8079
DEBUG:urllib3.connectionpool:http://loculus-backend-service:8079 "POST /ebola-zaire/extract-unprocessed-data HTTP/11" 200 None
DEBUG:charset_normalizer:Encoding detection: utf_8 is most likely the one.
DEBUG:root:Running nextclade: ['nextclade3', 'run', '--output-all=/tmp/tmpdpt4fnvq', '--input-dataset=/tmp/tmprm1pv93j', '--output-translations=/tmp/tmpdpt4fnvq/nextclade.cds_translation.{cds}.fasta', '--jobs=1', '--', '/tmp/tmpdpt4fnvq/input.fasta']
DEBUG:root:Nextclade results available in /tmp/tmpdpt4fnvq
DEBUG:loculus_preprocessing.processing_functions:input_data: {'date': None}
DEBUG:loculus_preprocessing.processing_functions:release_date: None
DEBUG:loculus_preprocessing.processing_functions:date_str:
DEBUG:loculus_preprocessing.processing_functions:input_data: {'date': None}
DEBUG:loculus_preprocessing.processing_functions:release_date: None
DEBUG:loculus_preprocessing.processing_functions:date_str:
DEBUG:root:formatted input data:[None, 'LOC_000SCE9.1', '']
ERROR:loculus_preprocessing.processing_functions:Error calling function concatenatewith input {'geo_loc_country': None, 'sample_collection_date': None} and args {'order': ['geo_loc_country', 'accession_version', 'sample_collection_date'], 'type': ['string', 'string', 'date'], 'accession_version': 'LOC_000SCE9.1'}: sequence item 0: expected str instance, NoneType found
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/processing_functions.py", line 33, in call_function
    result = func(input_data, output_field, args=args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/processing_functions.py", line 334, in concatenate
    result = "/".join(formatted_input_data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: sequence item 0: expected str instance, NoneType found
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/prepro.py", line 439, in get_metadata
    processing_result = ProcessingFunctions.call_function(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/processing_functions.py", line 40, in call_function
    if isinstance(result, ProcessingResult):
                  ^^^^^^
UnboundLocalError: cannot access local variable 'result' where it is not associated with a value

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/prepro", line 8, in <module>
    sys.exit(cli_entry())
             ^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/__main__.py", line 16, in cli_entry
    run(config)
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/prepro.py", line 608, in run
    processed = process_all(unprocessed, dataset_dir, config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/prepro.py", line 562, in process_all
    processed_single = process_single(id, result, config)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/prepro.py", line 508, in process_single
    processing_result = get_metadata(
                        ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/loculus_preprocessing/prepro.py", line 444, in get_metadata
    raise RuntimeError(msg) from e
RuntimeError: Processing for spec: ProcessingSpec(inputs={'geo_loc_country': 'geo_loc_country', 'sample_collection_date': 'sample_collection_date'}, function='concatenate', required=False, args={'order': ['geo_loc_country', 'accession_version', 'sample_collection_date'], 'type': ['string', 'string', 'date']}) with input data: {'geo_loc_country': None, 'sample_collection_date': None} failed with cannot access local variable 'result' where it is not associated with a value
anna-parker commented 2 months ago

I think the problem I forgot to handle the case when an input metadata field doesn't exist in the input file (not just that it is empty) in concatenate. And now we have renamed country to geo_loc_country so we hit this edge case - but good this happens now and not on prod! Thanks for catching this! I will create a PR tomorrow :-)

anna-parker commented 2 months ago

Should be fixed now :-)