uio-bmi / immuneML

immuneML is a platform for machine learning analysis of adaptive immune receptor repertoire data.
https://immuneml.uio.no
GNU Affero General Public License v3.0

Error while running the quickstart analysis #172

Closed: kvegesan-stjude closed this issue 4 months ago

kvegesan-stjude commented 5 months ago

Hello, I've just started using this package and the installation went well. When I tried to run the quickstart analysis I kept running into the error shown below.

I think the error happens when the program tries to read the synthetic dataset in AIRR format, but there is some issue with the way the columns are specified.

I inspected the synthetic AIRR file rep_0.tsv and found that the sequence_id column looks irregular. Here is an excerpt of the file contents:

sequence_id sequence    rev_comp    productive  v_call  d_call  j_call  sequence_alignment  germline_alignment  junction    junction_aa v_cigar d_cigar j_cigar cdr3_aa locus   duplicate_count vj_in_frame stop_codon  my_signal
6           T   TRBV1-1*01      TRBJ1-1*01                              FYRVSIWQQENE    TRB 1   T   F   False
95208f3bd4b24b45b5120567057adffe            T   TRBV1-1*01      TRBJ1-1*01                              LWAARKFVRG  TRB 1   T   F   True
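
One way to see which columns are actually empty in an exported repertoire file is to count blank cells per column. The following is a minimal sketch using only the Python standard library; the function name `count_empty_fields` and the shortened sample rows are illustrative assumptions and are not part of immuneML or the airr package.

```python
import csv
import io


def count_empty_fields(tsv_text):
    """Return {column name: number of rows where that cell is blank or missing}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    empty = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in empty:
            # Short rows yield None for missing cells; treat those as blank too.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return empty


if __name__ == "__main__":
    # Hypothetical, shortened rows; a real file has many more columns.
    sample = "sequence_id\tsequence\trev_comp\tproductive\n6\t\t\tT\nabc\t\t\tT\n"
    for column, blanks in count_empty_fields(sample).items():
        print(column, blanks)
```

For a real file, the text could be read with `open(path, newline="").read()` and passed to the same function.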

This is my output. Any help is appreciated.

(immuneml_env) [immuneML]$ immune-ml-quickstart ./quickstart_results/
immuneML quickstart: generating a synthetic dataset...
2024-05-05 20:22:13.029352: Setting temporary cache path to quickstart_results/synthetic_dataset/result/cache
2024-05-05 20:22:13.029383: ImmuneML: parsing the specification...

2024-05-05 20:22:13.752929: Imported repertoire dataset my_synthetic_dataset with 100 examples.
2024-05-05 20:22:13.876557: Full specification is available at quickstart_results/synthetic_dataset/result/full_simulation_specs.yaml.

2024-05-05 20:22:13.876602: ImmuneML: starting the analysis...

2024-05-05 20:22:13.876629: Instruction 1/1 has started.
2024-05-05 20:22:15.137774: Instruction 1/1 has finished.
2024-05-05 20:22:15.151792: Generating HTML reports...
2024-05-05 20:22:15.194902: HTML reports are generated.
2024-05-05 20:22:15.195323: ImmuneML: finished analysis.

immuneML quickstart: finished generating a synthetic dataset.
immuneML quickstart: training a machine learning model...
2024-05-05 20:22:15.201168: Setting temporary cache path to quickstart_results/machine_learning_analysis/result/cache
2024-05-05 20:22:15.201184: ImmuneML: parsing the specification...

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 183, in load_sequence_dataframe
    df = alternative_load_func(filepath, params)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 159, in alternative_load_func
    df = airr.load_rearrangement(filename)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/airr/interface.py", line 103, in load_rearrangement
    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 164, in load_repertoire_as_object
    dataframe = ImportHelper.load_sequence_dataframe(filename, params, alternative_load_func)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 187, in load_sequence_dataframe
    raise Exception(f"{ex}\n\nImportHelper: an error occurred during dataset import while parsing the input file: {filepath}.\n"
Exception: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/rep_0.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=False, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 177, in load_repertoire_as_object
    raise RuntimeError(f"{ImportHelper.__name__}: error when importing file {metadata_row['filename']}.") from exception
RuntimeError: ImportHelper: error when importing file rep_0.tsv.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 60, in _parse_dataset
    dataset = import_cls.import_dataset(params, key)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 109, in import_dataset
    return ImportHelper.import_dataset(AIRRImport, params, dataset_name)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 49, in import_dataset
    dataset = ImportHelper.import_repertoire_dataset(import_class, processed_params, dataset_name)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 95, in import_repertoire_dataset
    repertoires = pool.starmap(ImportHelper.load_repertoire_as_object, arguments)
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: ImportHelper: error when importing file rep_0.tsv.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 70, in _parse_dataset
    raise Exception(f"{ex}\n\nAn error occurred while parsing the dataset {key}. See the log above for more details.")
Exception: ImportHelper: error when importing file rep_0.tsv.

An error occurred while parsing the dataset d1. See the log above for more details.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/bin/immune-ml-quickstart", line 11, in <module>
    sys.exit(main())
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 167, in main
    quickstart.run(sys.argv[1] if len(sys.argv) == 2 else None)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 160, in run
    app.run()
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 44, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 141, in parse
    def_parser_output, specs_defs = DefinitionParser.parse(workflow_specification, symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/definition_parsers/DefinitionParser.py", line 48, in parse
    symbol_table, specs_import = ImportParser.parse(specs, symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 27, in parse
    symbol_table = ImportParser._parse_dataset(key, workflow_specification[ImportParser.keyword][key], symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: ImportHelper: error when importing file rep_0.tsv.

An error occurred while parsing the dataset d1. See the log above for more details.

ImmuneMLParser: an error occurred during parsing in function _parse_dataset  with parameters: ('d1', {'format': 'AIRR', 'params': {'is_repertoire': True, 'path': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), 'paired': False, 'import_productive': True, 'import_with_stop_codon': False, 'import_out_of_frame': False, 'import_illegal_characters': False, 'region_type': 'IMGT_CDR3', 'separator': '\t', 'column_mapping': {'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, 'import_empty_nt_sequences': True, 'import_empty_aa_sequences': False, 'metadata_file': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), 'result_path': PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1')}}, SymbolTable(), PosixPath('quickstart_results/machine_learning_analysis/result')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

LonnekeScheffer commented 5 months ago

Hi kvegesan-stjude, thanks for reaching out! I ran the Quickstart myself locally and did not experience this issue. I think you are using an older version of immuneML. We have recently made a lot of major updates, including a different internal format for storing the files and changes to the simulation instruction, both affecting how the Quickstart works internally. Could you update to the latest version (3.0.0a4) and let me know if you are still experiencing this issue?

kvegesan-stjude commented 5 months ago

Thank you for the quick response. I have installed immuneML 3.0.0a4 in a fresh environment. The default version on pip and conda is 2.2.6.

(immuneml_env) [kvegesan@noderome105 immuneML]$ conda list|grep immune
# packages in environment at /home/kvegesan/.conda/envs/immuneml_env:
immuneml                  3.0.0a4                  pypi_0    pypi

I still get the same error. This is the output of log.txt

2024-05-09 10:43:55,943 ERROR: 

--- Exception in parse_dataset : ImportHelper: error when importing file c3178776d43f4b0d94983f8220fc7d3d.tsv: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

An error occurred while parsing the dataset d1. See the log above for more details.

This is the full log

(immuneml_env) [kvegesan@noderome105 immuneML]$ immune-ml-quickstart ./quickstart_results/ > log.txt
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 151, in load_sequence_dataframe
    df = alternative_load_func(filepath, params)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 155, in alternative_load_func
    df = airr.load_rearrangement(filename)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/airr/interface.py", line 103, in load_rearrangement
    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 130, in load_repertoire_as_object
    dataframe = ImportHelper.load_sequence_dataframe(filename, params, alternative_load_func)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 155, in load_sequence_dataframe
    raise Exception(
Exception: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 144, in load_repertoire_as_object
    raise RuntimeError(
RuntimeError: ImportHelper: error when importing file c3178776d43f4b0d94983f8220fc7d3d.tsv: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 59, in parse_dataset
    dataset = import_cls.import_dataset(params, key)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 105, in import_dataset
    return ImportHelper.import_dataset(AIRRImport, params, dataset_name)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 51, in import_dataset
    dataset = ImportHelper.import_repertoire_dataset(import_class, processed_params, dataset_name)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 98, in import_repertoire_dataset
    repertoires = pool.starmap(ImportHelper.load_repertoire_as_object, arguments)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: ImportHelper: error when importing file c3178776d43f4b0d94983f8220fc7d3d.tsv: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 68, in parse_dataset
    raise Exception(f"{ex}\n\nAn error occurred while parsing the dataset {key}. See the log above for more details.")
Exception: ImportHelper: error when importing file c3178776d43f4b0d94983f8220fc7d3d.tsv: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

An error occurred while parsing the dataset d1. See the log above for more details.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/bin/immune-ml-quickstart", line 8, in <module>
    sys.exit(main())
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 198, in main
    quickstart.run(sys.argv[1] if len(sys.argv) == 2 else None)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 191, in run
    app.run()
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 45, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 141, in parse
    def_parser_output, specs_defs = DefinitionParser.parse(workflow_specification, symbol_table, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/definition_parsers/DefinitionParser.py", line 51, in parse
    symbol_table, new_specs = DefinitionParser._call_if_exists(parser.keyword, parser.parse, specs,
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/definition_parsers/DefinitionParser.py", line 61, in _call_if_exists
    return method(specs[key], symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 25, in parse
    dataset = ImportParser.parse_dataset(key, workflow_specification[key], path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: ImportHelper: error when importing file c3178776d43f4b0d94983f8220fc7d3d.tsv: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/c3178776d43f4b0d94983f8220fc7d3d.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unknown_productivity=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=True, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

An error occurred while parsing the dataset d1. See the log above for more details.

ImmuneMLParser: an error occurred during parsing in function parse_dataset  with parameters: ('d1', {'format': 'AIRR', 'params': {'is_repertoire': True, 'path': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), 'paired': False, 'import_productive': True, 'import_unknown_productivity': True, 'import_with_stop_codon': False, 'import_out_of_frame': False, 'import_illegal_characters': True, 'region_type': 'IMGT_CDR3', 'separator': '\t', 'column_mapping': {'junction': 'sequence', 'junction_aa': 'sequence_aa', 'locus': 'chain'}, 'import_empty_nt_sequences': True, 'import_empty_aa_sequences': False, 'metadata_file': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), 'result_path': PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1')}}, PosixPath('quickstart_results/machine_learning_analysis/result')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

LonnekeScheffer commented 5 months ago

Thanks for the info! I'll have a bit more time to look into the details tomorrow. It's a little challenging to debug since I'm not experiencing the same issue on my side with this version, so it would help me a lot if you could share the following with me:

Also, what operating system are you using: Windows, Linux, or macOS?

kvegesan-stjude commented 5 months ago

These are the packages I have in the environment. env.txt

This is the zipped file of the quickstart analysis. There is one log file, but I'm not sure if there are others. quickstart_results.zip

I'm on Red Hat Enterprise Linux 8. This is my organization's computing cluster.

LonnekeScheffer commented 4 months ago

Hi kvegesan-stjude, I haven't been able to reproduce the issue yet, but I suspect it may be triggered by the fact that simulated sequences in repertoires have an unknown status for the fields "productive"/"vj_in_frame" when exported to AIRR format (resulting in a mix of "True" and "nan" values). I haven't yet pinpointed why this results in an error for you and not for me, but I'm almost certain it's due to some dependency version difference, and I will need to spend a bit more time next week to figure this out.

To help me confirm whether this is the root of the issue, would you be able to run the following 3 tiny examples and let me know which of them work or fail, and with what errors? debugging_example.zip

These example runs simply import and export a dataset consisting of 1 tiny repertoire each.

LonnekeScheffer commented 4 months ago

Could you try reinstalling the airr dependency? We are using the same version (1.3.1), but your traceback seems to indicate that your airr installation internally tries to call pandas. In my airr installation it's not calling pandas, although there are some commented-out lines which do so. I wonder if there may be multiple airr packages installed simultaneously (try running "pip uninstall airr" several times).
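
One way to check for duplicate installations before uninstalling is to ask Python which airr distributions it can see and which file `import airr` would actually load. The snippet below is only a rough sketch using the standard library (`importlib.metadata`, available from Python 3.8); the helper names `find_distributions` and `import_origin` are made up for this example and are not part of immuneML or airr.

```python
import importlib.util
from importlib import metadata  # standard library since Python 3.8


def find_distributions(name):
    """Return (version, location) for every installed distribution called `name`.

    Hypothetical helper for illustration; not part of immuneML or airr.
    """
    wanted = name.lower().replace("_", "-")
    found = []
    for dist in metadata.distributions():
        dist_name = dist.metadata.get("Name")
        if dist_name and dist_name.lower().replace("_", "-") == wanted:
            # locate_file("") resolves to the directory the distribution lives in
            found.append((dist.version, str(dist.locate_file(""))))
    return found


def import_origin(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for version, location in find_distributions("airr"):
        print(f"airr {version} installed under {location}")
    print("import airr would load:", import_origin("airr"))
```

If more than one entry is printed, or the import path points somewhere unexpected (for example a user-level site-packages outside the conda environment), that would be consistent with several airr copies shadowing each other.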

kvegesan-stjude commented 4 months ago

Removing airr and reinstalling it did the trick. I was able to run the quickstart.

I also ran the 3 examples. Spec1 failed due to a type mismatch error:

2024-05-10 11:41:15.597605: Running immuneML version 3.0.0a4

2024-05-10 11:41:15.597962: Setting temporary cache path to spec1/cache
2024-05-10 11:41:15.598005: immuneML: parsing the specification...

2024-05-10 11:41:16.126492:
Imported repertoire dataset my_dataset:
Example count: 1
Labels: {'my_signal', 'sim_item', 'type_dict'}
Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 67, in parse_instruction
    instruction_object = parser.parse(key, instruction, symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/instruction_parsers/DatasetExportParser.py", line 65, in parse
    ParameterValidator.assert_type_and_value(instruction["number_of_processes"], int, location, "number_of_processes", 1)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ParameterValidator.py", line 42, in assert_type_and_value
    assert isinstance(value, parameter_type),  f"{base_mssg}It has to be of type {type_name}, but is now of type {type(value).__name__}."
AssertionError: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/bin/immune-ml", line 8, in <module>
    sys.exit(main())
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 90, in main
    run_immuneML(namespace)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 75, in run_immuneML
    app.run()
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 45, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 142, in parse
    symbol_table, specs_instructions = InstructionParser.parse(def_parser_output, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 50, in parse
    InstructionParser.parse_instruction(key, specification[InstructionParser.keyword][key], symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

ImmuneMLParser: an error occurred during parsing in function parse_instruction  with parameters: ('export_dataset', {'type': 'DatasetExport', 'datasets': ['my_dataset'], 'number_of_processes': None, 'export_formats': ['AIRR']}, SymbolTable(), PosixPath('spec1')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

Spec2 also failed with the same error:

(immuneml_env) [kvegesan@noderome105 debugging_example]$ immune-ml spec2.yaml spec2/
2024-05-10 11:42:28.438035: Running immuneML version 3.0.0a4

2024-05-10 11:42:28.438386: Setting temporary cache path to spec2/cache
2024-05-10 11:42:28.438432: immuneML: parsing the specification...

2024-05-10 11:42:28.777952:
Imported repertoire dataset my_dataset:
Example count: 1
Labels: {'my_signal', 'sim_item', 'type_dict'}
Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 67, in parse_instruction
    instruction_object = parser.parse(key, instruction, symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/instruction_parsers/DatasetExportParser.py", line 65, in parse
    ParameterValidator.assert_type_and_value(instruction["number_of_processes"], int, location, "number_of_processes", 1)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ParameterValidator.py", line 42, in assert_type_and_value
    assert isinstance(value, parameter_type),  f"{base_mssg}It has to be of type {type_name}, but is now of type {type(value).__name__}."
AssertionError: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/bin/immune-ml", line 8, in <module>
    sys.exit(main())
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 90, in main
    run_immuneML(namespace)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 75, in run_immuneML
    app.run()
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 45, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 142, in parse
    symbol_table, specs_instructions = InstructionParser.parse(def_parser_output, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 50, in parse
    InstructionParser.parse_instruction(key, specification[InstructionParser.keyword][key], symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

ImmuneMLParser: an error occurred during parsing in function parse_instruction  with parameters: ('export_dataset', {'type': 'DatasetExport', 'datasets': ['my_dataset'], 'number_of_processes': None, 'export_formats': ['AIRR']}, SymbolTable(), PosixPath('spec2')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

Spec3 also had the same error:

(immuneml_env) [kvegesan@noderome105 debugging_example]$ immune-ml spec3.yaml spec3/
2024-05-10 11:44:37.764905: Running immuneML version 3.0.0a4

2024-05-10 11:44:37.765271: Setting temporary cache path to spec3/cache
2024-05-10 11:44:37.765318: immuneML: parsing the specification...

2024-05-10 11:44:38.129818:
Imported repertoire dataset my_dataset:
Example count: 1
Labels: {'type_dict', 'my_signal', 'sim_item'}
Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 67, in parse_instruction
    instruction_object = parser.parse(key, instruction, symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/instruction_parsers/DatasetExportParser.py", line 65, in parse
    ParameterValidator.assert_type_and_value(instruction["number_of_processes"], int, location, "number_of_processes", 1)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ParameterValidator.py", line 42, in assert_type_and_value
    assert isinstance(value, parameter_type),  f"{base_mssg}It has to be of type {type_name}, but is now of type {type(value).__name__}."
AssertionError: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kvegesan/.conda/envs/immuneml_env/bin/immune-ml", line 8, in <module>
    sys.exit(main())
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 90, in main
    run_immuneML(namespace)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 75, in run_immuneML
    app.run()
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 45, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 142, in parse
    symbol_table, specs_instructions = InstructionParser.parse(def_parser_output, result_path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/InstructionParser.py", line 50, in parse
    InstructionParser.parse_instruction(key, specification[InstructionParser.keyword][key], symbol_table, path)
  File "/home/kvegesan/.conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: DatasetExportParser: None is not a valid value for parameter number_of_processes. It has to be of type int, but is now of type NoneType.

ImmuneMLParser: an error occurred during parsing in function parse_instruction  with parameters: ('export_dataset', {'type': 'DatasetExport', 'datasets': ['my_dataset'], 'number_of_processes': None, 'export_formats': ['AIRR']}, SymbolTable(), PosixPath('spec3')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

LonnekeScheffer commented 4 months ago

Good to hear that reinstalling the airr dependency resolved the issue. I don't think there is any bug that needs to be fixed on the immuneML side in this case.

Apologies for the confusion about the YAML examples. I didn't test-run those, and it looks like I forgot the number_of_processes parameter (I thought there was a default value). If you add that parameter (example here: https://docs.immuneml.uio.no/latest/yaml_specs/instructions.html#datasetexport), I believe all those YAMLs should run without issues. Could you give it a try?
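
To illustrate what went wrong in the three runs above: the parsed instruction in the error message has `'number_of_processes': None`, and the parser asserts that the value is an `int`. The sketch below mimics that check in plain Python. `check_int_param` is a hypothetical stand-in (not the real `ParameterValidator`), the lower bound of 1 is an assumption based on the last argument visible in the traceback, and 4 is an arbitrary example value.

```python
def check_int_param(value, name, min_inclusive=None):
    """Hypothetical stand-in for the type check that rejected None above."""
    if not isinstance(value, int):
        raise TypeError(
            f"{name} has to be of type int, but is now of type {type(value).__name__}."
        )
    # Assumed lower bound; the traceback passes 1 as the final argument.
    if min_inclusive is not None and value < min_inclusive:
        raise ValueError(f"{name} has to be at least {min_inclusive}, but is {value}.")
    return value


# Instruction as it appears in the error message: the parameter was never set.
broken = {
    "type": "DatasetExport",
    "datasets": ["my_dataset"],
    "number_of_processes": None,
    "export_formats": ["AIRR"],
}

# Same instruction with the parameter given explicitly (4 is an arbitrary choice).
fixed = dict(broken, number_of_processes=4)

check_int_param(fixed["number_of_processes"], "number_of_processes", min_inclusive=1)
```

In the YAML specification, this corresponds to adding a `number_of_processes` key with an integer value to the `export_dataset` instruction, as described in the linked documentation.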