ImmuneML: parsing the specification...

Genokarma commented 7 months ago

I attempted to train a model with ImmunoSEQRearrangement data using ImmuneML. The process has been running for more than 5 hours without producing any further output. It seems to be stuck, and I'm unable to determine the status of the task.

Logs in docker:
2024-01-15 18:14:11 2024-01-15 12:44:11.790241: Setting temporary cache path to data/results/cache
2024-01-15 18:14:11 2024-01-15 12:44:11.791161: ImmuneML: parsing the specification...
2024-01-15 18:14:11

Log.txt in results folder:
2024-01-15 12:44:11,789 INFO: Setting temporary cache path to data/results/cache
2024-01-15 12:44:11,790 INFO: ImmuneML: parsing the specification...

2024-01-15 12:44:11,803 INFO: --- Entering: parse with parameters ({}, SymbolTable())
2024-01-15 12:44:11,803 INFO: --- Exiting: parse
2024-01-15 12:44:11,803 INFO: --- Entering: parse_encoder with parameters ('encoding_1', {'KmerFrequency': {'k': 3, 'reads': 'all', 'sequence_encoding': 'CONTINUOUS_KMER'}})
2024-01-15 12:44:11,970 INFO: --- Exiting: parse_encoder
2024-01-15 12:44:11,970 INFO: --- Entering: _parse_ml_method with parameters ('k_nearest_neighbors', {'KNN': {'n_neighbors': [3, 5, 7], 'show_warnings': False}, 'model_selection_cv': True, 'model_selection_n_folds': 5})

LonnekeScheffer commented 7 months ago

Hi Genokarma,

Thanks for reaching out! You mentioned the process has been running for 5 hours, but depending on the size of the dataset and specific methods and parameters used, some processes may be very computationally expensive and can indeed run for a long time. Since I don't have more information on the analysis you are trying to run, I cannot help you determine what the cause of this long running time may be. If you would like my input on that, you're welcome to share the YAML analysis specification with me.

As for debugging the problem: You could try running a small example to ensure everything works, for example using only a small number of repertoires or sequences. There is also an automatic test instruction which can be run, to check if the immuneML installation works at all: https://docs.immuneml.uio.no/latest/installation/install_with_package_manager.html#testing-immuneml

Since this issue as of now does not point towards a concrete bug, I will close it for now. Feel free to reach out on contact@immuneml.uio.no if you have more questions.

Genokarma commented 7 months ago

Hi LonnekeScheffer,

It's been more than 12 hours and it is still showing the same thing. My sample details are: 15 breast cancer study samples and 30 control samples. I have attached my YAML specifications here.

1. I converted the ImmunoSEQRearrangement data into ImmuneML format. Script for conversion:

    definitions:
      datasets:
        dataset:
          format: ImmunoSEQSample
          params:
            is_repertoire: true
            metadata_file: /data/metadata2.csv
            path: /data/DataS2/
            region_type: IMGT_CDR3
            result_path: /data/
    instructions:
      my_dataset_generation_instruction:
        datasets:
        - dataset
        export_formats:
        - ImmuneML
        type: DatasetExport

2. I used the following script to train the model:

    definitions:
      datasets:
        dataset:
          format: ImmuneML
          params:
            path: /data/
            result_path: /data/results
      encodings:
        encoding_1:
          KmerFrequency:
            k: 3
            reads: all
            sequence_encoding: CONTINUOUS_KMER
      ml_methods:
        k_nearest_neighbors:
          KNN:
            n_neighbors:
            - 3
            - 5
            - 7
            show_warnings: true
          model_selection_cv: true
          model_selection_n_folds: 5
        logistic_regression:
          LogisticRegression:
            C:
            - 0.01
            - 0.1
            - 1
            - 10
            - 100
            class_weight:
            - balanced
            penalty:
            - l1
            show_warnings: true
          model_selection_cv: true
          model_selection_n_folds: 5
        random_forest:
          RandomForestClassifier:
            class_weight:
            - balanced
            n_estimators:
            - 10
            - 50
            - 100
            show_warnings: true
          model_selection_cv: true
          model_selection_n_folds: 5
        support_vector_machine:
          SVC:
            C:
            - 0.01
            - 0.1
            - 1
            - 10
            - 100
            class_weight:
            - balanced
            dual: false
            penalty:
            - l1
            show_warnings: true
          model_selection_cv: true
          model_selection_n_folds: 5
      motifs: {}
      preprocessing_sequences: {}
      reports:
        benchmark:
          MLSettingsPerformance:
            name: benchmark
            single_axis_labels: false
            x_label_position: -0.12
            y_label_position: -0.08
        coefficients:
          Coefficients:
            coefs_to_plot:
            - N_LARGEST
            n_largest:
            - 25
            name: coefficients
      signals: {}
      simulations: {}
    instructions:
      inst1:
        assessment:
          reports:
            models:
            - coefficients
          split_count: 5
          split_strategy: random
          training_percentage: 0.7
        dataset: dataset
        labels:
        - signal_disease
        metrics: []
        number_of_processes: 10
        optimization_metric: accuracy
        refit_optimal_model: true
        reports:
        - benchmark
        selection:
          split_count: 1
          split_strategy: random
          training_percentage: 0.7
        settings:
        - encoding: encoding_1
          ml_method: random_forest
          preprocessing: null
        - encoding: encoding_1
          ml_method: logistic_regression
          preprocessing: null
        - encoding: encoding_1
          ml_method: support_vector_machine
          preprocessing: null
        - encoding: encoding_1
          ml_method: k_nearest_neighbors
          preprocessing: null
        strategy: GridSearch
        type: TrainMLModel
    output:
      format: HTML

LonnekeScheffer commented 7 months ago

Hi Genokarma,

I don't think there is necessarily any reason why this should not work. The dataset does not seem extremely large. For debugging purposes, I recommend the following steps:

1. Kill the existing run.
2. Make sure you have the latest version of immuneML installed.
3. Test if the immuneML installation works correctly according to the documentation: https://docs.immuneml.uio.no/latest/installation/install_with_package_manager.html#testing-immuneml
4. Try running the quickstart example: https://docs.immuneml.uio.no/latest/quickstart/cli_yaml.html
5. Try to run the TrainMLModel instruction with a minimal example, for instance, only running logistic regression, or perhaps a smaller dataset as well.
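
For the "smaller dataset" step, one option is to trim the metadata file so that only a few repertoires per class remain. The following is a minimal sketch, not part of immuneML: the function name `subset_metadata` and the output file name are hypothetical, the label column `signal_disease` is taken from the YAML shared above, and it assumes that only the repertoires listed in the metadata file are imported.

```python
import csv
from collections import defaultdict


def subset_metadata(in_path, out_path, label="signal_disease", per_class=3):
    """Copy a metadata CSV, keeping at most `per_class` rows per label value.

    `label` is the metadata column holding the class (assumed here to be
    `signal_disease`, as in the shared YAML). Returns the kept row counts.
    """
    kept = defaultdict(int)
    rows = []
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        for row in reader:
            value = row[label]
            # Keep the first `per_class` rows seen for each label value.
            if kept[value] < per_class:
                kept[value] += 1
                rows.append(row)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return dict(kept)


# Hypothetical usage: write a reduced copy next to the original file,
# then point `metadata_file` in the YAML at the reduced copy.
# subset_metadata("/data/metadata2.csv", "/data/metadata_small.csv")
```

If the reduced run finishes quickly, the full run is more likely slow than stuck.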

By following these steps, we can pinpoint where the issue might be (e.g., if there is something wrong with the installation, the computer setup, or the dataset). I don't believe there is a bug in immuneML that is causing this, since everything runs like normal on our end, but if we do find such indication we will of course fix it as soon as possible.

As a side note, it is not necessary to convert the dataset to immuneML format first (you can simply use the ImmunoSEQSample import in the same yaml as where the training happens), although it should work like this as well. Also, you have set the number of processes to 10, which may be alright, but please make sure the system you are running this on supports that number of CPUs (specifying too many processes can also slow down the runtime).
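
To check the point about `number_of_processes`, a quick way to see how many CPUs Python reports on the machine is sketched below. The helper name `safe_process_count` is hypothetical, and the value from `os.cpu_count()` may not reflect CPU limits placed on a Docker container, so treat the result as an upper bound rather than a recommendation.

```python
import os


def safe_process_count(requested):
    """Cap a requested process count at the number of CPUs Python reports."""
    # os.cpu_count() can return None when the count cannot be determined.
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


# Example: compare the value from the YAML (10) with what the system reports.
print("CPUs reported:", os.cpu_count())
print("Suggested number_of_processes:", safe_process_count(10))
```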

Genokarma commented 7 months ago

Hi LonnekeScheffer, I want to express my gratitude for your assistance; your time and efforts are highly appreciated. For your reference, I've attached my dataset files and YAML script. I am utilizing a Docker container, and the command details are provided in the attached README file. Link for dataset and yaml file is https://github.com/Genokarma/ImmuneMLTest

I've encountered an issue while running the process on two different systems. On my macOS system with 16 CPUs and 16GB RAM, the process gets stuck at "parsing the specification." On the Linux system with 48GB RAM, it encounters an issue with encoding (encoding 1...). It's been more than 24 hours, but there has been no progress.

I have attempted to troubleshoot the problem on both systems without success. Could you please attempt to execute the process or provide any suggestions to address this issue? Your assistance in resolving this issue is invaluable.

LonnekeScheffer commented 7 months ago

Dear GenoKarma,

Thanks for sharing the test dataset and YAML. I'm currently very busy (in preparation of my PhD defence), and I will have more time available in the last week of January. In the meantime, it would be helpful to try to run the test and Quickstart examples as mentioned in my previous comments. These examples are small and known to take only a short time to run, and can help us find an indication of whether immuneML is actually getting "stuck" on your system, or simply takes a long time to run.
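
One way to see where a run currently is, is to look for steps in log.txt that were entered but never exited. The sketch below is only an illustration based on the log lines posted earlier in this thread: it assumes one log entry per line in the "--- Entering: / --- Exiting:" format shown there, and the function name `unfinished_steps` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-01-15 12:44:11,803 INFO: --- Entering: parse_encoder with parameters (...)
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO: --- (Entering|Exiting): (\w+)"
)


def unfinished_steps(log_lines):
    """Return (step_name, entered_at) for steps entered but not yet exited."""
    open_steps = []
    for line in log_lines:
        match = ENTRY.match(line)
        if not match:
            continue
        stamp, event, name = match.groups()
        if event == "Entering":
            entered_at = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            open_steps.append((name, entered_at))
        else:
            # Close the most recent "Entering" with the same step name.
            for i in range(len(open_steps) - 1, -1, -1):
                if open_steps[i][0] == name:
                    del open_steps[i]
                    break
    return open_steps


# Hypothetical usage with the log file in the results folder:
# with open("data/results/log.txt") as f:
#     for name, entered_at in unfinished_steps(f):
#         print(name, "running since", entered_at)
```

A step that stays open while the log stops growing for hours suggests a hang or a very slow step at that point; it does not by itself tell the two apart.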

Genokarma commented 7 months ago

Thank you for your prompt response and for sharing the information. I completely understand that you're currently occupied with your PhD defense preparations. Wishing you the best of luck with your PhD defense.

I used the demo data during the installation process. In the meantime, I took your advice and ran the Quickstart example again as per your previous suggestions. I'm pleased to inform you that the quickstart/demo went smoothly, and the process completed successfully. I have attached the screenshots for your reference.

I look forward to connecting with you again in the last week of January.

LonnekeScheffer commented 7 months ago

Dear GenoKarma,

My apologies for the delay, it was a busy period. But I have good news: I finally managed to take a deeper look into this issue and implement a solution. I cloned your GitHub repository and tried to reproduce your immuneML run. I indeed discovered two issues, one bug and one performance issue, both of which were introduced during our recent large refactoring for the alpha version of immuneML 3.

Firstly, there was a bug in KmerFrequencyEncoder due to some changed variable names. If you encountered this bug, you would run into the following error message:

--- Exception in _encode_examples : 'SequenceMetadata' object has no attribute 'count'

This bug was solved in the latest version of immuneML. I originally thought that this bug might have been the culprit in your analysis. But when I tried to run immuneML on your entire dataset, I found that I did not even reach the error above, because immuneML was taking a long time at an earlier step in the encoding process. I was able to locate and fix that issue, so the KmerFrequencyEncoder should now be a lot faster. With the updated code, encoding your dataset with 4 parallel processes took 8 minutes on my computer.
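
For readers unfamiliar with the encoding step discussed here, the sketch below illustrates the general idea of a continuous k-mer frequency encoding with k = 3, as configured in the YAML above. It is not immuneML's implementation: the function name `kmer_frequencies` is hypothetical, read counts are ignored, and the normalization (simple relative frequency) is chosen only for illustration.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Count overlapping (continuous) k-mers across sequences and normalize."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of length k over the sequence, one position at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Two toy CDR3-like sequences: each contributes three 3-mers.
freqs = kmer_frequencies(["CASSL", "CASSF"], k=3)
print(sorted(freqs))  # ['ASS', 'CAS', 'SSF', 'SSL']
```

The work grows with the total length of all sequences in a repertoire, which is why this step can dominate the runtime on large repertoires.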

So in conclusion, if you reinstall the latest version of immuneML (version 3.0.0a3), encoding will be a lot faster. Since immuneML 3 is still in its 'alpha' version, there have been major refactorings and ongoing developments which have not yet been thoroughly tested. We therefore highly appreciate the user feedback, and I will try my best to resolve issues as soon as I can. However, if some issue is halting your work, it is always possible to downgrade to the latest stable immuneML release (v2.2.5).

All the best, Lonneke

Genokarma commented 6 months ago

Hi Lonneke,

Hope your viva went well! Thank you for reaching out.

Yes, I have tried your suggestions with the newer as well as the stable version(s). However, I am not able to run the analysis, as the newer version gives a different error. I have attached log.txt for your reference. Have you prepared a Docker image for the newer version of ImmuneML? If so, please share it with me.

log.txt

Once again, thank you. With regards.

uio-bmi / immuneML

ImmuneML: parsing the specification... #170