Test data does not get past step 2 and the environment does not detect some libraries #187

Open MauriAndresMU1313 opened 2 hours ago

MauriAndresMU1313 commented 2 hours ago

I'm trying to run the test with the following command:

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed test_input/hg38.2bit test_input/mm10.2bit --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --nc TOGA/nextflow_config_files --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms

This is the output:

#### Initiating TOGA class ####
# python interpreter path: /home/mmora30/anaconda3/bin/python3
# python interpreter version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
Version 1.1.8.dev
Commit: c4b5fafa15a94e00fd17eea621af0686ac23041d
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.1
* xgboost: 2.0.1
! scikit-learn: Not installed - will try to install
* joblib: 1.3.2
* h5py: 3.10.0
Calling cmd:

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
/localData/workspace_mm/toga/TOGA/./modules/chain_score_filter /localData/workspace_mm/toga/TOGA/test_input/hg38.mm10.chr11.chain 15000 > /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain

Command finished with exit code 0.
Writing isoforms data for 3674 transcripts.
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 66 sequences in /localData/workspace_mm/toga/TOGA/test_input/mm10.2bit
Saving output to /localData/workspace_mm/toga/TOGA/test
Arguments stored in /localData/workspace_mm/toga/TOGA/test/project_args.json

#### STEP 0: making chain and bed file indexes

Started chain indexing...
chain_bst_index: indexing 79183 chains
chain_bst_index: Saved chain /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain index to /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.bst
Started bed file indexing...
bed_hdf5_index: indexed 3674 transcripts

#### STEP 1: Generate extract chain features jobs

Calling cmd:
/localData/workspace_mm/toga/TOGA/./split_chain_jobs.py /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5 --log_file /localData/workspace_mm/toga/TOGA/test/toga_2024_10_22_at_14_13.log --parallel_logs_dir /localData/workspace_mm/toga/TOGA/test/temp_logs --jobs_num 100 --jobs /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs --jobs_file /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined --results_dir /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results --rejected /localData/workspace_mm/toga/TOGA/test/temp/rejected/SPLIT_CHAIN_REJ.txt

split_chain_jobs: Use bed file /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed and chain file /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
split_chain jobs: the run data overview is:

* vv: False
* jobs: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs
* results_dir: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results
* errors_dir: None
* chain_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
* bed_file: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed
* index_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain_ID_position
* job_size: None
* jobs_num: 100
* bed_index: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5
* jobs_file: /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined
* ref: hg38
* on_cluster: True
split_chain_jobs: searching for intersections between reference transcripts and chains
split_chain_jobs: chains-to-transcripts dict contains 50186 records
split_chain_jobs: skipped 0 transcripts that do not intersect any chain
split_chain_jobs: preparing 50186 commands
split_chain_jobs: command size of 502 for each cluster job
split_chain_jobs: results in 100 cluster jobs
split_chain_jobs: estimated time: 0:00:00.906445
Command finished with exit code 0.

#### STEP 2: Extract chain features: parallel step

Extracting chain features, project name: chain_feats__test_at_1729606404
Project path: /localData/workspace_mm/toga/TOGA/./nextflow_logs/chain_feats__test_at_1729606404
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /localData/workspace_mm/toga/TOGA/execute_joblist.nf --joblist /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined -c /home/mmora30/localData/workspace_mm/toga/TOGA/nextflow_config_files/extract_chain_features_queue.nf

The process does not get past STEP 2. The only messages that might point to a problem are the following:

# Python package versions
* twobitreader: unknown version
! scikit-learn: Not installed - will try to install

I'm working in a conda environment where all the dependencies were installed, and I can verify that they are present.

So I do not know why TOGA is not detecting those libraries. Maybe this is the source of the issue. Any ideas on how to proceed? I'm working in an Ubuntu VM.

What can I do to solve the issue?

MichaelHiller commented 1 hour ago

Hmm, I can only see that scikit-learn is not properly installed. Could you please check this?

@kirilenkobm Do you see another problem in the log?

MauriAndresMU1313 commented 1 hour ago

Thank you for your quick response! In my conda environment, that package is listed as:

scikit-learn              1.3.2           py312h394d371_2    conda-forge

The same goes for twobitreader, which TOGA reports with an unknown version:

twobitreader              3.1.7              pyh864c0ab_1    bioconda

So I do not understand. What do you think about installing them with pip instead?

Is there another directory where I can find any other clue about what is happening?
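
One way to look for a clue is to ask the interpreter itself what it can see. The sketch below is not part of TOGA and says nothing about how TOGA performs its own check; it only prints the interpreter path (to compare against the "python interpreter path" line in the TOGA log) and tries both the installed distribution name and the import name of each package. The `PACKAGES` mapping and the `probe` helper are illustrative names; note that scikit-learn is imported as `sklearn`.

```python
import importlib
import importlib.metadata
import sys

# Distribution name (as listed by conda/pip) -> module name used in `import`.
# This mapping is written by hand for the two packages flagged in the log.
PACKAGES = {
    "scikit-learn": "sklearn",
    "twobitreader": "twobitreader",
}


def probe(dist_name, module_name):
    """Return (version, error): version is None if no metadata is found,
    error is None if the module imports cleanly, else a short message."""
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    try:
        importlib.import_module(module_name)
        error = None
    except Exception as exc:  # an import can fail for reasons other than ImportError
        error = f"{type(exc).__name__}: {exc}"
    return version, error


if __name__ == "__main__":
    # Should match the interpreter path printed at the top of the TOGA log.
    print("interpreter:", sys.executable)
    for dist, module in PACKAGES.items():
        version, error = probe(dist, module)
        print(f"{dist}: version={version!r} import_error={error!r}")
```

Running this with the same `python3` that TOGA reports makes it easier to tell a package that is missing from that environment apart from one that is present but not recognized by the version check.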

MichaelHiller commented 1 hour ago

Those should be the right versions of both packages. Let's see what Bogdan suggests.

kirilenkobm commented 1 hour ago


It seems the env is correct, but I suspect it may be a Nextflow issue (if so, I am sorry for the misleading logs). Could you please try calling

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed test_input/hg38.2bit test_input/mm10.2bit --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms

which is essentially the same command, but without the '--nc' parameter?

MauriAndresMU1313 commented 1 hour ago

Thank you for your suggestion. Unfortunately, it looks like nothing has changed:

#### Initiating TOGA class ####
# python interpreter path: /home/mmora30/anaconda3/envs/toga/bin/python3
# python interpreter version: 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Version 1.1.8.dev
Commit: c4b5fafa15a94e00fd17eea621af0686ac23041d
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.4
* xgboost: 2.1.1
! scikit-learn: Not installed - will try to install
* joblib: 1.4.2
* h5py: 3.10.0
Calling cmd:

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
/localData/workspace_mm/toga/TOGA/./modules/chain_score_filter test_input/hg38.mm10.chr11.chain 15000 > /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain

Command finished with exit code 0.
Writing isoforms data for 3674 transcripts.
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 66 sequences in /localData/workspace_mm/toga/TOGA/test_input/mm10.2bit
Saving output to /localData/workspace_mm/toga/TOGA/test
Arguments stored in /localData/workspace_mm/toga/TOGA/test/project_args.json

#### STEP 0: making chain and bed file indexes

Started chain indexing...
chain_bst_index: indexing 79183 chains
chain_bst_index: Saved chain /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain index to /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.bst
Started bed file indexing...
bed_hdf5_index: indexed 3674 transcripts

#### STEP 1: Generate extract chain features jobs

Calling cmd:
/localData/workspace_mm/toga/TOGA/./split_chain_jobs.py /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5 --log_file /localData/workspace_mm/toga/TOGA/test/toga_2024_10_22_at_15_01.log --parallel_logs_dir /localData/workspace_mm/toga/TOGA/test/temp_logs --jobs_num 100 --jobs /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs --jobs_file /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined --results_dir /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results --rejected /localData/workspace_mm/toga/TOGA/test/temp/rejected/SPLIT_CHAIN_REJ.txt

split_chain_jobs: Use bed file /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed and chain file /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
split_chain jobs: the run data overview is:

* vv: False
* jobs: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs
* results_dir: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results
* errors_dir: None
* chain_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
* bed_file: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed
* index_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain_ID_position
* job_size: None
* jobs_num: 100
* bed_index: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5
* jobs_file: /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined
* ref: hg38
* on_cluster: True
split_chain_jobs: searching for intersections between reference transcripts and chains
split_chain_jobs: chains-to-transcripts dict contains 50186 records
split_chain_jobs: skipped 0 transcripts that do not intersect any chain
split_chain_jobs: preparing 50186 commands
split_chain_jobs: command size of 502 for each cluster job
split_chain_jobs: results in 100 cluster jobs
split_chain_jobs: estimated time: 0:00:01.109042
Command finished with exit code 0.

#### STEP 2: Extract chain features: parallel step

Extracting chain features, project name: chain_feats__test_at_1729609301
Project path: /localData/workspace_mm/toga/TOGA/./nextflow_logs/chain_feats__test_at_1729609301
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /localData/workspace_mm/toga/TOGA/execute_joblist.nf --joblist /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined

Do you have another suggestion? Is there some file that could give more clues about what is going on? Or could it be because I'm running locally and not using a scheduler like Slurm? Just an idea.
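
As a starting point for finding such a file, here is a minimal sketch (not part of TOGA) that checks whether a `nextflow` executable is on `PATH` and prints the tail of the most recently modified log files in a directory. Nextflow normally writes a `.nextflow.log` file in the directory it was launched from; whether anything useful ends up in the "Project path" or the `--parallel_logs_dir` shown in the log above is an assumption to verify. The helper names `tail` and `newest_logs` are made up for this example.

```python
import shutil
import sys
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-n:]


def newest_logs(directory, limit=3):
    """Return up to `limit` files whose name contains '.log', newest first."""
    root = Path(directory)
    if not root.is_dir():
        return []
    logs = [p for p in root.iterdir() if p.is_file() and ".log" in p.name]
    logs.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return logs[:limit]


if __name__ == "__main__":
    # None here means no `nextflow` executable was found on PATH.
    print("nextflow executable:", shutil.which("nextflow"))
    # Directory to inspect: first command-line argument, else the current one.
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for log in newest_logs(directory):
        print(f"==> {log}")
        print("\n".join(tail(log)))
```

It can be run from the directory where `./toga.py` was launched, or given one of the directories named in the log as its argument.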