hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

Test-data does not pass the step 2 and environment does not detect some libraries #187

Open MauriAndresMU1313 opened 2 hours ago

MauriAndresMU1313 commented 2 hours ago

I'm trying to complete the test run with the following command:

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed test_input/hg38.2bit test_input/mm10.2bit --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --nc TOGA/nextflow_config_files --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms

This is the output:

#### Initiating TOGA class ####
# python interpreter path: /home/mmora30/anaconda3/bin/python3
# python interpreter version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
Version 1.1.8.dev
Commit: c4b5fafa15a94e00fd17eea621af0686ac23041d
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.1
* xgboost: 2.0.1
! scikit-learn: Not installed - will try to install
* joblib: 1.3.2
* h5py: 3.10.0
Calling cmd:
/localData/workspace_mm/toga/TOGA/./configure.sh

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
/localData/workspace_mm/toga/TOGA/./modules/chain_score_filter /localData/workspace_mm/toga/TOGA/test_input/hg38.mm10.chr11.chain 15000 > /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain

Command finished with exit code 0.
Writing isoforms data for 3674 transcripts.
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 66 sequences in /localData/workspace_mm/toga/TOGA/test_input/mm10.2bit
Saving output to /localData/workspace_mm/toga/TOGA/test
Arguments stored in /localData/workspace_mm/toga/TOGA/test/project_args.json

#### STEP 0: making chain and bed file indexes

Started chain indexing...
chain_bst_index: indexing 79183 chains
chain_bst_index: Saved chain /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain index to /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.bst
Started bed file indexing...
bed_hdf5_index: indexed 3674 transcripts

#### STEP 1: Generate extract chain features jobs

Calling cmd:
/localData/workspace_mm/toga/TOGA/./split_chain_jobs.py /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5 --log_file /localData/workspace_mm/toga/TOGA/test/toga_2024_10_22_at_14_13.log --parallel_logs_dir /localData/workspace_mm/toga/TOGA/test/temp_logs --jobs_num 100 --jobs /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs --jobs_file /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined --results_dir /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results --rejected /localData/workspace_mm/toga/TOGA/test/temp/rejected/SPLIT_CHAIN_REJ.txt

split_chain_jobs: Use bed file /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed and chain file /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
split_chain jobs: the run data overview is:

* vv: False
* jobs: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs
* results_dir: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results
* errors_dir: None
* chain_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
* bed_file: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed
* index_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain_ID_position
* job_size: None
* jobs_num: 100
* bed_index: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5
* jobs_file: /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined
* ref: hg38
* on_cluster: True
split_chain_jobs: searching for intersections between reference transcripts and chains
split_chain_jobs: chains-to-transcripts dict contains 50186 records
split_chain_jobs: skipped 0 transcripts that do not intersect any chain
split_chain_jobs: preparing 50186 commands
split_chain_jobs: command size of 502 for each cluster job
split_chain_jobs: results in 100 cluster jobs
split_chain_jobs: estimated time: 0:00:00.906445
Command finished with exit code 0.

#### STEP 2: Extract chain features: parallel step

Extracting chain features, project name: chain_feats__test_at_1729606404
Project path: /localData/workspace_mm/toga/TOGA/./nextflow_logs/chain_feats__test_at_1729606404
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /localData/workspace_mm/toga/TOGA/execute_joblist.nf --joblist /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined -c /home/mmora30/localData/workspace_mm/toga/TOGA/nextflow_config_files/extract_chain_features_queue.nf

The process does not get beyond STEP 2. The only messages that might point to a problem are the following:

# Python package versions
* twobitreader: unknown version
...
! scikit-learn: Not installed - will try to install
...

I'm working in a conda environment where all the dependencies were installed. For example, listing the packages shows they are there:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
_py-xgboost-mutex         2.0                       cpu_0    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
c-ares                    1.34.2               heb4867d_0    conda-forge
ca-certificates           2024.9.24            h06a4308_0
cached-property           1.5.2                      py_0
h5py                      3.10.0          nompi_py312h1b477d7_101    conda-forge
hdf5                      1.14.3          nompi_hdf9ad27_105    conda-forge
joblib                    1.4.2              pyhd8ed1ab_0    conda-forge
krb5                      1.21.3               h143b758_0
ld_impl_linux-64          2.43                 h712a8e2_1    conda-forge
libaec                    1.1.3                h59595ed_0    conda-forge
libblas                   3.9.0           24_linux64_openblas    conda-forge
libcblas                  3.9.0           24_linux64_openblas    conda-forge
libcurl                   8.10.1               hbbe4b11_0    conda-forge
libedit                   3.1.20230828         h5eee18b_0
libev                     4.33                 h7f8727e_1
libexpat                  2.6.3                h5888daf_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    14.2.0               h77fa898_1    conda-forge
libgcc-ng                 14.2.0               h69a702a_1    conda-forge
libgfortran               14.2.0               h69a702a_1    conda-forge
libgfortran-ng            14.2.0               h69a702a_1    conda-forge
libgfortran5              14.2.0               hd5240d6_1    conda-forge
libgomp                   14.2.0               h77fa898_1    conda-forge
liblapack                 3.9.0           24_linux64_openblas    conda-forge
libnghttp2                1.64.0               h161d5f1_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libopenblas               0.3.27          pthreads_hac2b453_1    conda-forge
libsqlite                 3.47.0               hadc24fc_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx                 14.2.0               hc0a3c3a_1    conda-forge
libstdcxx-ng              14.2.0               h4852527_1    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxgboost                2.1.1            cpu_h3a1dfae_4    conda-forge
libzlib                   1.3.1                hb9d3cd8_2    conda-forge
ncurses                   6.5                  he02047a_1    conda-forge
networkx                  3.2.1              pyhd8ed1ab_0    conda-forge
numpy                     1.26.4          py312heda63a1_0    conda-forge
openssl                   3.3.2                hb9d3cd8_0    conda-forge
pandas                    2.1.2           py312hfb8ada1_0    conda-forge
pip                       24.2               pyh8b19718_1    conda-forge
py-xgboost                2.1.1           cpu_pyh15c3653_4    conda-forge
python                    3.12.7          hc5c86c4_0_cpython    conda-forge
python-dateutil           2.9.0post0      py312h06a4308_2
python-tzdata             2023.3             pyhd3eb1b0_0
python_abi                3.12                    5_cp312    conda-forge
pytz                      2024.1          py312h06a4308_0
readline                  8.2                  h8228510_1    conda-forge
scikit-learn              1.3.2           py312h394d371_2    conda-forge
scipy                     1.14.1          py312h62794b6_1    conda-forge
setuptools                75.1.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyhd3eb1b0_1
threadpoolctl             3.5.0              pyhc1e730c_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
twobitreader              3.1.7              pyh864c0ab_1    bioconda
tzdata                    2024b                hc8b5060_0    conda-forge
wheel                     0.44.0             pyhd8ed1ab_0    conda-forge
xgboost                   2.1.1           cpu_pyhac85b48_4    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge

So I do not know why it is not detecting those libraries. Maybe this is the source of the issue. Any ideas on how to proceed? I'm working on an Ubuntu VM.

What can I do to solve the issue?
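One way to narrow this down is to ask the interpreter itself what it can import, rather than relying on the conda listing. The sketch below is not TOGA's own check (how TOGA detects packages is not shown in this thread); it only assumes that the distribution name `scikit-learn` is imported as `sklearn`, and the helper names `check_package` and `report` are made up for illustration. It prints the interpreter path too, since the log above reports `/home/mmora30/anaconda3/bin/python3` (3.11.7) while the listing shows Python 3.12.7.

```python
import importlib
import sys
from importlib import metadata

# Distribution name (as listed by conda/pip) -> import name (as used in code).
# The mapping is an assumption for illustration; only scikit-learn differs.
PACKAGES = {
    "twobitreader": "twobitreader",
    "networkx": "networkx",
    "pandas": "pandas",
    "numpy": "numpy",
    "xgboost": "xgboost",
    "scikit-learn": "sklearn",
    "joblib": "joblib",
    "h5py": "h5py",
}


def check_package(dist_name, import_name):
    """Return (importable, version) as seen by the current interpreter."""
    try:
        importlib.import_module(import_name)
        importable = True
    except ImportError:
        importable = False
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


def report(packages=None):
    """Build a list of report lines, starting with the interpreter path."""
    packages = PACKAGES if packages is None else packages
    lines = [f"interpreter: {sys.executable}"]
    for dist_name, import_name in packages.items():
        importable, version = check_package(dist_name, import_name)
        lines.append(f"{dist_name}: importable={importable}, version={version}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running this with the same `python3` that launches `toga.py` shows whether the mismatch is in the environment or only in the log message.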

MichaelHiller commented 1 hour ago

Hmm, I can only see that scikit-learn is not properly installed. Could you pls check this?

@kirilenkobm Do you see another problem in the log?

MauriAndresMU1313 commented 1 hour ago

Thank you for your quick response! In my conda environment, when I check for that package, it shows:

scikit-learn              1.3.2           py312h394d371_2    conda-forge

The same goes for twobitreader, which is reported with an unknown version:

twobitreader              3.1.7              pyh864c0ab_1    bioconda

So, I do not understand. What do you think about installing it with pip instead?

Is there another directory where I can find any other clue about what is happening?

MichaelHiller commented 1 hour ago

Those should be the right versions of both packages. Let's see what Bogdan suggests.

kirilenkobm commented 1 hour ago

Hi!

Seems like the env is correct, but I suspect it may be a nextflow issue (if so, I am sorry for the misleading logs). Could you please try calling

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed test_input/hg38.2bit test_input/mm10.2bit --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms

which is essentially the same command, but without the '--nc' parameter?
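Because the log stops right after `Parallel manager: pushing job nextflow ...`, it may also help to confirm that nextflow itself starts on this machine. A minimal sketch, assuming `nextflow` is expected on `PATH`; the helper name `nextflow_version` is hypothetical, while `-version` is nextflow's own flag:

```python
import shutil
import subprocess


def nextflow_version(executable="nextflow", timeout=60):
    """Return the version banner of nextflow, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"{path} did not answer within {timeout} seconds"
    # Some builds may print to stderr, so fall back to it when stdout is empty.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    banner = nextflow_version()
    print(banner if banner is not None else "nextflow not found on PATH")
```

If this prints nothing useful or hangs, the problem is more likely in the nextflow installation than in the Python environment.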

MauriAndresMU1313 commented 1 hour ago

Thank you for your suggestion. Unfortunately, it looks like nothing has changed so far:

#### Initiating TOGA class ####
# python interpreter path: /home/mmora30/anaconda3/envs/toga/bin/python3
# python interpreter version: 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Version 1.1.8.dev
Commit: c4b5fafa15a94e00fd17eea621af0686ac23041d
Branch: master

# Python package versions
* twobitreader: unknown version
* networkx: 3.2.1
* pandas: 2.1.2
* numpy: 1.26.4
* xgboost: 2.1.1
! scikit-learn: Not installed - will try to install
* joblib: 1.4.2
* h5py: 3.10.0
Calling cmd:
/localData/workspace_mm/toga/TOGA/./configure.sh

Compiling C code...
Model found
CESAR installation found
Command finished with exit code 0.
Does it work?
Calling cmd:
/localData/workspace_mm/toga/TOGA/./modules/chain_score_filter test_input/hg38.mm10.chr11.chain 15000 > /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain

Command finished with exit code 0.
Writing isoforms data for 3674 transcripts.
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 455 sequences in /localData/workspace_mm/toga/TOGA/test_input/hg38.2bit
Found 66 sequences in /localData/workspace_mm/toga/TOGA/test_input/mm10.2bit
Saving output to /localData/workspace_mm/toga/TOGA/test
Arguments stored in /localData/workspace_mm/toga/TOGA/test/project_args.json

#### STEP 0: making chain and bed file indexes

Started chain indexing...
chain_bst_index: indexing 79183 chains
chain_bst_index: Saved chain /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain index to /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.bst
Started bed file indexing...
bed_hdf5_index: indexed 3674 transcripts

#### STEP 1: Generate extract chain features jobs

Calling cmd:
/localData/workspace_mm/toga/TOGA/./split_chain_jobs.py /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5 --log_file /localData/workspace_mm/toga/TOGA/test/toga_2024_10_22_at_15_01.log --parallel_logs_dir /localData/workspace_mm/toga/TOGA/test/temp_logs --jobs_num 100 --jobs /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs --jobs_file /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined --results_dir /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results --rejected /localData/workspace_mm/toga/TOGA/test/temp/rejected/SPLIT_CHAIN_REJ.txt

split_chain_jobs: Use bed file /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed and chain file /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
split_chain jobs: the run data overview is:

* vv: False
* jobs: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_jobs
* results_dir: /localData/workspace_mm/toga/TOGA/test/temp/chain_classification_results
* errors_dir: None
* chain_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain
* bed_file: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.bed
* index_file: /localData/workspace_mm/toga/TOGA/test/temp/genome_alignment.chain_ID_position
* job_size: None
* jobs_num: 100
* bed_index: /localData/workspace_mm/toga/TOGA/test/temp/toga_filt_ref_annot.hdf5
* jobs_file: /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined
* ref: hg38
* on_cluster: True
split_chain_jobs: searching for intersections between reference transcripts and chains
split_chain_jobs: chains-to-transcripts dict contains 50186 records
split_chain_jobs: skipped 0 transcripts that do not intersect any chain
split_chain_jobs: preparing 50186 commands
split_chain_jobs: command size of 502 for each cluster job
split_chain_jobs: results in 100 cluster jobs
split_chain_jobs: estimated time: 0:00:01.109042
Command finished with exit code 0.

#### STEP 2: Extract chain features: parallel step

Extracting chain features, project name: chain_feats__test_at_1729609301
Project path: /localData/workspace_mm/toga/TOGA/./nextflow_logs/chain_feats__test_at_1729609301
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /localData/workspace_mm/toga/TOGA/execute_joblist.nf --joblist /localData/workspace_mm/toga/TOGA/test/temp/chain_class_jobs_combined

Do you have another suggestion? Maybe there is some file that could give more insight into what is going on? Or could it be because I'm running locally and not using a scheduler like slurm? Just an idea.
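On the question of which files to look at: nextflow normally writes a `.nextflow.log` in the directory it was launched from, and each task's work directory holds `.command.err` and `.command.log`. The sketch below collects those files under a given directory and prints the last lines of each. It assumes the logs end up under the "Project path" printed above (`nextflow_logs/chain_feats__test_at_...`), which is not confirmed in this thread; `find_logs` and `tail` are hypothetical helper names.

```python
import sys
from pathlib import Path

# File names nextflow uses for its own log and for per-task output.
TASK_LOG_NAMES = {".command.err", ".command.log"}


def find_logs(project_dir):
    """Return nextflow log files found anywhere below project_dir, sorted."""
    root = Path(project_dir)
    if not root.is_dir():
        return []
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and (p.name.startswith(".nextflow.log") or p.name in TASK_LOG_NAMES)
    )


def tail(path, n=40):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-n:]


if __name__ == "__main__":
    # Pass the project path from the TOGA output as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for log_path in find_logs(target):
        print(f"==> {log_path}")
        print("".join(tail(log_path)), end="")
```

An empty result would suggest nextflow never got as far as creating its log in that directory.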