jeffdaily / parasail

Pairwise Sequence Alignment Library

-t 24 not working threading #96

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Hello Team,

parasail_aligner -a parasail_sg_qx_striped_32 -q SILVA_138.1_SSURef_tax_silva_prok_nr_sbsample_half.fasta -f Danish_01.fa -d -t 24 -g parasail.csv

With many sequences in the query file (-q) and only one in the -f file, I noticed that parasail does not run in parallel at all, even though I asked it to use 24 threads.

Any idea why?

Thanks,

Jianshu

jeffdaily commented 2 years ago

I'm assuming you were building from source since the parasail_aligner app isn't shipped as part of the wheel installs.

If you run parasail_aligner -v it will have one of the following messages:

threads: system-specific default, must be >= 1

or

threads: Warning: ignored; OpenMP was not supported by your compiler

Which one do you see?

jeffdaily commented 2 years ago

Also, if openmp was not found during configuration, you should receive a runtime warning if you specified -t but it wasn't supported.

-t number of threads requested, but OpenMP was not found during configuration. Running without threads.

jianshu93 commented 2 years ago

It is Linux. I am building with CMake, and I see the first message. I do not get the OpenMP warning.

Jianshu

jeffdaily commented 2 years ago

The code is calling omp_set_num_threads and not doing anything else special. I found the following answer on Stack Overflow that might help you: https://stackoverflow.com/a/11096742.

Try setting the env var OMP_DYNAMIC=0 and see if it helps.

Does your top or htop output verify whether threading is being used?
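When top or htop cannot be watched interactively (for example inside a batch job), the thread count of a running process can also be read from /proc on Linux. The following is a minimal sketch, not part of parasail; the helper names thread_count_from_status and thread_count are hypothetical, and it assumes a Linux /proc filesystem. It reports how many threads exist, not whether they are busy.

```python
import re
from typing import Optional


def thread_count_from_status(status_text: str) -> Optional[int]:
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)\s*$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def thread_count(pid: int) -> Optional[int]:
    """Return the thread count of a running process, or None if unavailable.

    Assumes Linux; on systems without /proc this simply returns None.
    """
    try:
        with open(f"/proc/{pid}/status") as handle:
            return thread_count_from_status(handle.read())
    except OSError:
        return None
```

Calling thread_count with the PID of the running parasail_aligner process a few seconds after it starts would show whether more than one thread was created.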

jianshu93 commented 2 years ago

Hello Jeff,

I tried, but I cannot run it in the background; it fails with the following error:

input file, query file, and stdin detected; max inputs is 2

I was using the slurm script to submit to a supercomputer:

#!/bin/bash
#SBATCH --partition=ieg_128g,ieg_lm ### Partition (like a queue in PBS)
#SBATCH --job-name=parasial_16S ### Job Name
#SBATCH -o /condo/ieg/jianshu/log/%x.%j.%N.out ### File in which to store job output
#SBATCH -e /condo/ieg/jianshu/log/%x.%j.%N.err ### File in which to store job error
#SBATCH --time=48:00:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Node count required for the job
#SBATCH --ntasks=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=24 ### Number of threads per task (OMP threads)
#SBATCH --mem=60G ### memory for each job
#SBATCH --mail-type=FAIL ### When to send mail
#SBATCH --mail-user=jianshuzhao@yahoo.com. ### mail to send
#SBATCH --get-user-env ### Import your user environment setup
#SBATCH --requeue ### On failure, requeue for another try
#SBATCH --verbose

source ~/.bashrc
cd /home/jianshu/data
which parasail_aligner
parasail_aligner -a parasail_sg_qx_striped_32 -q SILVA_138.1_SSURef_tax_silva_prok_nr_new.fasta -f Danish_01.fa -d -t 24 -g parasail.csv

The parasail_aligner binary was built with CMake (mkdir build; cd build; cmake ..; make -j 12).

Running it with nohup ... & did not work either; it fails with the same error.

Any idea why?

Thanks,

Jianshu

jeffdaily commented 2 years ago

Is this still related to your original question of OpenMP not working? Can we open a new issue for the new observation, "input file, query file, and stdin detected; max inputs is 2"? It seems parasail_aligner needs some fixes to how it detects whether there is piped input from stdin.
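To illustrate why detecting piped stdin can be subtle: under nohup or a batch scheduler, stdin is typically not a terminal even though nothing is being piped in, so a check of "is stdin a terminal?" alone cannot tell the two cases apart. The sketch below is a hypothetical helper (classify_fd is not a parasail function) and does not reproduce parasail_aligner's actual logic, which the thread does not show. It classifies a file descriptor by what it refers to.

```python
import os
import stat


def classify_fd(fd: int) -> str:
    """Describe what kind of object a file descriptor refers to."""
    if os.isatty(fd):
        return "terminal"
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"              # e.g. `cat query.fa | program`
    if stat.S_ISREG(mode):
        return "regular file"      # e.g. `program < query.fa`
    if stat.S_ISCHR(mode):
        return "character device"  # e.g. stdin redirected from /dev/null
    return "other"
```

Under these assumptions, a program would treat only "pipe" and "regular file" as real input on stdin, and ignore a terminal or a character device such as /dev/null.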

jianshu93 commented 2 years ago

Yes! I need to run it in the background so that I can use htop or top. It is a server, so I do not have a choice, and the query sequence file is very large (2 million sequences).

Thanks,

Jianshu

jeffdaily commented 2 years ago

Please try pulling and building the following branch. It needs more testing, but I hope it resolves your current issue with stdin. Please let me know if it does resolve your issue so I can create a new release.

hotfix/2.6.1

jianshu93 commented 2 years ago

Hello Jeff,

I still get the error, which is really strange (I downloaded the zip of the hotfix branch and compiled it). I was using:

nohup parasail_aligner -a parasail_sg_qx_striped_32 -q SILVA_138.1_SSURef_tax_silva_prok_nr_new.fasta -f Danish_01.fa -d -t 24 -g parasail.csv &

and error is:

input file, query file, and stdin detected; max inputs is 2
0.00user 0.01system 0:00.02elapsed 55%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+8outputs (0major+1560minor)pagefaults 0swaps

Thanks,

Jianshu

jianshu93 commented 2 years ago

Do you have the same problem on your side, for example when running with nohup ... &?

Thanks,

Jianshu

jianshu93 commented 2 years ago

I do not have any problems on macOS after following exactly the same build steps, which is very strange.

Thanks,

Jianshu

jeffdaily commented 2 years ago

I could reproduce with nohup. Please try the following, where you pipe the query file through nohup as stdin. parasail_aligner does accept stdin query files.

nohup parasail_aligner -a parasail_sg_qx_striped_32 -f Danish_01.fa -d -t 24 -g parasail.csv < SILVA_138.1_SSURef_tax_silva_prok_nr_new.fasta &

jianshu93 commented 2 years ago

Just tried, still the same error with < SILVA_138.1_SSURef_tax_silva_prok_nr_new.fasta

jianshu93 commented 2 years ago

Hello Jeff,

I am confused by the output of the above command:

0,0,1490,1541,1059,1489,1540
1,0,1535,1541,1097,1534,1540
2,0,1534,1541,1095,1533,1540
3,0,1545,1541,912,1544,1539
4,0,1515,1541,998,1514,1539
5,0,1514,1541,990,1513,1539
6,0,1514,1541,988,1513,1539

Which column is the alignment score by default?

Thanks,

Jianshu

jianshu93 commented 2 years ago

Hello Jeff,

A quick question: is the score a metric? In particular, does it satisfy the triangle inequality in sg mode?

Thanks jianshu
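The thread does not answer this question. As general background, a raw alignment score is a similarity (higher is better, and a sequence aligned to itself scores high rather than zero), so it would first have to be converted to a distance before metric properties could apply. Whether a particular set of pairwise distances obeys the triangle inequality can be checked empirically. The sketch below assumes a hypothetical square, all-vs-all distance matrix; the function name triangle_violations and the example values are illustrative and not taken from parasail output.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def triangle_violations(dist: Sequence[Sequence[float]]) -> List[Tuple[int, int, int]]:
    """Return all (i, j, k) with dist[i][k] > dist[i][j] + dist[j][k]."""
    n = len(dist)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if dist[i][k] > dist[i][j] + dist[j][k]
    ]


# Hypothetical example matrices: the first satisfies the inequality,
# the second does not (going 0 -> 1 -> 2 is shorter than 0 -> 2 directly).
consistent = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
inconsistent = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
```

An empty result on a sample does not prove that the measure is a metric; a non-empty result shows that it is not.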

jeffdaily commented 2 years ago

For the default/basic output, it writes one line per alignment performed. The sequences are numbered starting from 0 for the input file and query.

i, j, i_len, j_len, parasail_result_get_score(result), parasail_result_get_end_query(result), parasail_result_get_end_ref(result)
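Given that column order, the default output can be loaded into named fields so the score column (the fifth) is unambiguous. This is a minimal sketch: the names AlignmentRow and parse_rows are hypothetical, the field names simply mirror the list above, and the sample rows are copied from the output quoted earlier in this thread.

```python
import csv
import io
from typing import List, NamedTuple


class AlignmentRow(NamedTuple):
    i: int          # index of the first sequence (0-based)
    j: int          # index of the second sequence (0-based)
    i_len: int      # length of sequence i
    j_len: int      # length of sequence j
    score: int      # parasail_result_get_score
    end_query: int  # parasail_result_get_end_query
    end_ref: int    # parasail_result_get_end_ref


def parse_rows(text: str) -> List[AlignmentRow]:
    """Parse the default comma-separated output, one alignment per line."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if fields:  # skip blank lines
            rows.append(AlignmentRow(*(int(value) for value in fields)))
    return rows


SAMPLE = """0,0,1490,1541,1059,1489,1540
1,0,1535,1541,1097,1534,1540
2,0,1534,1541,1095,1533,1540
"""

rows = parse_rows(SAMPLE)
best = max(rows, key=lambda row: row.score)  # highest-scoring alignment
```

For a full parasail.csv, the same function can be applied to the file contents read with open(...).read().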

jianshu93 commented 2 years ago

Hello Jeff,

I found that parasail generates very different results compared to other global alignment tools, such as vsearch and edlib. I used this command:

parasail_aligner -a parasail_sg_striped_32 -f Danish_01.fa -d -t 4 -g parasail.csv -q Danish_HQ_MQ_MAG_16S_new.fa

Test.zip

parasail.csv edlib-aligner_Danish_01_new.txt

query_vsearch_new.txt

parasail.csv is the parasail output sorted by the score column. edlib-aligner_Danish_01_new.txt contains the results from edlib, produced with:

edlib-aligner -m HW -p -l Danish_HQ_MQ_MAG_16S.fa Danish_01.fa > edlib-aligner_Danish_01_alignment.txt

while query_vsearch_new.txt is from vsearch:

vsearch --usearch_global ./Danish_01.fa --db Danish_HQ_MQ_MAG_16S.fa --id 0.1 --strand both --maxaccepts 0 --maxrejects 0 --blast6out query_vsearch_new.txt --threads 4

I attached the two FASTA files in Test.zip. I double-checked that edlib and vsearch give very similar results for the top 5 best hits, while parasail's results are very different. I used semi-global alignment for all tools.

Any idea why?

Thanks,

Jianshu

jianshu93 commented 2 years ago

Note that you may need to use grep to extract fasta ID from edlib and parasail output for query names to compare with vsearch.

Jianshu

jeffdaily commented 2 years ago

The alignment function you selected is semi-global, the "sg" in parasail_sg_striped_32. If you wanted global alignment, that would be "nw" for Needleman-Wunsch.
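The prefix of the function name passed to -a therefore determines the alignment class. As a rough sketch, the token after "parasail_" can be mapped to that class. The helper alignment_class is hypothetical; it covers only the class token (nw, sg, sw) and deliberately does not interpret the remaining parts of the name, such as qx, striped or 32, which are described in the parasail README.

```python
ALIGNMENT_CLASSES = {
    "nw": "global (Needleman-Wunsch)",
    "sg": "semi-global",
    "sw": "local (Smith-Waterman)",
}


def alignment_class(function_name: str) -> str:
    """Return the alignment class encoded in a parasail function name."""
    prefix = "parasail_"
    if not function_name.startswith(prefix):
        raise ValueError(f"not a parasail function name: {function_name!r}")
    # The class is the first underscore-separated token after the prefix.
    token = function_name[len(prefix):].split("_", 1)[0]
    return ALIGNMENT_CLASSES.get(token, "unknown")
```

For the two names used in this thread, parasail_sg_qx_striped_32 and parasail_sg_striped_32, the helper reports "semi-global" in both cases.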

jianshu93 commented 2 years ago

Yes, I want semi-global. The other two are both semi-global. Have you benchmarked against a standard dataset?

Thanks

Jianshu

jeffdaily commented 2 years ago

Perhaps a more specific semi-global alignment behavior is what you were looking for? Please see the table of all semi-global options.

https://github.com/jeffdaily/parasail#standard-function-naming-convention

As far as benchmarking against standard datasets, no. The tests I wrote only ensure that the reference (non-vectorized) implementations such as parasail_sw get the same results as all of the vectorized variants. The local alignment implementations were initially based on the SSW library and confirmed to get the same results. Early versions of this software only calculated the alignment score and some alignment statistics; when the traceback feature was added, the results were compared against EMBOSS and SSW using a handful of randomly selected sequences that are part of the parasail source tree under the data directory.

Also, this project is mostly in maintenance mode. I do not have the time to benchmark against any datasets.