rsa-tools / rsat-code

This repo contains the code required to run a local version of the software suite Regulatory Sequence Analysis Tools (RSAT).
http://rsat.eu
GNU Affero General Public License v3.0

dyad-analysis' clustering options failing #5

Open rioualen opened 2 years ago

rioualen commented 2 years ago

Hi,

Last year a clustering option was added to the dyad-analysis pipeline, but I couldn't make it run. It seems to fail upon executing matrix-from-patterns because of missing files.

This is the error I get using the option -cluster sig:

Server command
$RSAT/perl-scripts/purge-sequence -i $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.163413_EgZE6D.fasta -format fasta -o $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.163413_EgZE6D.fasta.purged; $RSAT/perl-scripts/dyad-analysis -i $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.163413_EgZE6D.fasta.purged -v 1 -quick -sort -timeout 3600  -type any -1str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-16  -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2

Command to generate matrices (PSSM): $RSAT/perl-scripts/matrix-from-patterns -v 1 -logo  -seq $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.163413_EgZE6D.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm
/space23/rsat/perl-scripts/convert-matrix -i /space23/rsat/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_sig_matrices_rescaled.tf -from tf -to tf -o /space23/rsat/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_sig_matrices_nr_data/sig_input_motifs_processed_1.tf
Error: Matrix
sig_m4_assembly_4
contains
38
columns. All matrices should have the same width as the first matrix (42).
Error occurred on RSAT site: sinik; host server: sinik; admin: Jacques.van-Helden@univ-amu.fr

Error: OpenInputFile: File /space23/rsat/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_sig_matrices_nr_clusters_information/cluster_2/merged_consensuses/node_4/cluster_2_node_4_matrices.tf does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: Jacques.van-Helden@univ-amu.fr

Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_sig_matrices_nr_cluster_root_motifs.tf

Warning: Input file contained not a single matrix

Error: OpenInputFile: File /space23/rsat/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_sig_sites.ft does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: Jacques.van-Helden@univ-amu.fr

Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_count_matrices.tf

Warning: Input file contained not a single matrix

Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.163413_JfEfs9_pssm_count_matrices.tf

Warning: Input file contained not a single matrix
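The first error in the log above reports that matrix sig_m4_assembly_4 has 38 columns while the first matrix has 42. A minimal sketch for listing the width of every matrix in a TRANSFAC-formatted file, assuming the layout shown elsewhere in this thread (an `AC` line, a `P0` header, one numbered row per position, `//` as terminator). `matrix_widths` is a hypothetical helper, not an RSAT command:

```shell
#!/bin/sh
# Hypothetical helper (not part of RSAT): print "<matrix ID><TAB><width>"
# for each matrix of a TRANSFAC-formatted (.tf) file.
# Assumption: rows of the count table start with a position number and
# sit between the "P0" (or "PO") header and the next "XX" line.
matrix_widths() {
    awk '
        $1 == "AC"                   { id = $2; width = 0; in_matrix = 0 }
        $1 == "P0" || $1 == "PO"     { in_matrix = 1; next }
        in_matrix && $1 ~ /^[0-9]+$/ { width++; next }
        in_matrix && $1 == "XX"      { in_matrix = 0 }
        $1 == "//"                   { print id "\t" width }
    ' "$1"
}
```

Running `matrix_widths` on the `_pssm_sig_matrices_rescaled.tf` file named in the log would show which matrices deviate from the width of the first one.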

This is the error I get using the option -cluster counts:

Server command
$RSAT/perl-scripts/purge-sequence -i $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.162804_w1m9tN.fasta -format fasta -o $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.162804_w1m9tN.fasta.purged; $RSAT/perl-scripts/dyad-analysis -i $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.162804_w1m9tN.fasta.purged -v 1 -quick -sort -timeout 3600  -type any -1str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-16  -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2

Command to generate matrices (PSSM): $RSAT/perl-scripts/matrix-from-patterns -v 1 -logo  -seq $RSAT/public_html/tmp/apache/2021/08/03/tmp_sequence_2021-08-03.162804_w1m9tN.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.162804_LiEulL.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster counts -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.162804_LiEulL_pssm
Error: OpenInputFile: File /space23/rsat/public_html/tmp/apache/2021/08/03/dyad-analysis_2021-08-03.162804_LiEulL_pssm_count_matrices.tf does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: Jacques.van-Helden@univ-amu.fr

The option -cluster both also fails, and the option -cluster none works normally.

I tried it both on the RSAT prokaryotes web server and on the command line on rsatix, using the following genome for background: Escherichia_coli_GCF_000005845.2_ASM584v2
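To compare the four clustering modes side by side, something like the following sketch could be used. It reuses the matrix-from-patterns options from the commands above (without -logo) and assumes that a successful run leaves a non-empty `<prefix>_count_matrices.tf` file. The `MFP` variable, the `try_cluster_modes` name and the `cluster_test_` prefix are hypothetical, introduced only for this illustration:

```shell
#!/bin/sh
# Sketch: run matrix-from-patterns once per -cluster mode and report the
# exit status and whether a non-empty count-matrix file was produced.
# MFP is a hypothetical override for the program name (handy for testing).
MFP="${MFP:-matrix-from-patterns}"

try_cluster_modes() {
    seq_file=$1     # fasta file with the input sequences
    asmb_file=$2    # assembly file produced by pattern-assembly
    for mode in none counts sig both; do
        prefix="cluster_test_${mode}"
        "$MFP" -v 1 -seq "$seq_file" -format fasta -asmb "$asmb_file" \
            -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster "$mode" \
            -uth Pval 0.00025 -bginput -markov 0 -o "$prefix" \
            > "${prefix}.log" 2>&1
        status=$?
        # Assumption: the count matrices land in <prefix>_count_matrices.tf
        if [ -s "${prefix}_count_matrices.tf" ]; then
            result="matrices written"
        else
            result="no matrices"
        fi
        printf '%s\texit=%s\t%s\n' "$mode" "$status" "$result"
    done
}
```

Each mode writes its messages to its own `cluster_test_<mode>.log`, so the errors of one mode do not get mixed with those of another.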

@jvanheld @jaimicore LexA_sites.fasta.txt

jvanheld commented 2 years ago

Hi @rioualen

Thanks for reporting the bug.

Here is the command recast to work with the example file.

## Create debug dir
mkdir -p ~/rsat_debug/matrix-from-patterns_bug_2021-10-13
cd  ~/rsat_debug/matrix-from-patterns_bug_2021-10-13

## Download the test file
wget --no-clobber https://github.com/rsa-tools/rsat-code/files/7339042/LexA_sites.fasta.txt

####
## Run the commands to reproduce the bug 

## Purge the sequence
purge-sequence -i LexA_sites.fasta.txt -format fasta \
   -o LexA_sites.fasta_purged.fasta

## Detect over-represented dyads
dyad-analysis -i LexA_sites.fasta_purged.fasta \
   -v 1 -quick -sort -timeout 3600  -type any -1str -noov -lth occ 1 -lth occ_sig 0 \
   -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-16  -bg upstream-noorf \
   -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
   -o dyads.tsv

## Assemble the over-represented dyads
pattern-assembly -v 1 -i dyads.tsv \
   -2str -maxfl 1 -subst 0 \
   -o dyads_asembled.asmb

## Generate a matrix (PSSM) from the over-represented dyads
matrix-from-patterns -v 1 -logo  \
   -seq LexA_sites.fasta_purged.fasta -format fasta \
   -asmb dyads_asembled.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 \
   -o matrix_from_dyads

jvanheld commented 2 years ago

@rioualen

On the command line (run on the Fungi server) I obtain the following result, which looks like a correct matrix.

more matrix_from_dyads_count_matrices.tf
AC  cluster_1
XX
ID  cluster_1
XX
DE  mhwwwACTGkATawwtATmCAGTwwwdk
P0           a         c         g         t
1           14        10         7         9
2           11        11         4        14
3           11         2         4        23
4           21         2         5        12
5           10         3         0        27
6           29         2         9         0
7            0        39         0         1
8            0         0         0        40
9            0         0        40         0
10           2         0        11        27
11          28         2         2         8
12           3         3         1        33
13          25         0         8         7
14          10         2         4        24
15          24         4         2        10
16           7         8         0        25
17          33         1         3         3
18           8         2         2        28
19          27        11         0         2
20           0        40         0         0
21          40         0         0         0
22           1         0        39         0
23           0         9         2        29
24          27         0         3        10
25          12         5         2        21
26          23         4         2        11
27          14         4        11        11
28           9         7        10        14
XX
CC  program: feature
CC  matrix.nb: 1
CC  matrix.nb: 1
CC  sites: 40
CC  consensus.strict: attatACTGtATatatATaCAGTataat
CC  consensus.strict.rc: ATTATACTGTATATATATACAGTATAAT
CC  consensus.IUPAC: mhwwwACTGkATawwtATmCAGTwwwdk
CC  consensus.IUPAC.rc: MHWWWACTGKATAWWTATMCAGTWWWDK
CC  consensus.regexp: [ac][act][at][at][at]ACTG[gt]ATa[at][at]tAT[ac]CAGT[at][at][at][agt][gt]
CC  consensus.regexp.rc: [AC][ACT][AT][AT][AT]ACTG[GT]ATA[AT][AT]TAT[AC]CAGT[AT][AT][AT][AGT][GT]
XX
//

jvanheld commented 2 years ago

I ran the same test on the sinik server (in Cuernavaca) and I get the same result (one matrix from cluster 1).

jvanheld commented 2 years ago

However, when I analyse the same fasta file with dyad-analysis on the Web server https://bacteria.rsat.eu/, there is a bug:

Command to generate matrices (PSSM): $RSAT/perl-scripts/matrix-from-patterns -v 1 -logo -seq $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.101535_uJAOZZ.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm
/space23/rsat_2021/rsat/perl-scripts/convert-matrix -i /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_rescaled.tf -from tf -to tf -o /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_data/sig_input_motifs_processed_1.tf
Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_tables/clusters.tab does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx

Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_cluster_root_motifs.tf does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx

Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_sites.ft does not exist.
Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx

Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_count_matrices.tf

Warning: Input file contained not a single matrix

Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_count_matrices.tf

Warning: Input file contained not a single matrix
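Several of the files reported missing live in subdirectories (for example `..._nr_tables/clusters.tab`), so it may help to know whether the directory itself was ever created. A rough sketch, assuming the log is saved to a file with one "OpenInputFile" error per line; `missing_files_report` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: read an RSAT error log on stdin, extract the paths from
# "OpenInputFile: File <path> does not exist" messages, and say for each
# one whether its parent directory exists.
# Assumption: at most one such message per line.
missing_files_report() {
    sed -n 's/.*OpenInputFile: File \(.*\) does not exist.*/\1/p' |
    LC_ALL=C sort -u |
    while IFS= read -r path; do
        if [ -d "$(dirname "$path")" ]; then
            printf '%s\tparent directory exists\n' "$path"
        else
            printf '%s\tparent directory missing\n' "$path"
        fi
    done
}
```

A missing parent directory would suggest that the clustering step stopped before writing any of its results, rather than failing on one particular file.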

jvanheld commented 2 years ago

On the Fungi server the program runs fine, but there are two matrices in the result.


jvanheld commented 2 years ago

Here are the commands that are running on the Bacteria server.

$RSAT/perl-scripts/convert-seq  -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.103246_HeagcC -from  fasta -to fasta  -mask non-dna -o $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.103246_HeagcC.fasta

$RSAT/perl-scripts/purge-sequence -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta -format fasta -o $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged; $RSAT/perl-scripts/dyad-analysis -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged -v 1 -quick -sort -timeout 3600  -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20  -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2

 dyad-analysis  -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged -v 1 -quick -sort -timeout 3600 -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20 -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2

pattern-assembly command: $RSAT/perl-scripts/pattern-assembly -v 1 -subst 1 -weight 5 -maxfl 1 -toppat 50 -2str -max_asmb_nb 20 -i $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.tab -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.asmb

Command to generate matrices (PSSM):

$RSAT/perl-scripts/matrix-from-patterns -v 1 -logo  -seq $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU_pssm

I guess the bug occurs in the last command. The strange thing is that on the Fungi server this bug does not occur. I will check this soon.
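When a bug shows up on one server and not on another, it can be worth checking that the commands really are identical. The commands quoted in this thread are not: the command-line reproduction calls pattern-assembly with -subst 0, whereas the Bacteria server runs it with -subst 1 -weight 5 -toppat 50. Whether this matters for the bug is not established here. A small sketch for comparing two command lines option by option; `split_options` and `diff_options` are hypothetical helper names:

```shell
#!/bin/sh
# Sketch: split a command line into one "option [values...]" chunk per
# line, sorted. Assumption: every token starting with "-" followed by a
# letter or digit opens a new option (so a negative number used as a
# value would be misread as an option).
split_options() {
    printf '%s\n' "$1" | awk '{
        chunk = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^-[A-Za-z0-9]/ && chunk != "") { print chunk; chunk = $i }
            else chunk = (chunk == "" ? $i : chunk " " $i)
        }
        if (chunk != "") print chunk
    }' | LC_ALL=C sort
}

# Print the chunks found only in the first command ("< ") and only in
# the second command ("> ").
diff_options() {
    a=$(mktemp) b=$(mktemp)
    split_options "$1" > "$a"
    split_options "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}
```

Both command lines are passed as quoted strings, e.g. `diff_options "pattern-assembly -v 1 ..." "pattern-assembly -v 1 ..."`.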

jvanheld commented 2 years ago

@santanaw checked on a virtual machine and the bug does not appear there. According to @santanaw, this bug seems related to a previous bug in matrix-clustering, which he and @jaimicore have fixed. The next hypothesis is that the R package needs to be recompiled on the Bacteria server.

jvanheld commented 2 years ago

I recompiled the R packages and the bug remains. Apparently the problem is elsewhere.

santanaw commented 2 years ago

@rioualen, @jvanheld, @amedina-liigh The issue arises because the RSAT Prokaryotes Web server uses the previously installed R (version 3.6); the appropriate R distribution (version 4.1) is never found. Please note that this is not the case on the command line. The installed packages are intended for v4.1, which in turn leads to problems when loading them in R 3.6.

I will get back to the sysadmins to find a solution.
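One possible way to check which R a given environment would pick up is to resolve `Rscript` against an explicit search path. This is only a sketch: it assumes the web server and the login shell differ in their PATH, which the thread does not confirm, and `which_in_path` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the executable named $1 that would be found with the
# colon-separated search path $2, without touching the caller's PATH.
which_in_path() {
    ( PATH=$2; command -v "$1" )
}
```

For example, `which_in_path Rscript "$PATH"` in a login shell can be compared with the result for the PATH value seen by the web server process (how to obtain that value depends on the server setup).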

jvanheld commented 2 years ago

With @rioualen , we just did a new test with the Fungi server (http://fungi.rsat.eu/).
On the web interface, the motifs are clustered correctly with the matrix-clustering option "sig" but not with the option "counts".

jvanheld commented 2 years ago

A new small test to debug: E.coli upstream sequences of ompR target genes

Placed on rsatix: /home/rsat/packages/rsat-2021/debug/dyad-assembly-bug_2021-11

jvanheld commented 2 years ago

Dyad-analysis command

dyad-analysis  -i ompR_targets.fasta  -v 1 -quick -sort -timeout 3600 -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20 -o dyads.tsv

jvanheld commented 2 years ago

Pattern-assembly command

pattern-assembly  -v 1 -subst 1 -weight 5 -maxfl 1 -toppat 50 -2str -max_asmb_nb 20 -i dyads.tsv -o dyads.asmb

jvanheld commented 2 years ago

Matrix-from-patterns

matrix-from-patterns -v 1 -logo  -seq  ompR_targets.fasta -format fasta \
   -asmb dyads.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster none \
   -uth Pval 0.00025 -bginput -markov 0 \
   -o dyads_pssm

convert-matrix  -v 1 -i dyads_pssm_count_matrices.txt -from tab -to tab \
   -return counts -return consensus -return logo -logo_format png \
   -logo_opt -e -logo_opt -M \
   -logo_dir logos -o dyads_