Open rioualen opened 2 years ago
Hi @rioualen
Thanks for reporting the bug.
Here is the command recast to work with the example file.
## Create debug dir
mkdir -p ~/rsat_debug/matrix-from-patterns_bug_2021-10-13
cd ~/rsat_debug/matrix-from-patterns_bug_2021-10-13
## Download the test file
wget --no-clobber https://github.com/rsa-tools/rsat-code/files/7339042/LexA_sites.fasta.txt
####
## Run the commands to reproduce the bug
## Purge the sequence
purge-sequence -i LexA_sites.fasta.txt -format fasta \
-o LexA_sites.fasta_purged.fasta
## Detect over-represented dyads
dyad-analysis -i LexA_sites.fasta_purged.fasta \
-v 1 -quick -sort -timeout 3600 -type any -1str -noov -lth occ 1 -lth occ_sig 0 \
-uth rank 50 -return occ,proba,rank -l 3 -spacing 0-16 -bg upstream-noorf \
-org Escherichia_coli_GCF_000005845.2_ASM584v2 \
-o dyads.tsv
## Assemble the over-represented dyads
pattern-assembly -v 1 -i dyads.tsv \
-2str -maxfl 1 -subst 0 \
-o dyads_asembled.asmb
## Generate a matrix (PSSM) from the over-represented dyads
matrix-from-patterns -v 1 -logo \
-seq LexA_sites.fasta_purged.fasta -format fasta \
-asmb dyads_asembled.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 \
-o matrix_from_dyads
@rioualen
On the command line (run on the Fungi server) I obtain the following result, which looks like a correct matrix.
more matrix_from_dyads_count_matrices.tf
AC cluster_1
XX
ID cluster_1
XX
DE mhwwwACTGkATawwtATmCAGTwwwdk
P0 a c g t
1 14 10 7 9
2 11 11 4 14
3 11 2 4 23
4 21 2 5 12
5 10 3 0 27
6 29 2 9 0
7 0 39 0 1
8 0 0 0 40
9 0 0 40 0
10 2 0 11 27
11 28 2 2 8
12 3 3 1 33
13 25 0 8 7
14 10 2 4 24
15 24 4 2 10
16 7 8 0 25
17 33 1 3 3
18 8 2 2 28
19 27 11 0 2
20 0 40 0 0
21 40 0 0 0
22 1 0 39 0
23 0 9 2 29
24 27 0 3 10
25 12 5 2 21
26 23 4 2 11
27 14 4 11 11
28 9 7 10 14
XX
CC program: feature
CC matrix.nb: 1
CC matrix.nb: 1
CC sites: 40
CC consensus.strict: attatACTGtATatatATaCAGTataat
CC consensus.strict.rc: ATTATACTGTATATATATACAGTATAAT
CC consensus.IUPAC: mhwwwACTGkATawwtATmCAGTwwwdk
CC consensus.IUPAC.rc: MHWWWACTGKATAWWTATMCAGTWWWDK
CC consensus.regexp: [ac][act][at][at][at]ACTG[gt]ATa[at][at]tAT[ac]CAGT[at][at][at][agt][gt]
CC consensus.regexp.rc: [AC][ACT][AT][AT][AT]ACTG[GT]ATA[AT][AT]TAT[AC]CAGT[AT][AT][AT][AGT][GT]
XX
//
I ran the same test on the sinik server (in Cuernavaca) and I get the same result (one matrix from cluster 1).
However, when I analyse the same fasta file with dyad-analysis on the Web server https://bacteria.rsat.eu/ there is a bug:
Command to generate matrices (PSSM):
$RSAT/perl-scripts/matrix-from-patterns -v 1 -logo -seq $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.101535_uJAOZZ.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm
/space23/rsat_2021/rsat/perl-scripts/convert-matrix -i /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_rescaled.tf -from tf -to tf -o /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_data/sig_input_motifs_processed_1.tf
Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_tables/clusters.tab does not exist. Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx
Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_matrices_nr_cluster_root_motifs.tf does not exist. Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx
Error: OpenInputFile: File /space23/rsat_2021/rsat/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_sig_sites.ft does not exist. Error occurred on RSAT site: sinik; host server: sinik; admin: ati@ccg.unam.mx
Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_count_matrices.tf
Warning: Input file contained not a single matrix
Warning: Matrix file is empty (file size is zero) $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.101535_d7wNZv_pssm_count_matrices.tf
Warning: Input file contained not a single matrix
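The errors and warnings above all point at output files that are missing or empty. As a debugging aid, here is a minimal sketch that lists which of those files exist for a given output prefix. The function name `check_outputs` is hypothetical, and the suffix list is only taken from the messages in this thread, so it may not cover everything matrix-from-patterns writes.

```shell
# Hypothetical helper: report which expected output files are present for a
# given -o prefix. The suffixes are copied from the error/warning messages in
# this thread; this is an assumption, not an official list.
check_outputs() {
  prefix="$1"
  status=0
  for suffix in \
    _count_matrices.tf \
    _sig_matrices_rescaled.tf \
    _sig_matrices_nr_tables/clusters.tab \
    _sig_matrices_nr_cluster_root_motifs.tf \
    _sig_sites.ft
  do
    f="${prefix}${suffix}"
    if [ ! -e "$f" ]; then
      echo "MISSING $f"; status=1      # file was never created
    elif [ ! -s "$f" ]; then
      echo "EMPTY   $f"; status=1      # file exists but has size zero
    else
      echo "OK      $f"
    fi
  done
  return "$status"
}
```

For example, `check_outputs matrix_from_dyads` run in the debug directory would show which step of the `-cluster sig` branch stopped producing files.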
On the Fungi server the program runs fine, but there are two matrices in the result.
Here are the commands that are running on the Bacteria server.
$RSAT/perl-scripts/convert-seq -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.103246_HeagcC -from fasta -to fasta -mask non-dna -o $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.103246_HeagcC.fasta
$RSAT/perl-scripts/purge-sequence -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta -format fasta -o $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged; $RSAT/perl-scripts/dyad-analysis -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged -v 1 -quick -sort -timeout 3600 -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20 -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2
dyad-analysis -i $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta.purged -v 1 -quick -sort -timeout 3600 -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20 -bg upstream-noorf -org Escherichia_coli_GCF_000005845.2_ASM584v2
pattern-assembly command: $RSAT/perl-scripts/pattern-assembly -v 1 -subst 1 -weight 5 -maxfl 1 -toppat 50 -2str -max_asmb_nb 20 -i $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.tab -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.asmb
Command to generate matrices (PSSM):
$RSAT/perl-scripts/matrix-from-patterns -v 1 -logo -seq $RSAT/public_html/tmp/apache/2021/10/13/tmp_sequence_2021-10-13.105611_c2rMWX.fasta -format fasta -asmb $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster sig -uth Pval 0.00025 -bginput -markov 0 -o $RSAT/public_html/tmp/apache/2021/10/13/dyad-analysis_2021-10-13.105611_y5nzUU_pssm
I guess the bug occurs in the last command. The strange thing is that on the Fungi server this bug does not occur. I will check this soon.
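Note that the web-server commands are not identical to the command-line reproduction at the top of this thread (for example `-2str` and `-spacing 0-20` on the web, versus `-1str` and `-spacing 0-16` on the command line). A rough sketch to list such differences between two command strings is given here. The function names are hypothetical, and the tokenisation is naive: a token counts as a flag if it starts with a dash, optional digits, then a letter.

```shell
# Hypothetical helper: print one "flag [values...]" group per line.
split_options() {
  printf '%s\n' "$1" | tr -s ' \t' '\n\n' | awk '
    /^-[0-9]*[A-Za-z]/ { if (line != "") print line; line = $0; next }
    { line = (line == "" ? $0 : line " " $0) }
    END { if (line != "") print line }'
}

# Hypothetical helper: show option groups found in only one of two commands.
diff_options() {
  a="$(mktemp)"; b="$(mktemp)"
  split_options "$1" | sort -u > "$a"
  split_options "$2" | sort -u > "$b"
  echo "only in first:";  comm -23 "$a" "$b" | sed 's/^/  /'
  echo "only in second:"; comm -13 "$a" "$b" | sed 's/^/  /'
  rm -f "$a" "$b"
}
```

Passing the command-line dyad-analysis call as the first argument and the web-server call as the second would list the strand and spacing options that differ, which helps rule them out before blaming matrix-from-patterns.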
@santanaw checked on a Virtual machine and the bug does not appear.
According to @santanaw, this bug seems related to a previous bug in matrix-clustering, which he and @jaimicore have fixed.
The next hypothesis is that the R package needs to be recompiled on the Bacteria server.
I recompiled the R packages and the bug remains. Apparently the problem is elsewhere.
@rioualen, @jvanheld, @amedina-liigh The issue arises because the RSAT Prokaryotes Web server uses the previously installed R (version 3.6); the appropriate R distribution (version 4.1) is never found. Please note that this is not the case on the command line. The installed packages were built for v4.1, which in turn leads to problems when loading them in R 3.6.
I will get back to the sys admins to find a solution.
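To confirm which R a given environment resolves, a small check like the following could be run both from a login shell and from the environment the web server uses. This is only a sketch: the function names are hypothetical, and it assumes that the first line of `R --version` has the usual form `R version X.Y.Z (...)`.

```shell
# Hypothetical helper: extract "major.minor" from the first line of `R --version`,
# e.g. "R version 4.1.2 (2021-11-01) ..." gives "4.1".
r_major_minor() {
  printf '%s\n' "$1" | sed -n 's/^R version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: report the R binary found on PATH and compare its
# major.minor version with the expected one (e.g. 4.1).
check_r_version() {
  expected="$1"
  r_bin="$(command -v R || true)"
  if [ -z "$r_bin" ]; then
    echo "R not found on PATH=$PATH"
    return 2
  fi
  found="$(r_major_minor "$("$r_bin" --version | head -n 1)")"
  echo "R binary: $r_bin (version $found, expected $expected)"
  [ "$found" = "$expected" ]
}
```

If `check_r_version 4.1` succeeds in a terminal but fails when called from the web server's environment, the two are resolving different R installations, which is consistent with the diagnosis above.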
With @rioualen , we just did a new test with the Fungi server (http://fungi.rsat.eu/).
On the web interface, the motifs are clustered correctly with the matrix-clustering option "sig" but not with the option "counts".
A new small test to debug: E.coli upstream sequences of ompR target genes
Placed on rsatix: /home/rsat/packages/rsat-2021/debug/dyad-assembly-bug_2021-11
dyad-analysis -i ompR_targets.fasta -v 1 -quick -sort -timeout 3600 -type any -2str -noov -lth occ 1 -lth occ_sig 0 -uth rank 50 -return occ,proba,rank -l 3 -spacing 0-20 -o dyads.tsv
pattern-assembly -v 1 -subst 1 -weight 5 -maxfl 1 -toppat 50 -2str -max_asmb_nb 20 -i dyads.tsv -o dyads.asmb
matrix-from-patterns -v 1 -logo -seq ompR_targets.fasta -format fasta \
-asmb dyads.asmb -min_weight 5 -flanks 2 -max_asmb_nb 20 -cluster none \
-uth Pval 0.00025 -bginput -markov 0 \
-o dyads_pssm
convert-matrix -v 1 -i dyads_pssm_count_matrices.txt t -from tab -to tab \
-return counts -return consensus -return logo -logo_format png \
-logo_opt -e -logo_opt -M \
-logo_dir logos -o dyads_
Hi,
Last year a clustering option was added to the dyad-analysis pipeline, but I couldn't make it run. It seems to fail upon executing matrix-from-patterns because of missing files.
This is the error I get using the option -cluster sig:
This is the error I get using the option -cluster counts:
The option -cluster both also fails, and the option -cluster none works normally.
I tried it both on the RSAT prokaryotes webserver and on the command line on rsatix, using the following genome for background: Escherichia_coli_GCF_000005845.2_ASM584v2
@jvanheld @jaimicore LexA_sites.fasta.txt