noncodo / BigRedButton

Pipeline for the unsupervised clustering of homologous RNA structure motifs from sequencing data.
GNU General Public License v2.0
4 stars 1 forks source link

no-SGE-dependent pipeline please #1

Open brobr opened 5 years ago

brobr commented 5 years ago

Hi, what do I need to do to adapt this pipeline to run on a linux box bypassing SGE (I have no access to a server running such a facility); would that lead to restriction for the number of RNA sequences to check? Apart from the SGE problem; some errors appeared on std.out: 1) error at the beginning: /launcher.sh: line 8: module: command not found 2) installation of required R-packages The list of required libraries inside the script mentions 'RcolorBrewer', which should be 'RColorBrewer'

noncodo commented 5 years ago

It should be fairly straightforward to run without SGE, although it may take quite some time to complete. I must admit this was on the to do list, but I never got around to implementing it.

Basically, something like this

CMD="qsub -V -terse -sync y -cwd -P ${GRID_ACCOUNT} -N fa2pp -o $BASEDIR/logs/foldit."$(date +"%d%m%y-%H%M")".out -e $BASEDIR/logs/foldit."$(date +"%d%m%y-%H%M")".err -pe smp 1 -t 1:${MAXGRIDCPU} -b y -S /bin/bash ../fa2pp.sge"

Should be changed to execute the content of ../fa2pp.sge without SGE-specific formatting. In theory, the code would become much simpler, but there is no support for concurrent multithreading. @goranivanisevic may be willing to convert the code into a non-HPC suitable pipeline?

The 'module' command can be removed as long as the appropriate binaries are in your $PATH.

brobr commented 5 years ago

Thanks, I will try...

brobr commented 5 years ago

Ok, I have been struggling along. 1) The java programs needed compiling (not mentioned in any install instructions) and I parked them in /opt/java/dotaligner/java. Thus I had to change ${DOTALNBIN%/*} ps2pp in the java command to ${DOTALNBIN%/*/*} ps2pp

Then java, when trying to run ps2pp, threw this error:

Exception in thread "main" java.io.FileNotFoundException: /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/sRNAsegments-E181931/temp/task-150219-1702 (Is a directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at java.io.FileInputStream.<init>(FileInputStream.java:93) at java.io.FileReader.<init>(FileReader.java:58) at ps2pp.main(ps2pp.java:6)

Turning the command to CMD="java -Xmx4000m -Xms1000m -classpath ${DOTALNBIN%/*/*}/java ps2pp ${BASEDIR}/temp/${SGE_TASK_ID}/*.ps"

Gave me lines as these in the fa2pp.log:

java -Xmx4000m -Xms1000m -classpath /opt/dotaligner/java ps2pp /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/sRNAsegments-E181931-2/temp/task-150219-1844/10_1081020-1083934_2914_sRNA_dp.ps /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/sRNAsegments-E181931-2/temp/task-150219-1844/10_904-2881_1977_target_dp.ps /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/sRNAsegments-E181931-2/temp/task-150219-1844/12_786499-788075_1576_target_dp.ps /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/sRNAsegments-E181931-2/temp/task-150219-1844/1_1275619-1277561_1942_sRNA_dp.ps

followed by a lot of 0s and other stuff; the log file grew to unreadable proportions..... is that necessary? Are those 0s part of the pp files? I did not find any of these files in the temp/task-folder or in the ../pp. The ps input is generated as expected ` 150219-1702 and 150219-1844 are the SGE_IDs of different runs

noncodo commented 5 years ago

There is a precompiled ps2pp.jar in the java/ folder. I realise the following command hadn't been updated, so try changing from

CMD="java -Xmx256m -Xms128m -classpath ${DOTALNBIN%/*} ps2pp ${BASEDIR}/temp/${JOBID}${SGE_TASK_ID}"

to

CMD="java -jar /relative/path/to/ps2pp.jar ./files.ps"

Or something similar. This will output a half-matrix of pairing probabilities, with a final line of unpaired probabilities, c.f.: https://github.com/noncodo/BigRedButton/blob/master/dotaligner/example/snornal140_0001.pp

I might have some spare time in the next fews days to create a 'PC' branch, without all the SGE bling.

May I also ask how many sequences you are querying? And what kind of hardware specs are you planning to run this on?

brobr commented 5 years ago

Hi, thanks for the quick feedback. Yes I figured the java -jar bit out and recompiled dotaligner so that the java stuff ends up together with DotAligner in /opt/dotaligner/bin to keep as close as possible to the original scripts but somehow no pp files are generated. And only ps files are made for the first of the splitted file-blocks (Is that because in a SGE context each of these blocks, together with the command file, gets outsourced to another computer??).

As mentioned above I can get ps output and .pp output written into the logfile (when keeping >> ${BASEDIR}/logs/fa2pp.${SGE_TASK_ID} at the end of the echo $CMD-lines). But I cannot see the use of that (the logs become massive very quickly).

I query ~250 sequences found as targets for the RNAi machinery in our organism, which are collected in the input fasta file. Basically to see whether some secondary structures could be recognition factors. I understood from the paper that your program could address this. But the absence of a description of all the steps in the pipeline makes it quite hard to follow what is supposed to be happening for a wet-scientist with a little bit of programming knowledge..

I am doing this on a 64-bit computer with a i7-7700HQ CPU @ 2.80GHz with 32Gb RAM and ~250 Gb harddisk space available

brobr commented 5 years ago

Hi, I managed to get the pp files created (this needed a 'for'- loop).

For simpler stop-starting points, the attached script tests for the presence of made files so that it skips completed stages and can directly start with the next section that is still incomplete.

The script does not use separate command-files and all the progress stuff has been removed; it made it too complicated and did not help preventing starting over from the beginning when some stages (say the RNAfold stage) had completed OK:

launcher-2.txt

Please rename attached txt to sh and chmod a+x. There is an exit 1 command so that the script runs until the dotaligner section; it would be a great help if that gets 'de-SGE-ed'.

At the moment only one of the 8 available CPUs gets used. Would there be a way to get more CPUs to work on the job ? Could this be a kind of alternative for the SGE approach?

I know that in my files there will be sequences that are fairly similar. Would it help if these gets clustered beforehand (but how to do this and how would that fit into the current pipeline)?

noncodo commented 5 years ago

Yes, I was thinking of just splitting the input into N files, and running N jobs in parallel via quick and dirty bash. But for less than 250 sequences, 1 CPU should be reasonable. Again, my time commitments are pretty invested, so I hope you're not in a hurry. Happy to help debug in the interim.

brobr commented 5 years ago

Hi, as said, could you have a look at the dotaligner section that uses the second external script 'worker.sge'; how that works together with the main script I do not understand and I find it difficult to untangle (i.e. bring that section over into the launcher.sh and still let it do what it is supposed to do). It would be a great help if that parts gets 'de-SGE-ed'. I was hoping to use the findings in a presentation in a couple of weeks.

brobr commented 5 years ago

Well, I managed to get through to the R stage (see the produced script) and then it went pear-shaped again:

.. [] Normalising scores and converting to distance matrix...[ DONE ] [] Preparing cluster analysis script...[ DONE ] [*] Launching cluster analysis... [R] loading required packages... [ DONE ] [R] importing scores... [ DONE ] [R] generating distance matrix... [R] convert to dissimilarity matrix... Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables Calls: matrix -> Summary.data.frame -> lapply -> FUN Execution halted

I added some more feedback in the R-script, and it is this section where it stops (~line 30 in the R script):

write("[R] generating distance matrix...", stderr()) write("[R] convert to dissimilarity matrix ...", stderr()) md <- matrix( nrow = max( d ), ncol=max( d ) ) md[ cbind( d , d ) ] <- as.numeric( d ) md[ cbind( d , d ) ] <- as.numeric( d )

I noticed that the file "ids.txt" in the section after that has not been created during all preceding steps. I also cannot find in your code where that could have happened (the first 'ids.txt' is found in the R-script). This absent file is required by the next R-function; around line 36:

write("[R] get unique names ...", stderr()) .. ids <- read.table("ids.txt", header=F)

ids <- sapply(strsplit(as.character(ids),"/"), "[", 2)

ids <- sapply(strsplit(as.character(ids),"\."), "[", 1)

Where in your scripts should this file have been generated...??? How is it supposed to look like?

What now?

PS below some bits of the stuff that have been generated before the pipeline crashed

scores_normalized.tsv:

1 1 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_1081020-1083934_2914_sRNA_dp.pp 0.962448 1 2 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_151465-151698_233_sRNA_dp.pp 0.150895 1 3 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_39875-40800_925_target_dp.pp 0.308495 1 4 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_453343-453710_367_target_dp.pp 0.236949

dist.tsv:

1 1 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_1081020-1083934_2914_sRNA_dp.pp 0.037552 1 2 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_151465-151698_233_sRNA_dp.pp 0.849105 1 3 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_39875-40800_925_target_dp.pp 0.691505 1 4 pp/10_1081020-1083934_2914_sRNA_dp.pp pp/10_453343-453710_367_target_dp.pp 0.763051

clustering_R.txt

brobr commented 5 years ago

The same error as above happens when trying the "RFAM_testing_sample_plusShuffled.fasta.gz" as input which generated the following dist.tsv and scores_normalized.tsv:

bash-5.0$ head dist.tsv 1 1 /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp 0.098866 1 2 /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_DQ776904_3_37_dp.pp 0.55734

bash-5.0$ head scores_normalized.tsv 1 1 /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp 0.901134 1 2 /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_AF237329_4_44_dp.pp /home/working/DATA_ANALYSIS/LIZ-sRNAseq/E181931_RVN_sRNA/CLUSTERanalysis/DOTALIGNER_bigredbutton/RFAM_testing_sample_plusShuffled/pp/RNA_RF00001_DQ776904_3_37_dp.pp 0.44266

What could makes it stall on R-3.5.3? the /'s in the tsv files? clustering-RFAM_testing_sample-R.txt

brobr commented 5 years ago

After editing 'dist.tsv' to look like:

1 1 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00001_AF237329_4_44_dp.pp 0.098866 1 2 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00001_DQ776904_3_37_dp.pp 0.55734 1 3 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00001_X02629_2_45_dp.pp 0.670251 1 4 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00001_X16851_1_dp.pp 0.580901 1 5 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00001_Z50075_5_53_dp.pp 0.599701 1 6 pp/RNA_RF00001_AF237329_4_44_dp.pp pp/RNA_RF00002_AF158725_5_74_dp.pp 0.750852

The error is still thrown; This because, as I just saw now, the $ is not exported to the R script (see attached)......it needs escaping in your launcher script: \$

clustering_R.txt

I am getting curious now; the silence, the lack of any more feedback......, Have you ever tested the published pipeline as such????

brobr commented 5 years ago

Well, it now stops after

[R] generating distance matrix... [R] convert to dissimilarity matrix ... [ DONE ]

on

[R] get unique names ... Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'ids.txt': No such file or directory Execution halted

As mentioned above.... where is this ids.txt ??

noncodo commented 5 years ago

Hi Brobr. Sorry for not being actively supporting this, very busy managing several other projects. As mentioned above, I'll get around to this soon enough. You can also get in touch with @goranivanisevic who also developed this project.

noncodo commented 5 years ago

The R code I pushed to github is quite sloppy; we were in a hurry to submit this manuscript before the special issue deadline and, unfortunately, code review took the brunt of it. Luckily there are dedicate souls like yourself to whip it back into shape :)

The "ids.txt" file should be created before launching the R script, it is clearly missing in this version. It should contain all unique sequence names, i.e. "RNA_RF00002_AF158725_5_74". running something like: cut -f 3 dist.tsv | uniq > ids.txt should do it. Stripping the '/'s and '.'s would be nice too, but I'll leave it at that for now.

noncodo commented 5 years ago

In addition to the '$' issue, there may be a type here:

printClust <- function( O ) { for (cl in 1:NumClust) { print(paste("Cluster ======================== > ",cl)) ; print(labels(D)[ O$cluster == cl ]) ; } }

I think O$cluster == cl needs to be Oc$cluster == cl

brobr commented 5 years ago

Ok, thanks for the heads up. I'll try this today.

brobr commented 5 years ago

Ok, I got my (10) clusters (and 47 for the RFAM set; is that correct?)! And I learned a bit of R along the way :-) 1) the opticsXi( optics( D,... line threw an error along the lines of opticsXi not found; After finding via DuckDuckGo that optics was part of dbscan and checking the dbscan module, made me try extractXi( optics( D,... which passed...

2) the below "write" calls in the original launcher script generate "Error in ncolumns - 1 : non-numeric argument to binary operator", possibly because of mixing a number (clusters) with text. The following lines, write("Processed ", clusters, " clusters\n") and write("[R] Processed **",clusters,"** clusters\n", stderr()) have been changed to: message <- c("[R] Processed **",clusters,"** clusters\n") writeLines(message, con=stderr(),"",) which came out ok on a bash terminal

See attached the resulting non-sge-launcher.sh as tested up to the exit line. As mentioned above, the progress counting code has been removed and instead one can step through the various sections and do the pipeline in stages without needing to start over the already completed steps (very handy when the computer suddenly crashes midway aligning stuff; restarting is less of a pain that way)

The section "Process clusters from output" still needs correction/testing; as it is, it won't do much; first step has to take into account ( cd to?) the clusters directory where the clusters as txt files have been written to in the R-stage.

non-sge-launcher_sh.txt

brobr commented 5 years ago

Attached script runs to the end and creates output that looks like that what is being aimed for; up for the maintainers to check this. Changes made to make this happen: 1) correcting 2nd part of the sed call 2) correcting command RNAfold to RNAalifold

Hope this helps anyone trying this pipeline on a powerful computer; the longer the sequences you let the pipeline run on the longer it takes to complete (if at all; locarna gave up on my test set)

full-non-sge-launcher_sh.txt

brobr commented 5 years ago

I asked Sebastian Will for some advice on how to use LocARNA with larger fragments, which resulted in this possible adaption to the script when one installs the latest version of LocARNA (at the moment of writing 2.0.0RC7:)

$MLOCARNABIN --plfold-span 120 --threads=$THRS --probabilistic \ --consistency-transformation --iterations=2 ${file%*.txt}.fasta

If there still is a memory problem he suggests to drop the --iterations=2 option completely

BTW he says:

PS: there is a typo 'mLorcaRNA' in the README https://github.com/noncodo/BigRedButton/blob/master/bigredbutton/README.md Maybe you want to tell them.

PS I won't close the issue as this whole page suddenly was no longer publicly visible. Which would make any sharing efforts futile.