gkudla / hyb

hyb: a bioinformatics pipeline for the analysis of CLASH (crosslinking, ligation and sequencing of hybrids) data

GNU General Public License v3.0

hyb analyse fails to run #8

Open SreeniEadara opened 2 years ago

SreeniEadara commented 2 years ago

Hi,

I'm trying to run hyb on the example data using macOS on a 2018 MacBook Air.

I've installed all dependencies besides flexbar 2.5 using Conda (edit: flexbar 2.5 was installed manually). My list of installed packages is as follows:

blast                     2.6.0               boost1.64_2    bioconda
blat                      35                            1    bioconda
bowtie2                   2.4.5            py39he245752_2    bioconda
bzip2                     1.0.8                h0d85af4_4    conda-forge
ca-certificates           2022.6.15            h033912b_0    conda-forge
certifi                   2022.6.15        py39h6e9494a_0    conda-forge
expat                     2.4.8                h96cf925_0    conda-forge
fastqc                    0.11.9               hdfd78af_1    bioconda
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
fontconfig                2.14.0               h676cef8_0    conda-forge
freetype                  2.12.1               h3f81eb7_0    conda-forge
libcxx                    14.0.6               hce7ea42_0    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libpng                    1.6.37               h5481273_4    conda-forge
libsqlite                 3.39.2               h5a3d3bf_1    conda-forge
libzlib                   1.2.12               hfe4f2af_2    conda-forge
lz4-c                     1.9.3                he49afe7_1    conda-forge
ncurses                   6.3                  h96cf925_1    conda-forge
oligoarrayaux             3.8                  h770b8ee_0    bioconda
openjdk                   17.0.3               hfa58983_1    conda-forge
openssl                   3.0.5                hb81d4ab_1    conda-forge
perl                      5.32.1          2_h0d85af4_perl5    conda-forge
pip                       22.2.2             pyhd8ed1ab_0    conda-forge
python                    3.9.13          hf8d34f4_0_cpython    conda-forge
python_abi                3.9                      2_cp39    conda-forge
readline                  8.1.2                h3899abd_0    conda-forge
setuptools                65.0.1           py39h6e9494a_0    conda-forge
sqlite                    3.39.2               hd9f0692_1    conda-forge
tbb                       2020.2               h940c156_4    conda-forge
tk                        8.6.12               h5dbffcc_0    conda-forge
tzdata                    2022c                h191b570_0    conda-forge
viennarna                 2.1.9                         0    bioconda
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h775f41a_0    conda-forge
zlib                      1.2.12               hfe4f2af_2    conda-forge
zstd                      1.5.2                hb844be6_4    conda-forge

I've also configured my Conda environment to set a few useful paths on activation as follows. The paths are all unset prior to deactivation:

export DYLD_LIBRARY_PATH=$CONDA_PREFIX/flexbin/
export HYB_DB=$CONDA_PREFIX/data/db
export HYB_HOME=$CONDA_PREFIX

I've also configured sra-tools and changed the shebang line at the top of sam2blast to #!/usr/bin/env python3 so that it works on macOS. All of hyb's source files, including the scripts in bin, the entry in man, data, and lib, have been moved to the corresponding folders under the Conda environment's prefix so that they are accessible upon activation. I also used make all to build the included hOH7 database.

I am able to run all steps of the pipeline, including preprocess, check, and detect without error. Upon trying to run hyb analyse, however, I am met with the following output and am not sure what is causing this problem:

hyb: Tue Aug 16 15:47:38 EDT 2022
analyse
in=testdata.txt id=testdata format=fastq code= miss=0 qc=flexbar qual=33 link=TGGAATTCTCGGGTGCCAAGGC min=4 len=17 trim=0 filt=0 pc=0 align=bowtie2 db=hOH7 word=11 eval=0.1 ref= anti=0 type=all fold=UNAfold pref=mim hval=0.1 hmax=10 gmax=4

/usr/local/Caskroom/miniconda/base/envs/hyb/bin/hyb2fasta_bits_allRNAs.awk /usr/local/Caskroom/miniconda/base/envs/hyb/data/db/hOH7.tab testdata_comp_hOH7_hybrids_ua.hyb
/usr/local/Caskroom/miniconda/base/envs/hyb/bin/hybrid-min testdata_comp_hOH7_hybrids_ua.bit_1.fasta testdata_comp_hOH7_hybrids_ua.bit_2.fasta 2>&1 > /dev/null
testdata_comp_hOH7_hybrids_ua.bit_1.fasta: No such file or directory
make: *** [testdata_comp_hOH7_hybrids_ua.bit_1.fasta-comp_hOH7_hybrids_ua.bit_2.fasta.ct] Error 1

Could you please help me understand what is causing this problem?

Thanks!

Sincerely, Sreenivas

tony-travis commented 2 years ago

Hi, Sreenivas.

Great job creating a "hyb" env in Bioconda/Anaconda! It would be nice to add that to our GitHub repo when you've got it tested and working.

The missing file requires "flexbar" to run, but there is a bug in "hyb" caused by a change in the meaning of the "flexbar -f" parameter: it previously specified the quality format, e.g. "-f sanger", but it now means produce FASTA output.

I'll fix this on GitHub along with your fixes for "python", which is also a problem on Ubuntu 20.04 LTS because "python" is deprecated.

I've attached a patch for "hyb" that I'm now testing...

HTH,

Tony.

SreeniEadara commented 2 years ago

Hi Tony,

Awesome! Happy to hear from you. I can definitely open a pull request containing the Conda env setup once it has been validated.

I think attachments from email replies may not make it onto GitHub Issues. Would you be able to add the patch in a development branch on this repository?

Thanks for your help!

Sincerely, Sreenivas

edit: removed my email, don't want it to be found by bots :)

tony-travis commented 2 years ago

Hi, Sreenivas.

I'll commit my changes as soon as I've finished testing: I noticed a couple of dependency problems and I changed the way the INSTALL script runs. As you probably know, we developed "hyb" under Bio-Linux 8, but that distro is now obsolete. I'm testing it under Ubuntu 20.04 LTS.

Bye,

Tony.

SreeniEadara commented 2 years ago

Hi Tony,

I was able to run hyb analyse on testdata.txt and didn't encounter any errors! Would you be able to send me the expected output so I can compare it against what I have?

I ended up using WSL to install Ubuntu 20.04 LTS and followed all installation steps - further debugging didn't work on macOS when using the Conda environment. One thing to note is that I had to install manually. I used git to clone the repository, and upon running INSTALL, it found the existing files and cleared them, but subsequently failed to get the files for hyb.

Running hyb on my own data produced the following error:

sreenieadara@DESKTOP:/mnt/d/hyb/SRR959751$ hyb preprocess qc=flexbar trim=30 len=17 min=4 check detect align=bowtie2 word=11 analyse fold=vienna in=SRR959751.fastq.gz db=hOH7
hyb: Fri Aug 19 18:31:41 PDT 2022
preprocess check detect analyse
in=SRR959751.fastq.gz id=SRR959751 format=fastq code= miss=0 qc=flexbar qual=33 link=TGGAATTCTCGGGTGCCAAGGC min=4 len=17 trim=30 filt=0 pc=0 align=bowtie2 db=hOH7 word=11 eval=0.1 ref= anti=0 type=all fold=vienna pref=mim hval=0.1 hmax=10 gmax=4
gunzip -c SRR959751.fastq.gz > SRR959751.fastq
/usr/bin/flexbar -t SRR959751_clipped_qf -r SRR959751.fastq -q 30 -as TGGAATTCTCGGGTGCCAAGGC -ao 4 -u 3 -m 17 -n 1
flexbar: the given value '30' is not in the list of allowed values [TAIL, WIN, BWA]

Available on github.com/seqan/flexbar

make: *** [/home/sreenieadara/hyb/bin/hyb:1029: SRR959751_clipped_qf.fastq] Error 1

It looks like the -q parameter may not be the correct one to use in this case. I've changed it to -qt within bin/hyb and it is currently running. I will see if this works!
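For reference, the flexbar call from the log above can be sketched in Python as follows. This is only an illustration, not code from hyb itself: the helper name `flexbar_command` and its parameters are hypothetical, and the mapping from hyb options (trim, link, min, len) to flexbar flags is read off the log lines in this thread rather than taken from the flexbar documentation.

```python
def flexbar_command(sample_id, fastq, trim, link, min_overlap, min_len,
                    threshold_flag="-qt"):
    """Assemble the flexbar argument list in the shape printed in the hyb log.

    threshold_flag is "-qt" for the flexbar build that rejected "-q 30";
    passing "-q" reproduces the call that hyb originally emitted.
    """
    return [
        "flexbar",
        "-t", f"{sample_id}_clipped_qf",   # output prefix used by hyb
        "-r", fastq,                       # input reads
        threshold_flag, str(trim),         # quality trimming threshold (trim=30)
        "-as", link,                       # 3' adapter sequence (link=...)
        "-ao", str(min_overlap),           # min=4 in the hyb parameter line
        "-u", "3",                         # constant in the logged command
        "-m", str(min_len),                # len=17 in the hyb parameter line
        "-n", "1",                         # constant in the logged command
    ]


cmd = flexbar_command("SRR959751", "SRR959751.fastq", trim=30,
                      link="TGGAATTCTCGGGTGCCAAGGC", min_overlap=4, min_len=17)
print(" ".join(cmd))
```

Switching `threshold_flag` between "-q" and "-qt" is the only difference between the failing and the working command shown here.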

SreeniEadara commented 2 years ago

Hi Tony,

Unfortunately, the analysis is frozen at one step (over 20 hours without a change). Could you please let me know if this is expected or unexpected behavior? I am running the following on a fastq.gz of SRR959751 received via fastq-dump.

This is running in Ubuntu 20.04 LTS.

sreenieadara@DESKTOP:/mnt/d/hyb/SRR959751$ hyb preprocess qc=flexbar trim=30 len=17 min=4 check detect align=bowtie2 word=11 analyse fold=vienna in=SRR959751.fastq.gz db=hOH7
hyb: Fri Aug 19 19:07:52 PDT 2022
preprocess check detect analyse
in=SRR959751.fastq.gz id=SRR959751 format=fastq code= miss=0 qc=flexbar qual=33 link=TGGAATTCTCGGGTGCCAAGGC min=4 len=17 trim=30 filt=0 pc=0 align=bowtie2 db=hOH7 word=11 eval=0.1 ref= anti=0 type=all fold=vienna pref=mim hval=0.1 hmax=10 gmax=4
/usr/bin/flexbar -t SRR959751_clipped_qf -r SRR959751.fastq -qt 30 -as TGGAATTCTCGGGTGCCAAGGC -ao 4 -u 3 -m 17 -n 1
/home/sreenieadara/hyb/bin/solexa2fasta.awk SRR959751_clipped_qf.fastq | /home/sreenieadara/hyb/bin/fasta2tab.awk > SRR959751_clipped_qf.tab
/home/sreenieadara/hyb/bin/make_comp_fasta.pl SRR959751_clipped_qf.tab > SRR959751_comp.fasta
/usr/bin/fastqc -q -k 8 --noextract --contaminants /home/sreenieadara/hyb/data/fastqc/Contaminants SRR959751_clipped_qf.fastq
awk '{if(NR%4==2) print length($1)}' SRR959751_clipped_qf.fastq | /home/sreenieadara/hyb/bin/histogram.pl -n > SRR959751_clipped_qf.hist

Thanks!

Sincerely, Sreenivas

tony-travis commented 2 years ago

Hi, Sreenivas

I ran it in less than an hour on my laptop "beluga" (Intel core-i5 + 16 GiB RAM + 500GB SSD):

@.***:~/Desktop/hyb$ time hyb preprocess qc=flexbar trim=30 len=17 min=4 check detect align=bowtie2 word=11 analyse fold=vienna in=SRR959751.fastq.gz db=hOH7 |& tee hyb.log
hyb: Mon 22 Aug 08:17:58 BST 2022
preprocess check detect analyse
in=SRR959751.fastq.gz id=SRR959751 format=fastq code= miss=0 qc=flexbar qual=33 link=TGGAATTCTCGGGTGCCAAGGC min=4 len=17 trim=30 filt=0 pc=0 align=bowtie2 db=hOH7 word=11 eval=0.1 ref= anti=0 type=all fold=vienna pref=mim hval=0.1 hmax=10 gmax=4
gunzip -c SRR959751.fastq.gz > SRR959751.fastq
/usr/bin/flexbar -t SRR959751_clipped_qf -r SRR959751.fastq -qt 30 -as TGGAATTCTCGGGTGCCAAGGC -ao 4 -u 3 -m 17 -n 1
/usr/local/hyb/bin/solexa2fasta.awk SRR959751_clipped_qf.fastq | /usr/local/hyb/bin/fasta2tab.awk > SRR959751_clipped_qf.tab
/usr/local/hyb/bin/make_comp_fasta.pl SRR959751_clipped_qf.tab > SRR959751_comp.fasta
/usr/bin/fastqc -q -k 8 --noextract --contaminants /usr/local/hyb/data/fastqc/Contaminants SRR959751_clipped_qf.fastq
awk '{if(NR%4==2) print length($1)}' SRR959751_clipped_qf.fastq | /usr/local/hyb/bin/histogram.pl -n > SRR959751_clipped_qf.hist
/usr/local/hyb/bin/fasta2tab.awk SRR959751_comp.fasta | awk '{print (length($2))}' | /usr/local/hyb/bin/histogram.pl -n > SRR959751_comp.hist
/usr/bin/bowtie2 -D 20 -R 3 -N 0 -L 16 -k 20 --local -i S,1,0.50 --score-min L,18,0 --ma 1 --np 0 --mp 2,2 --rdg 5,1 --rfg 5,1 -p 1 -x /usr/local/hyb/data/db/hOH7 -f SRR959751_comp.fasta > ./$$.sam 2> SRR959751_comp_hOH7.blast.err; \
sam2blast ./$$.sam > SRR959751_comp_hOH7.blast; \
rm ./$$.sam
rm SRR959751_comp_hOH7.blast.err
/usr/local/hyb/bin/mtophits_blast SRR959751_comp_hOH7.blast > SRR959751_comp_hOH7_mtophits.blast
/usr/local/hyb/bin/create_reference_file.pl SRR959751_comp_hOH7_mtophits.blast > SRR959751_comp_hOH7_mtophits.ref
/usr/local/hyb/bin/remove_duplicate_hits_blast.pl SRR959751_comp_hOH7_mtophits.ref SRR959751_comp_hOH7_mtophits.blast > SRR959751_comp_hOH7_singleE.blast
/usr/local/hyb/bin/blast_stats SRR959751_comp_hOH7_singleE.blast > SRR959751_comp_hOH7_singleE.blast_stats.txt
/usr/local/hyb/bin/get_mtop_hybrids.pl BLAST_THRESHOLD=0.1 MODE=2 MAX_OVERLAP=4 MAX_HITS_PER_SEQUENCE=10 OUTPUT_FORMAT=HYB SRR959751_comp_hOH7.blast > SRR959751_TEMP_FILE1_TXT
/usr/local/hyb/bin/getseqs SRR959751_TEMP_FILE1_TXT SRR959751_comp.fasta > SRR959751_comp_hOH7_hybrids.fasta
/usr/local/hyb/bin/fasta2tab.awk SRR959751_comp_hOH7_hybrids.fasta > SRR959751_TEMP_FILE1_TAB
/usr/local/hyb/bin/txt2hyb.awk SRR959751_TEMP_FILE1_TAB SRR959751_TEMP_FILE1_TXT > SRR959751_comp_hOH7_hybrids.hyb
/usr/local/hyb/bin/remove_duplicate_hybrids_hOH5.pl PREFER_MIM=1 SRR959751_comp_hOH7_mtophits.ref SRR959751_comp_hOH7_hybrids.hyb > SRR959751_comp_hOH7_hybrids_ua.hyb
/usr/local/hyb/bin/hyb2fasta_bits_allRNAs.awk /usr/local/hyb/data/db/hOH7.tab SRR959751_comp_hOH7_hybrids_ua.hyb
paste SRR959751_comp_hOH7_hybrids_ua.bit_1.fasta SRR959751_comp_hOH7_hybrids_ua.bit_2.fasta | awk 'NR%2==1{print $1"-"$2}; NR%2==0{print $1"&"$2}'|sed 's/->/-/g' > SRR959751_comp_hOH7_hybrids_ua.merged
/usr/bin/RNAup --interaction_pairwise -o -w 20 < SRR959751_comp_hOH7_hybrids_ua.merged > SRR959751_comp_hOH7_hybrids_ua.rnaup 2> /dev/null
/usr/local/hyb/bin/make_vienna SRR959751_comp_hOH7_hybrids_ua.rnaup SRR959751_comp_hOH7_hybrids_ua.merged > SRR959751_comp_hOH7_hybrids_ua.vienna
/usr/local/hyb/bin/add_dG_hyb.pl SRR959751_comp_hOH7_hybrids_ua.hyb SRR959751_comp_hOH7_hybrids_ua.vienna >SRR959751_comp_hOH7_hybrids_ua_dg.hyb
/usr/local/hyb/bin/combine_hyb_merge TWO_WAY_MERGE=1 PRINT_SEQ_IDS=1 SRR959751_comp_hOH7_hybrids_ua_dg.hyb > SRR959751_comp_hOH7_hybrids_ua_merged.hyb
/usr/local/hyb/bin/make_nicer_vienna_hOH5.awk SRR959751_comp_hOH7_hybrids_ua.vienna > SRR959751_comp_hOH7_hybrids_ua.viennad
/usr/local/hyb/bin/hybrid_stats SRR959751_comp_hOH7_hybrids_ua_dg.hyb > SRR959751_comp_hOH7_hybrids.hyb_stats.txt
rm SRR959751_comp_hOH7_hybrids_ua.rnaup SRR959751_comp_hOH7_hybrids_ua.merged SRR959751_TEMP_FILE1_TXT SRR959751_TEMP_FILE1_TAB

real 56m47.363s
user 56m48.892s
sys 2m3.663s

There is a bug in "make_vienna" when running it under Python3:

@.***:/home/ajt/src/hyb/bin# git diff make_vienna
diff --git a/bin/make_vienna b/bin/make_vienna
index 959de60..4710839 100755
--- a/bin/make_vienna
+++ b/bin/make_vienna
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-#@(#)make_vienna 2022-08-17 last modified by A.J.Travis
+#@(#)make_vienna 2022-08-22 last modified by A.J.Travis
 """
 Take fasta file (with '&' separating the sequences) and output from RNAup
 of the vienna package, and poduce the vienna format expected
@@ -53,13 +53,13 @@ def main(rnaup_file, fasta_file):
         if line.startswith(">"):
             name, count = line, 0
         elif count == 1:
-            print name.lstrip('>')
+            print(name.lstrip('>'))
             seq = seqs[name]
             seq_split = seq.split('&')
             len1 = len(seq_split[0])
             len2 = len(seq_split[1])
-            print seq.replace('&','')
-            print format_brackets(line, len1, len2)
+            print(seq.replace('&',''))
+            print(format_brackets(line, len1, len2))
         else:
             pass
         count += 1

You also need to install the RNA 'Vienna' package:

wget https://www.tbi.univie.ac.at/RNA/download/ubuntu/ubuntu_20_04/viennarna_2.5.1-1_amd64.deb
sudo gdebi viennarna_2.5.1-1_amd64.deb

Let me know how you get on.

Tony.

SreeniEadara commented 2 years ago

Hi Tony,

Looks like the bug fix for Vienna worked! I have the vienna package as well as the python3, python, and perl bindings installed. Not sure if those were necessary or not.

Here are the first 10 lines of the result file SRR959751_comp_hOH7_hybrids_ua_dg.hyb:

1215_2879   AAGAGGGACGGCCGGGGGCATTCGTATTGCTCCCTGGTGGTCTAGTGGTTAGGAT -16.60  ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  1   33  919 951 3.4e-08 ENSG_ENST_chr1-trna116-GluCTC_tRNA  31  55  1   25  2e-05   
1577_2209   AAGAGGGACGGCCGGGGGCTATTGCACTTGTCCCGGCCTGT   -17.68  ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  1   19  919 937 0.023   MIMAT0000092_MirBase_miR-92a_microRNA   20  41  1   22  0.0005  
2046_1671   AGAGGGACAAGTGGCGTTCTATTGCACTTGTCCCGGCCTGT   -18.99  ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  1   19  1446    1464    0.023   MIMAT0000092_MirBase_miR-92a_microRNA   20  41  1   22  0.0005  
3050_1082   ACTGCATTATGAGCACTTAAAGTTAAAGTGCTTATAGTGCAGGTAG  -24.37  MIMAT0004493_MirBase_miR-20a*_microRNA  1   22  1   22  0.00066 MIMAT0000075_MirBase_miR-20a_microRNA   24  46  1   23  0.00018 
3068_1076   GGAAGATAACTATACAACCTACTGCCTTCCTGAGGTAGTAGGTTGTGTGGTTTCA -30.53  MIMAT0004482_MirBase_let-7b*_microRNA   10  30  1   21  0.0034  MIMAT0000063_MirBase_let-7b_microRNA    31  52  1   22  0.00094 
3532_922    AAGAGGGACGGCCGGGGGCATTCGTATTGCTCCCTGTGGTCTAGTGGTTAGGATT -9.76   ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  1   33  919 951 3.4e-08 ENSG_ENST_chr1-trna64-GluTTC_tRNA   32  53  1   22  0.00094 
3746_872    GCCCCTGGGCCTATCCTAGAACTTTGGGTTCCGGGGGGAGTATGGTTGC   -17.15  MIMAT0000760_MirBase_miR-331-3p_microRNA    1   21  1   21  0.0027  ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  22  49  1153    1180    3.5e-07 
4016_814    AGAGGGACAAGTGGCGTTTATTGCACTTGTCCCGGCCTGT    -18.99  ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  1   18  1446    1463    0.079   MIMAT0000092_MirBase_miR-92a_microRNA   19  40  1   22  0.00047 
4521_718    CGGAAGATAACTATACAACCTACTGCCTTCCTGAGGTAGTAGGTTGTGTGGTTTC -30.53  MIMAT0004482_MirBase_let-7b*_microRNA   11  31  1   21  0.0034  MIMAT0000063_MirBase_let-7b_microRNA    32  53  1   22  0.00094 
4766_680    TCCCTGAGACCCTAACTTGTGAGTGATGGGGATCGGGGATTGC -19.82  MIMAT0000423_MirBase_miR-125b_microRNA  1   22  1   22  0.00056 ENSG000000XXXXX_NR003286-2_RN18S1_rRNA  23  43  1598    1618    0.002

How does this compare to the result you received?

Also, one additional question - say a miRNA is listed first, and an mRNA is listed second in a single row. Does that mean that the chimera was a miRNA-first chimera, or are they ordered differently (e.g. in alphabetical order)?

gkudla commented 2 years ago

Hi Sreenivas,

It seems that your result is very similar to ours (there are small differences which seem related to 3' adapter truncation settings).

If a miRNA is listed first, this indicates a miRNA-first chimera (the coordinates of 1st and 2nd arms in each read are in columns 5-6 and 11-12, respectively).
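To make the column layout concrete, here is a minimal Python sketch that splits one line of a _dg.hyb file into its two arms. Only the meaning of columns 5-6 and 11-12 (read coordinates of the first and second arm) is stated in this thread. Treating column 3 as the folding energy, columns 7-8 and 13-14 as reference coordinates, and columns 9 and 15 as e-values is an assumption based on the example rows above, and the names `Arm` and `parse_hyb_line` are hypothetical.

```python
from typing import NamedTuple


class Arm(NamedTuple):
    name: str          # reference name, e.g. MIMAT0000092_MirBase_miR-92a_microRNA
    read_start: int    # columns 5 / 11: start of this arm within the read
    read_end: int      # columns 6 / 12: end of this arm within the read
    ref_start: int     # assumed: start within the reference sequence
    ref_end: int       # assumed: end within the reference sequence
    evalue: float      # assumed: alignment e-value


def parse_hyb_line(line):
    """Split one whitespace-separated .hyb row into (id, seq, dG, arm1, arm2)."""
    f = line.split()
    arm1 = Arm(f[3], int(f[4]), int(f[5]), int(f[6]), int(f[7]), float(f[8]))
    arm2 = Arm(f[9], int(f[10]), int(f[11]), int(f[12]), int(f[13]), float(f[14]))
    return f[0], f[1], float(f[2]), arm1, arm2


row = ("1577_2209 AAGAGGGACGGCCGGGGGCTATTGCACTTGTCCCGGCCTGT -17.68 "
       "ENSG000000XXXXX_NR003286-2_RN18S1_rRNA 1 19 919 937 0.023 "
       "MIMAT0000092_MirBase_miR-92a_microRNA 20 41 1 22 0.0005")
read_id, seq, dg, arm1, arm2 = parse_hyb_line(row)
# The arm that starts at position 1 of the read is the 5' (first) arm.
print(arm1.name, arm1.read_start, arm1.read_end)
print(arm2.name, arm2.read_start, arm2.read_end)
```

In this row the rRNA fragment covers read positions 1-19 and miR-92a covers 20-41, so this would be an rRNA-first chimera.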

best wishes, Greg

SreeniEadara commented 2 years ago

Hi Greg,

Awesome! Glad to hear that the results file is similar, and good to know that the list order indicates order in the chimera.

I'm a bit confused about how to make the required databases to analyze using a different reference genome. I am able to rename target filenames in the Makefile and use 'make all' to make hg38.fasta.gz (human genome) as well as the provided hOH7-microRNA.fasta.gz, but running hyb afterwards produces a result containing only hits between genomic loci.

Renaming hOH7-microRNA.fasta.gz to hg38-microRNA.fasta.gz, modifying the Makefile accordingly, and remaking the database produced the same result.

How would you recommend I set up the files before building the database? I am also trying to rename both files to start with hOH7 and I will see how it goes. Is there something here that I might be missing?

gkudla commented 2 years ago

Hi,

Assuming you have a fasta file "input.fasta" with the sequences you want to use as your database, type this to produce the mapping database:

make_hyb_db input.fasta

You can then run hyb with the command:

HYB_DB=path/to/hyb/db hyb analyse in=data.fastq db=input

I recommend that the database contains transcripts with names formatted as in the hOH7 file distributed with hyb, but hyb should also work with a database composed of genomic or other sequences.
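As a rough aid for checking sequence names before building a custom database, the sketch below splits a name into four underscore-separated parts. The layout (two identifier fields, a gene name, and a biotype as the last field) is an assumption inferred from the names visible in the example output earlier in this thread, such as MIMAT0000092_MirBase_miR-92a_microRNA. It is not a documented specification, and `split_hyb_name` is a hypothetical helper.

```python
def split_hyb_name(name):
    """Return (id1, id2, gene_name, biotype), or None if the name has fewer than four fields.

    Assumed layout: <id1>_<id2>_<gene name>_<biotype>. Any extra underscores
    are kept inside the gene name.
    """
    parts = name.lstrip(">").split("_")
    if len(parts) < 4:
        return None
    return parts[0], parts[1], "_".join(parts[2:-1]), parts[-1]


for header in (">MIMAT0000092_MirBase_miR-92a_microRNA",
               ">ENSG_ENST_chr1-trna116-GluCTC_tRNA",
               ">chr1"):
    print(header, "->", split_hyb_name(header))
```

A plain genomic header such as ">chr1" returns None here, which only means that it does not follow the assumed naming pattern. As noted above, hyb should still run with such a database.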

Greg

SreeniEadara commented 2 years ago

Hi Greg,

Thanks! I think I understand the process for building the databases a bit better now.

Also wanted to add that in Ubuntu 20.04 LTS within Windows Subsystem for Linux, the following line worked a bit better for BLAT installation within the INSTALL script: make MACHTYPE=$MACHTYPE

SreeniEadara commented 2 years ago

Hi Greg,

I'm running into issues trying to use INSTALL on new Ubuntu 20.04 installations. I am able to get Hyb to work, but this involved building BLAT from source and making the default databases using 'make all'. I believe this is because rsync isn't a default package on 20.04 LTS, so after the directory is cleared the latest source isn't received. A modified INSTALL script worked better:

#!/bin/bash
#@(#)INSTALL  2022-08-22  A.J.Travis

#
# Install "hyb" under Ubuntu 20.04 LTS
#

# GitHub repository
export GITHUB=https://github.com/gkudla/hyb

# installation directory
if [ $USER == root ]; then
    export HYB_HOME=/usr/local/hyb
else
    export HYB_HOME=${HOME}/hyb
fi

# set PATH for "hyb" test run
export PATH=${HYB_HOME}/bin:$PATH
echo "Please add ${HYB_HOME}/bin to your PATH after running the INSTALL script"
echo "(press any key to continue...)"
read -n 1 key; echo

# download directory must be writeable
dir=$(pwd)
if [ ! -w ${dir} ]; then
    echo "$0: can't write to ${dir}"
    exit 1
fi

# check if "hyb" is already installed
if [ -e ${HYB_HOME} ]; then
    echo "$0: ${HYB_HOME} already exists - replace it?"
    read -n 1 key; echo
    if [ "$key" == "y" ]; then
        echo "alright . . . "
    else
        echo "$0: installation cancelled"
        exit 1
    fi
fi

# libpng-dev (required to compile BLAT)
if [ ! -r "/usr/include/libpng16/png.h" ]; then
    if [ $USER == root ]; then
        apt install libpng-dev
    else
        echo "$0: install libpng-dev to test hyb"
        exit 1
    fi
fi

# download and compile BLAT
wget -nc http://users.soe.ucsc.edu/~kent/src/blatSrc35.zip
unzip blatSrc35.zip
export MACHTYPE=$(arch)
mkdir -p ${HOME}/bin/${MACHTYPE}
cd blatSrc
make MACHTYPE=$MACHTYPE

# move to BLAT installation directory
if [ $USER == root ]; then
    mv -i ${HOME}/bin/${MACHTYPE}/* /usr/local/bin/
else
    export PATH=${HOME}/bin/${MACHTYPE}:${PATH}
fi

# build databases
cd ${HYB_HOME}/data/db
make

# Flexbar
if [ ! -x "$(which flexbar)" ]; then
    if [ $USER == root ]; then
        apt install flexbar
    else
        echo "$0: install flexbar to test hyb"
        exit 1
    fi
fi

# bowtie2
if [ ! -x "$(which bowtie2)" ]; then
    if [ $USER == root ]; then
        apt install bowtie2
    else
        echo "$0: install bowtie2 to test hyb"
        exit 1
    fi
fi

# UNAfold
if [ ! -x "$(which hybrid-min)" ]; then
    if [ $USER == root ]; then
        wget http://www.unafold.org/download/oligoarrayaux-3.8.tar.bz2
        tar xf oligoarrayaux-3.8.tar.bz2
        cd oligoarrayaux-3.8
        make install
    else
        echo "$0: install bio-linux-oligoarrayaux to test hyb"
        exit 1
    fi
fi

# Vienna RNA
if [ ! -x "$(which RNAfold)" ]; then
    if [ $USER == root ]; then
        wget https://www.tbi.univie.ac.at/RNA/download/ubuntu/ubuntu_20_04/viennarna_2.5.1-1_amd64.deb
        gdebi viennarna_2.5.1-1_amd64.deb
    else
        echo "$0: install Vienna RNA to test hyb"
        exit 1
    fi
fi

# test
cd ${HYB_HOME}/data/fastq
hyb analyse in=testdata.txt db=hOH7

# finished
exit 0

It seems there may be a decent number of packages that have to be installed outside of the INSTALL script, including rsync, wget, make, and unzip. The steps I followed during installation are here:

Ubuntu 20.04 LTS can be installed on Windows by running the following command in PowerShell (started as an administrator):

wsl --install -d Ubuntu-20.04

Upon restart, an empty Linux shell will appear. You may need to press Enter to continue the installation. Hyb was installed as follows on Ubuntu 20.04 LTS. First, hyb source is cloned from GitHub:

git clone https://github.com/gkudla/hyb.git

Dependencies available on apt are installed:

sudo apt update
sudo apt install wget libpng-dev flexbar bowtie2 make gcc unzip ncbi-blast+ fastqc gdebi-core rnahybrid rsync

Package oligoarrayaux version 3.8 is installed as follows:

wget http://www.unafold.org/download/oligoarrayaux-3.8.tar.gz
gunzip oligoarrayaux-3.8.tar.gz
tar -xvf oligoarrayaux-3.8.tar
cd oligoarrayaux-3.8
./configure
make
make check
sudo make install
make clean

The SRA (Sequence Read Archive) tools must be downloaded and unzipped:

wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -vxzf sratoolkit.tar.gz

In order for the SRA tools to work, they must be added to the PATH. The PATH may reset with every new session.

export PATH=$PATH:$PWD/sratoolkit.3.0.0-ubuntu64/bin

The SRA tools must then be configured. This only needs to be performed once. Running the following command will launch the interactive SRA tools configuration utility. Under the “Cache” tab, the directory for local file caching should be set to an empty directory.

vdb-config -i

The viennaRNA package should then be installed:

wget https://www.tbi.univie.ac.at/RNA/download/ubuntu/ubuntu_20_04/viennarna_2.5.1-1_amd64.deb
wget https://www.tbi.univie.ac.at/RNA/download/ubuntu/ubuntu_20_04/python3-rna_2.5.1-1_amd64.deb
wget https://www.tbi.univie.ac.at/RNA/download/ubuntu/ubuntu_20_04/perl-rna_2.5.1-1_amd64.deb
sudo gdebi viennarna_2.5.1-1_amd64.deb
sudo gdebi python3-rna_2.5.1-1_amd64.deb
sudo gdebi perl-rna_2.5.1-1_amd64.deb

BLAT is installed as follows:

wget -nc http://users.soe.ucsc.edu/~kent/src/blatSrc35.zip
unzip blatSrc35.zip
export MACHTYPE=$(arch)
mkdir -p ${HOME}/bin/${MACHTYPE}
cd blatSrc
make MACHTYPE=$MACHTYPE
sudo mv -i ${HOME}/bin/${MACHTYPE}/* /usr/local/bin

Hyb includes a human transcriptome and miRNA database (hOH7) by default. Databases can be built as follows:

cd data/db
make all

You can test that Hyb was installed correctly with the following:

cd ..
cd fastq
hyb analyse in=testdata.txt db=hOH7

You can check the resulting .hyb files to verify that Hyb was successfully installed. There should be four, ending as follows:

“_hybrids.hyb”
“_hybrids_ua.hyb”
“_hybrids_ua_dg.hyb”
“_hybrids_ua_merged.hyb”
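The check for the four output files can be scripted. The sketch below is a hypothetical helper, not part of hyb: it assumes the files are named <id>_comp_<db> followed by one of the four endings above, which matches the file names in the logs in this thread (for example testdata_comp_hOH7_hybrids_ua.hyb).

```python
from pathlib import Path

EXPECTED_SUFFIXES = (
    "_hybrids.hyb",
    "_hybrids_ua.hyb",
    "_hybrids_ua_dg.hyb",
    "_hybrids_ua_merged.hyb",
)


def missing_hyb_outputs(directory, sample_id, db):
    """List the expected .hyb output files that are absent from directory."""
    prefix = f"{sample_id}_comp_{db}"   # assumed naming, as seen in the logs
    return [prefix + suffix
            for suffix in EXPECTED_SUFFIXES
            if not (Path(directory) / (prefix + suffix)).is_file()]


# Run from data/fastq after "hyb analyse in=testdata.txt db=hOH7".
print(missing_hyb_outputs(".", "testdata", "hOH7"))
```

An empty list means all four files were found. Comparing the file contents against a known-good run would still have to be done separately.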

This procedure works well for me but may not be ideal for all users. Do you think you could post these instructions or modify the INSTALL script so that it works better on 20.04 LTS? Please let me know if there is something I am missing and INSTALL should be working normally. If you would like, I can also open a pull request to update the README with these instructions.

tony-travis commented 2 years ago

Hi, SreeniEadara.

We developed "hyb" under "Bio-Linux", where most of the dependencies were already installed. I'll modify the INSTALL script to check that all the dependencies on your list are installed before it tries to install "hyb" and the other dependencies that are not in the Ubuntu repositories.
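A dependency pre-check of this kind could look roughly like the following Python sketch. It is illustrative only and not the actual INSTALL logic: the tool list is collected from the commands mentioned in this thread, and `missing_tools` is a hypothetical helper built on the standard-library function `shutil.which`.

```python
import shutil

# Executables mentioned in this thread; adjust to the real requirements.
REQUIRED_TOOLS = (
    "wget", "make", "gcc", "unzip", "rsync", "gdebi",
    "flexbar", "bowtie2", "blat", "fastqc",
    "hybrid-min", "RNAfold", "RNAup",
)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing:", " ".join(missing))
    else:
        print("all listed tools found on PATH")
```

The `which` argument exists only so that the lookup can be replaced in tests. A real installer would still need to map each missing executable to the package that provides it.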

Ideally, "hyb" should be a .deb package - That's a work in progress.

Thanks for your effort to get "hyb" to work under WSL,

Tony.

SreeniEadara commented 2 years ago

Hi Tony,

Sounds good! Let me know if I can help validate a new install script on WSL.

Sincerely, Sreenivas

tony-travis commented 1 year ago

Hi, SreeniEadara.

Sorry it's taken me so long to respond: I've just updated the INSTALL script, to include the missing dependencies that you suggested. Please let me know about any issues if you try it out.

Thanks for your interest in "hyb",

Tony.

gkudla commented 1 year ago

Hi Tony,

Can you please let me know how to use your install script?

thanks Greg

tony-travis commented 1 year ago

Hi, Greg.

It's just a "bash" shell script. Either run it with bash:

bash INSTALL

or make it executable and run it directly:

chmod +x INSTALL
./INSTALL

I think "hyb" should be distributed as a deb package, and I've discussed packaging it for Debian/Ubuntu with the Debian-Med team.

HTH,

Tony.