gkudla / hyb

hyb: a bioinformatics pipeline for the analysis of CLASH (crosslinking, ligation and sequencing of hybrids) data
GNU General Public License v3.0

Out of memory error with make_hyb_db #7

Closed. divnand closed this issue 4 years ago.

divnand commented 4 years ago

I have been trying to generate a genome database using the mouse transcriptome. I downloaded the cDNA sequences from here (ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/). After a few hours of running, I get an out-of-memory error. How much memory would I need to build the db?

This is the error I see:

/clusterfs/vector/scratch/dnandaku/hyb/bin/make_hyb_db: line 14: 8863 Killed makeblastdb -in $1 -dbtype nucl -input_type fasta -hash_index -out ${1/.fasta/} -logfile ${1/.fasta/.log}
expandFaFastBuf: integer overflow when trying to increase buffer size from 2147483648 to a min of 10499.
Building a LARGE index
slurmstepd: error: Detected 2 oom-kill event(s) in step 5651184.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

gkudla commented 4 years ago

I have not come across an out-of-memory error with make_hyb_db. If you can't get this to work, try modifying make_hyb_db by replacing the line:

makeblastdb -in $1 -dbtype nucl -input_type fasta -hash_index -out ${1/.fasta/} -logfile ${1/.fasta/.log}

with:

formatdb -i $1 -p F -o T

This will build the blast database using the original BLAST package rather than the BLAST+ package.
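If it helps, here is a minimal sketch of that edit as a one-liner. It assumes the makeblastdb call starts at the beginning of its line, and that the script lives at a typical install path; adjust both to your setup:

```bash
# Sketch only: swap the makeblastdb call for formatdb inside make_hyb_db.
# Assumes the call starts at the beginning of the line; adjust the path to
# your own hyb installation. -i.bak keeps a backup copy of the script.
# Single quotes keep $1 literal, as the script expects.
sed -i.bak 's|^makeblastdb .*|formatdb -i $1 -p F -o T|' ~/hyb/bin/make_hyb_db
```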

Alternatively, remove the makeblastdb line altogether. You will still be able to run hyb with blat or bowtie as the mapping program, but not with blast.

Hope that helps. Greg

divyanandu commented 4 years ago

Thanks Greg. I ended up modifying the Makefile in the database folder to use my database name as input, and it worked perfectly. I will test what you recommended with the make_hyb_db file.

I had a few other questions about the pipeline that I couldn't quite figure out from the paper, and I hope you can help me with them.

  1. What happens to reads that can map to multiple transcripts? The -k parameter indicates that up to 20 alignments are reported. How are these different alignments processed and which ones are reported in the hyb file?

  2. What is the difference between the ua.hyb file and the hyb file? Is there a place I can access the description of the different output files?

  3. Is there a way to get counts for each of these hybrids?

  4. I have also been having trouble using a genome db even with the anti=1 option. No hybrids are detected and all the files are empty, although the same input file with the transcriptome database shows chimeras.

Sorry to ask so many questions! And thank you for your help!

gkudla commented 4 years ago

> What happens to reads that can map to multiple transcripts? The -k parameter indicates that up to 20 alignments are reported. How are these different alignments processed and which ones are reported in the hyb file?

There are several hyb files (see below for an explanation): the hybrids.hyb file reports all hybrids consistent with the mapping, and the hybrids_ua.hyb file reports the best hybrids, as judged by the sum of the mapping e-values of the fragments and several other criteria (see remove_duplicate_hybrids_hOH5.pl for details).

> What is the difference between the ua.hyb file and the hyb file? Is there a place I can access the description of the different output files?

Please find below an explanation of the files generated by hyb. Most of the explanation is courtesy of Hywel Dunn-Davies from the Tollervey lab.

Briefly, assuming an input file called data.fasta and a hyb database called db, running the command 'hyb detect align=bowtie2 db=db format=comp in=data.fasta' would give you the following files (listed in the order in which they are generated):
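First, the command itself, written out for copy-paste (db and data.fasta are the example placeholders from the sentence above):

```bash
# Example run: bowtie2 alignment against database 'db', compressed-format
# input reads in data.fasta (both names are placeholders from the text above)
hyb detect align=bowtie2 db=db format=comp in=data.fasta
```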

data_comp_db.blast (generated by running bowtie2 on the fasta file, then sam2blast on the output): This is the basic alignment output file, giving the alignments of all reads against the database (including non-chimeric reads).

data_comp_db_mtophits.blast (generated by running mtophits_blast on the blast file): This is a filtered version of the blast file, including only the most significant hits (according to the e-value). If multiple hits are equally significant, they are all kept. As far as I know, the script assumes that in the blast output, hits with the same sequence ID are ordered by e-value.

data_comp_db_mtophits.ref (generated by running create_reference_data.pl on the mtophits.blast file): A reference data file containing a ranked list of genes, along with count statistics (i.e. the total uncollapsed number of reads associated with each gene / transcript, and the percentage of the total).

data_comp_db_singleE.blast (generated by running remove_duplicate_hits_blast.pl on the mtophits ref file and blast file): A filtered .blast file, with exactly one line per sequence ID.

data_comp_db_singleE.blast_stats.txt (generated by running blast_stats on the singleE blast file): A text file containing the number of reads for each biotype.

data_comp_db_hybrids.hyb (generated by get_mtop_hybrids.pl and some other scripts): A hyb file listing all of the hybrids, as described in the methods paper.

data_comp_db_hybrids_ua.hyb (generated by running remove_duplicate_hybrids_hOH5.pl on the hyb file): A hyb file with duplicate hybrids removed.

data_comp_db_hybrids_ua_dg.hyb: A hyb file, the same as *_ua.hyb, but with folding energy information added. NOTE: this is the file I would typically use for downstream analysis.

data_comp_db_hybrids_ua_merged.hyb: A hyb file in which overlapping hybrids have been merged, similar to the bedtools merge operation.

If you need more detail, I would suggest looking through the code of the individual scripts. They are all in the bin folder of your hyb installation, and can be opened with a text editor.
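For example, something like this (the install path is an assumption; point it at wherever you unpacked hyb):

```bash
# e.g. to see exactly how the best hybrids are chosen for the _ua.hyb file
# (path is an assumption; substitute your own hyb installation directory)
less ~/hyb/bin/remove_duplicate_hybrids_hOH5.pl
```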

> Is there a way to get counts for each of these hybrids?

The *hybrids_ua_merged.hyb file reports the counts and IDs of overlapping hybrids.
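As a rough sketch of pulling numbers out of that file (the transcript ID below is hypothetical, and one merged record per line is an assumption; check against the output of the script itself):

```bash
# Rough sketch, not a documented interface. Assumes one merged hybrid
# record per line; the transcript ID below is hypothetical.
wc -l data_comp_db_hybrids_ua_merged.hyb                          # total merged hybrids
grep -c 'ENSMUST00000000001' data_comp_db_hybrids_ua_merged.hyb   # hybrids involving one transcript
```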

> I have also been having trouble with using a genome db even with the anti=1 option. No hybrids are detected and all the files are empty although the same input file with the transcriptome database shows chimeras.

I don't recommend running hyb with a genome database.

Best, Greg

divyanandu commented 4 years ago

That was really helpful! Thank you so much. I really appreciate it.

I have one last question (hopefully!). It looks like the merged file is created as part of the analyse command in detect. We are not interested in miRNAs, so I don't really do the folding analysis of the data. Is calculating the folding energy a requirement for merging the files? When I try to run combine_hyb_merge directly from the bin folder, I get the following error:

Can't locate Hybrid_long_2.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /clusterfs/vector/scratch/dnandaku/hyb/bin/combine_hyb_merge line 30.
BEGIN failed--compilation aborted at /clusterfs/vector/scratch/dnandaku/hyb/bin/combine_hyb_merge line 30.

gkudla commented 4 years ago

You can run combine_hyb_merge directly, without the folding analysis. Please see the link below, which explains how to tell perl where to look for the module Hybrid_long_2.pm:

https://perlmaven.com/how-to-change-inc-to-find-perl-modules-in-non-standard-locations
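In short, something like the following should work. The path comes from your error message, so substitute your own installation (this also assumes Hybrid_long_2.pm sits in the bin folder alongside the scripts):

```bash
# Option 1: add the hyb bin directory to perl's module search path
# for the whole session (path is the one from your error message).
export PERL5LIB=/clusterfs/vector/scratch/dnandaku/hyb/bin:$PERL5LIB

# Option 2: pass the directory to perl directly for a single run
# (arguments to combine_hyb_merge omitted; supply your usual ones).
perl -I/clusterfs/vector/scratch/dnandaku/hyb/bin \
    /clusterfs/vector/scratch/dnandaku/hyb/bin/combine_hyb_merge
```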

divyanandu commented 4 years ago

Thanks! That worked!