A pipeline for clustering long 16S rRNA sequencing reads, or any sequences, into Operational Taxonomic Units.
> install.packages("seqinr")
The pipeline is designed for PacBio CCS reads; it will not work on raw PacBio reads.
The only input file to oclust is a file in FASTA format containing the sequencing reads to be clustered.
FASTQ files can be converted to FASTA:
$ cd utils
$ chmod +x fastq_to_fasta.pl
$ ./fastq_to_fasta.pl file.fastq > file.fasta
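For readers curious what such a conversion involves, here is a minimal sketch in shell. It is not the bundled fastq_to_fasta.pl script, and the function name `fastq_to_fasta_sketch` is hypothetical. It assumes strict four-line FASTQ records (header, sequence, `+`, qualities), so it would not handle sequences wrapped over several lines.

```shell
# Sketch only: assumes every FASTQ record spans exactly four lines.
# The function name is hypothetical and not part of oclust.
fastq_to_fasta_sketch() {
    # Line 1 of each record is the '@' header: swap '@' for '>'.
    # Line 2 is the sequence: print it unchanged.
    # Lines 3 ('+') and 4 (quality string) are dropped.
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}
```

Usage would look like `fastq_to_fasta_sketch file.fastq > file.fasta`.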
Get the repository:
$ git clone https://github.com/oscar-franzen/oclust.git oclust
Make the scripts executable (might not be necessary):
$ cd oclust
$ chmod +x *.pl
Decide whether to compute distances from pairwise Needleman-Wunsch alignments or from an Infernal multiple sequence alignment. The latter is substantially faster.
The first time it is executed, oclust_pipeline.pl will download the human genome sequence and format it.
$ ./oclust_pipeline.pl -x <method> -f <input file> -o <output directory> -p <number of CPUs>
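As an illustration of how the placeholders might be filled in, the sketch below assembles a command line and prints it for review instead of running it. The helper name `oclust_cmd` and the example paths are hypothetical; only the flags `-x`, `-f`, `-o` and `-p` come from the usage line above.

```shell
# Dry-run helper (hypothetical, not part of oclust): prints the command
# line that would be run, so it can be reviewed before execution.
oclust_cmd() {
    method=$1   # PW or MSA
    input=$2    # input FASTA file
    outdir=$3   # output directory (full path, must not exist yet)
    cpus=$4     # value passed to -p
    printf '%s\n' "./oclust_pipeline.pl -x $method -f $input -o $outdir -p $cpus"
}
```

For example, `oclust_cmd MSA /data/reads.fasta /data/oclust_out 4` prints the corresponding invocation, which can then be pasted into the shell from inside the oclust directory.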
General settings:
-x PW or MSA                    PW for pairwise alignments (based on
                                Needleman-Wunsch) or MSA for multiple sequence
                                alignment (based on Infernal). [MSA]
-t local or cluster             If -x is PW, parallelize either locally on
                                multiple cores or by submitting jobs to a
                                cluster (requires a system with the LSF
                                scheduler). [local]
-a complete, average or single  The desired clustering algorithm. [complete]
-f [string]                     Input FASTA file.
-o [string]                     Name of the output directory (must not already
                                exist); use the full path.
-R HMM, BLAST, or none          Method to use for reverse complementing
                                sequences. [HMM]
-p [integer]                    Number of processor cores to use for BLAST. [4]
-minl [integer]                 Minimum sequence length. [optional]
-maxl [integer]                 Maximum sequence length. [optional]
-rand [integer]                 Randomly sample the specified number of
                                sequences. [optional]
-human Y or N                   If Y, run a BLAST-based contamination screen
                                against the human genome. [Y]
-chimera Y or N                 Run the chimera check. [Y]
LSF settings (only valid for -x PW when -t cluster):
-lsf_queue [string]             Name of the LSF queue to use. [scavenger]
-lsf_account [string]           Name of the account to use. [optional]
-lsf_time [integer]             Runtime per job, in hours. [1]
-lsf_memory [integer]           Requested amount of RAM in MB. [3000]
-lsf_nb_jobs [integer]          Number of jobs. [20]
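Two of the settings above are easy to get wrong: the input file given to -f must exist, and the directory given to -o must be a full path that does not exist yet. A pre-flight check along the following lines could catch both before a long run starts. This is a sketch under those stated assumptions; the function `check_oclust_args` is hypothetical and is not shipped with oclust.

```shell
# Hypothetical pre-flight check; not part of oclust.
# Returns 0 if the arguments look usable, 1 otherwise.
check_oclust_args() {
    input=$1
    outdir=$2
    # The input FASTA file must exist.
    [ -f "$input" ] || { echo "input file not found: $input" >&2; return 1; }
    # The output directory must be given as a full (absolute) path.
    case $outdir in
        /*) ;;
        *) echo "output directory must be a full path: $outdir" >&2; return 1 ;;
    esac
    # The output directory must not exist yet.
    [ ! -e "$outdir" ] || { echo "output directory already exists: $outdir" >&2; return 1; }
    return 0
}
```

It could be combined with the pipeline call, for example `check_oclust_args reads.fasta /full/path/out && ./oclust_pipeline.pl -x MSA -f reads.fasta -o /full/path/out -p 4`, where the paths are placeholders.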
The oclust pipeline bundles together the following open source/public domain software:
$ ./configure --prefix=~/R/ --enable-static=yes --with-x=no --with-tcltk=no
$ ./configure --prefix=~/e/ --disable-shared --without-mysql --without-postgresql --without-axis2c --without-hpdf --without-x --without-pngdriver