RelocaTE-1-0-5
RelocaTE: is a collection of scripts in which short reads (paired or unpaired), a fasta containing the sequences of transposable elements and a reference genome sequence are the input and the output is a series of files containing the locations (relative to the reference genome) of TE insertions in the reference and short reads
CharacTErizer is a companion tool that compares the numbers of reads that flank the TE sequence and contain genomic sequence to the number of reads that span a predicted insertion site with no gaps. These spanners contain no TE sequence. The ratio of spanners to flankers is used to classify the insertion as homozygous, heterozygous, or new (somatic). Somatic excision events can also be predicted.
Updates
-t | --te_fasta Str: TE FASTA File -d | --fq_dir Str: directory of fq files -g | --genome_fasta Str: reference genome fasta file -e | --exper Str: Sample identifier -o | --outdir Str: output directory name -1 | --mate_1_id Str: unique mate/pair 1 string -2 | --mate_2_id Str: unique mate/pair 2 string -u | --unpaired_id Str: unique unPaired string -p | --parallel 0|1: split into many jobs -a | --qsub_array 0|1: create PBS array job script -w | --workingdir Str: working directory name -l | --len_cutoff n: min length cutoff for alignment to reference -bm | --blat_minScore n: blat minScore for alignment to TE -bt | --blat_tileSize n: blat tileSize for alignmetn to TE -m | --mismatch n: fraction (0 or less ex. 0.1) mismatches allowed in blat alignment to TE -f | --flanking_seq_len n: length of the insertion site flanking seq to be returned -r | --reference_ins Str: To identify reference and shared insertions (reference and reads) choose option-1 or option-2. option-1) use '-r 1' to have RelocaTE find the location of your TE in the reference. option-2) input the file name of a tab-delimited file containing the coordinates of TE insertions pre-existing in the reference sequence. [no default]
Required. No default value.
The file name of the fasta file containing the nucleotide sequence of one or many transposable elements. The sequence should include the complete terminal inverted repeats (TIRs) [or LTR] but not include the target site in the sequence proper. The target site should be provided in the description portion of the fasta file in the following format, TSD=xyz. The TSD will be searched for in the both the forward and reverse strand. [[after testing the use of no TSD, write weather TSD= can be used]]
SAMPLE TE FASTA:
>mping TSD=TTA GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAAA ATGATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCAGC GTCGTTTCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCGGAT CTCTTGCGTCCGCCTCCGCCGTGCGACCTCCGCATTCTCCCGCGCCGCGCCGGATTTTGGGTACAAA TGATCCCAGCAACTTGTATCAATTAAATGCTTTGCTTAGTCTTGGAAACGTCAAAGTGAAACCCCTC CACTGTGGGGATTGTTTCATAAAAGATTTCATTTGAGAGAAGATGGTATAATATTTTGGGTAGCCGT GCAATGACACTAGCCATTGTGACTGGCC
The fasta must contain TSD=
This can be a Perl regular expression.
Example: these exact characters TTA: TSD=TTA Example: any 4 characters: TSD=.... Example: A or T followed by GCC: TSD=(A|T)GCC Example: CGA followed by any character then an A then CT or G: TSD=CGA.A(CT|G)
Required. No default value.
The name of the directory of paired and unpaired fastq files (paired _p1.fq & _p2.fq). Both the .fq and .fastq extensions are accepted as extensions of fastq files. If something different is used RelocaTE will not recognize those files as being fastq files.
Optional, Recommended. No default value.
The file name of the fasta file containing the genome sequence of the reference. If it is provided, the locations of the insertion events will be reported relative to the reference genome coordinates.
If the genome sequence is not provided a series of files will be generated. One set will contain the intact reads that align to the TE. The second and third set of files will be made up of trimmed reads. The second set will be only the trimmed portion of the reads in the first set that align to the TE. The third set will contain the trimmed portion of the reads that do not align to the TE, therefore the portion of the reads that should align to the genome sequence not containing a TE insertion.
Optional, Recommended. The default value is not.given
A short string for sample name. This string will be used in the output files to create IDs for the insert (ex. A123)
Optional, Recommended. The default value is outdir_teSearch
A short string for the output directory name. This string will be used to create a directory to contain the output files and directories in the current working directory. The complete path is not required, only the desired name for the directory.
Optional, Recommended. The default value is _p1
A string to identify mate 1 paired files. Should contain the unique text and any text between the unique text and the fq extension. This string will be used in a regular expression to identify the files as a mate 1 file, so the string should not be found in the mate 2 file or the unpaired files
Example:
If the files are named as such: file_1.fq The string would be: _1 File: file_1.noNumbers.fq String: _1.noNumbers File: file_1_1.fq (and mate = file_1_2.fq) String: _1 Issue with last scenario: _1 will recognize both mates. Suggestion: rename files to file_1_p1.fq and file_1_p2.fq. Now the string _p1 can be used to uniquely identify all _p1 files and no _p2 files.
Optional, Recommended. The default value is _p2
See -1 for a more in depth explanation.
Example:
File: file_p2.fq String: _p2
Optional, Recommended. The default value is .unPaired.
See -1 for a more in depth explanation.
Example:
File: file.unParied.fq String: .unParied
Optional. Default value is 1.
n is 0 or 1.
0: means only one large job will be run.
1: many shell scripts will be generated for the user to manually run
Break down the single big job of relocaTE into as many smaller jobs as possible. If selected this option will cause the creation of shell scripts which can be manually ran or submitted to a queue. This enables the jobs to be run in parallel. The folders of shell scripts should be run as ordered. Step_1 needs to run and be complete before Step_2 jobs can be proper started. If the genome fasta had already been split and indexed this job will be skipped.
Optional. Default value is 1.
n is 0 or 1.
0: Nothing will be done
1: if -p 1 then array jobs will be created
If -p 1 then -a can be 1, otherwise it will be 0. If 1, in addition to the shell scripts generated from -p 1, qsub PBS array job scripts will be made for easier submittion of the shell scripts to the queue. See man qsub option -t
for more information.
Submit each array job one at a time, waiting for the previous job to be completed before submitting the next.
See run_these_jobs.sh for the array jobs.
Optional. Default value is the current working directory.
If a directory different form the cwd is given it needs to exist, will not create. Provide the full path.
Optional. Default value is 10.
n is a value for the length cutoff. This is the minimum length that a read needs to be after the removal of TE sequence. This trimmed read will be aligned to the genome. When selecting a custom value consider these points:
Optional. Default value is 10.
n is used for the blat minScore value for the comparison of reads to the TE sequence.
Excerpt directly from Blat manual:
-minScore=N sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty.
Optional. Default value is 0.
Any number less than or equal to 0.
Fraction of the bps that aligned to the TE that are allowed to not be an exact match. For example, if 10 bp align to the TE and the allowance is 0.1, 1 bp can be a mismatch.
Optional. Default value is 7.
n is used for the blat tileSize value for the comparison of reads to the the TE sequence.
Excerpt directly from Blat manual:
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12 Default is 11 for DNA and 5 for protein.
Optional. Default value is 100.
n is the length of the sequence flanking the insertion site that will be returned in an output fasta file and in the output gff file. This sequence is taken from the reference genome.
Optional. No default value.
If manaully providing the coordinates the required format is two columns, neither column have any white space. The first colum is the TE name. Then a tab separates the first column from the second column. And the second column contains the reference sequence name as described in the reference fasta, the starting bp, a colon, double periods, and finally the ending bp.
SAMPLE Existing TE File:
mping Chr12:839604..840033 mping Chr12:1045463..1045892
1. Get the sequence of your TE, including the TIRs. Create a fasta file with your sequence, TE name and the TSD. The TSD= is required. With DNA transposons, by definition, during an insertion event the target site is duplicated. Therefore the target site will be used to identify an insertion event. The reverse complement of each read containing portions of the ends of the provided TE will also be searched for the TSD to identify insertion events. A specific sequence of nucleotides can be used or a perl regular expression. For example:
TSD=TTA
TSD=TT[AG](C|GT).
any 4 bp TSD=....
Example TE Fasta File:
>mping TSD=TTA GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAA AATGATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCA GCGTCGTTTCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCG GATCTCTTGCGTCCGCCTCCGCCGTGCGACCTCCGCATTCTCCCGCGCCGCGCCGGATTTTGGGTA CAAATGATCCCAGCAACTTGTATCAATTAAATGCTTTGCTTAGTCTTGGAAACGTCAAAGTGAAAC CCCTCCACTGTGGGGATTGTTTCATAAAAGATTTCATTTGAGAGAAGATGGTATAATATTTTGGGT AGCCGTGCAATGACACTAGCCATTGTGACTGGCC
2. Get the short reads together.
- Check the name of your files. If they are paired they should have some indication of being pair 1 and pair 2.
- Determine the pattern to indicate the 1st and 2nd pair
Example:
Flowcell25_lane1_pair1.fastq and Flowcell25_lane1_pair2.fastq _pair1 would match the first paired file _pair2 would match the second paired file
3. Get the reference genome fasta. This will be one file containing all the reference sequences of your organism.
4. If you want to determine if your short reads contain the same TE insertions as the reference, use the '-r 1' option
OR, use -r "filename": you can create a file with the TE name and the coordinates in the reference:
SAMPLE:
mping Chr12:839604..840033 mping Chr12:1045463..1045892
5. Do you have a queue system available to you, or do you have multiple processors? If so, you can select the -p 1 option. This will create a series of shell scripts that you can submit to the queue or run on multiple processors.
- These should be run in order. For example all of step_1 should be run and completed before the shell scripts of step_2 are run, and so on.
6. Do you have a queue system available and is it PBS? If so, you can select the -a 1 option to generate array job scripts. These jobs can be submitted to the queue instead of submitting hundreds of individual shell scripts. The array job will submit the hundreds of scripts for you.
- Make sure to submit the array jobs in order and wait for the job to complete before submitting the next one.
7. You are now ready to run relocaTE.pl with your data. If you run the program without any command line options, it will print out a list of the options and short descriptions.
sample_name.TE_name.all_inserts.gff GFF3 file containing all reference and non-reference insertions sample_name.TE_name.all_nonref.txt tab-delimited file containing all potential non-reference (insertions found only in reads, absent from reference) insertion sites sample_name.TE_name.confident_nonref.txt tab-delimited file containing only the confident non-reference insertion sites sample_name.TE_name.confident_nonref_genomeflanks.fa FASTA file containing the genome sequence which flanks each confident non-reference site sample_name.TE_name.confident_nonref_reads_list.txt text file containing a list of the reads used to call each confident non-reference insertion sample_name.TE_name.all_reference.txt text file containing counts of reads which overlap the reference insertions.
If you have a multi-node cluster you can speed up your RelocaTE run by dividing your fastq files into many smaller files. Dividing your reads files into many files of 1 million reads is a nice way to improve the speed of the analysis. A script included in the package can be used to do this. It is called fastq_split.pl.
perl fastq_split.pl Please Provide: -s int The number of sequences per file [1_000_000] -o dir The name of the directory to output the files followed by a list of files to be split Usage: ./fastq_split -o split_fq ~/somedir/someRandom.fq ./fastq_split -o split_fq ~/somedir/*fq
usage: ./characterizer.pl [-s relocaTE table output file][-b bam file or dir of bam files][-g reference genome fasta file][-h] options: -s file relocaTE output file: YOURSAMPLENAME.mping.all_nonref.txt [no default] -b dir/file bam file of the orginal reads aligned to reference (before TE was trimmed) or directory of bam files [no default] -g file reference genome fasta file [no default] -x int find excision events that leave a footprint, yes or no (1|0) [0] -h this help message For more information see documentation: http://srobb1.github.com/RelocaTE/
#create bwa index file bwa index -a bwtsw MSUr7.sample.fa #Align Pair 1 fastq bwa aln MSUr7.sample.fa fq/sample_p1.fq > sample_p1.sai #align Pair 2 fastq bwa aln MSUr7.sample.fa fq/sample_p2.fq > sample_p2.sai #generate SAM for paired reads bwa sampe MSUr7.sample.fa sample_p1.sai sample_p2.sai fq/sample_p1.fq fq/sample_p2.fq > sample.paired.sam <<<<<<< .merge_file_H6775A #align unparied bwa aln MSUr7.sample.fa fq/sample.unPaired.fq > sample.unPaired.sai #generate SAM for unpaired reads bwa samse MSUr7.sample.fa sample.unPaired.sai fq/sample.unPaired.fq > sample.unPaired.sam ======= #align unparied bwa aln MSUr7.sample.fa fq/sample.unPaired.fq > sample.unPaired.sai #generate SAM for unpaired reads bwa samse MSUr7.sample.fa sample.unPaired.sai fq/sample.unPaired.fq > sample.unPaired.sam >>>>>>> .merge_file_yAfFKy #generate BAM with SAMtools samtools view -h -b -S -T MSUr7.sample.fa sample.paired.sam > sample.paired.bam samtools view -h -b -S -T MSUr7.sample.fa sample.unPaired.sam > sample.unPaired.bam #combine BAM samtools cat -o sample.bam sample.unPaired.bam sample.paired.bam #sort BAM with SAMtools samtools sort sample.bam sample.sorted #index BAM with SAMtools samtools index sample.sorted.bam
For any of the listed reasons, or anything else, please leave us a message here