L-GIREMI (Long-read Genome-independent Identification of RNA Editing by Mutual Information) is a method for identification of RNA editing sites from long-read RNA-seq data.
The L-GIREMI software was developed with python3 on Linux, which demands several python packages.
And the analysis with L-GIREMI also needs additional software:
It's recommended to install the software into a virtual environment.
Create virtual environment:
conda create -n l_giremi
conda activate l_giremi
Or:
virtualenv l_giremi
source bin/activate
Download from github:
git clone https://github.com/gxiaolab/L-GIREMI
cd L-GIREMI
Install the package:
python setup.py install
pip install l-giremi
Users should make sure to use reference files with the same assembly version and the same chromosome naming convention in all the processes.
L-GIREMI requires a genome fasta file as input. For the human genome, the fasta file can be obtained from UCSC or NCBI. The fasta file should include sequences from all the chromosomes.
L-GIREMI will use a SNP vcf file to get putative heterozygous SNPs. A vcf of dbSNP is acceptable. Alternatively, the users can provide a sample-specific SNP vcf file, if available. The vcf file should be converted into the bcf format, sorted, and indexed.
The following is an example flow to obtain dbSNP vcf and convert the dbSNP chromosome names into UCSC format.
wget "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.38.gz"
wget "https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.38.gz.md5"
md5sum -c GCF_000001405.38.gz.md5
wget "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_report.txt"
egrep -v "^#" GCA_000001405.28_GRCh38.p13_assembly_report.txt | cut -f 7,10 | tr "\t" " " > GRCh38-to-hg38.map
bcftools annotate --rename-chrs GRCh38-to-hg38.map GCF_000001405.38.gz -Ob -o dbsnp.38.hg38.bcf
bcftools index dbsnp.38.hg38.bcf
A gene annotation gtf is required for L-GIREMI, which can be obtained from gencode. The gtf file should be sorted, zipped, and indexed.
The following is an example flow to prepare the gtf file:
(grep ^"#" gencode.v37.annotation.gtf; \
grep -v ^"#" gencode.v37.annotation.gtf | \
sort -k1,1 -k4,4n) | \
bgzip > gencode.v37.annotation.sorted.gtf.gz
tabix -p gff gencode.v37.annotation.sorted.gtf.gz
A table containing annotated simple repeats is used by L-GIREMI to pre-filter sites. It can be obtained from UCSC table browser. The table format should be: chromosome, start, end. No header is needed for the file.
usage: l-giremi [-h] -b BAM_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX] [-t THREAD] --annotation_gtf ANNOTATION_GTF --genome_fasta GENOME_FASTA [--homopoly_length HOMOPOLY_LENGTH] [--keep_non_spliced_read]
[--max_het_snp_ratio MAX_HET_SNP_RATIO] [--mi_calculation_only] [--mi_min_common_read MI_MIN_COMMON_READ] [--mi_p_threshold MI_P_THRESHOLD] [--min_allele_depth MIN_ALLELE_DEPTH]
[--min_allele_ratio MIN_ALLELE_RATIO] [--min_het_snp_ratio MIN_HET_SNP_RATIO] [--min_total_depth MIN_TOTAL_DEPTH] [--min_dist_from_splice MIN_DIST_FROM_SPLICE] [--mode MODE] [--model MODEL]
[--padding_exon PADDING_EXON] [--padding_gene PADDING_GENE] --repeat_txt REPEAT_TXT [--skip_strand_correction] --snp_bcf SNP_BCF [--version]
L-GIREMI (Long-read RNA-seq Genome-independent Identification of RNA Editing by Mutual Information)
optional arguments:
-h, --help show this help message and exit
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
-t THREAD, --thread THREAD
cores to be used
--annotation_gtf ANNOTATION_GTF
gtf (gz and tabix indexed) file of genome annotation (gencode)
--genome_fasta GENOME_FASTA
path of genome fasta file
--homopoly_length HOMOPOLY_LENGTH
left and right sequence length to be searched for the homopoly around sites (default: 5)
--keep_non_spliced_read
Keep non spliced reads (default: without this parameter, reads without splicing would be dropped.)
--max_het_snp_ratio MAX_HET_SNP_RATIO
Max ratio to be considered as heterogenous SNPs (default: 0.65)
--mi_calculation_only
Only calculate the mutual information, without the modeling step. (default: calculate the steps after the MI step)
--mi_min_common_read MI_MIN_COMMON_READ
Min common read for site pairs to calculate MI (default: 6)
--mi_p_threshold MI_P_THRESHOLD
MI p value threshold to be used to separate RNA editing sites (default: 0.05)
--min_allele_depth MIN_ALLELE_DEPTH
Min mismatch allele read count to be considered (default: 3)
--min_allele_ratio MIN_ALLELE_RATIO
Min mismatch ratio to be considered (default: 0.05)
--min_het_snp_ratio MIN_HET_SNP_RATIO
Min ratio to be considered as heterogenous SNPs (default: 0.35)
--min_total_depth MIN_TOTAL_DEPTH
Min mismatch total read count to be considered (default: 6)
--min_dist_from_splice MIN_DIST_FROM_SPLICE
Drop sites within the distance from splice junctions (default: 4)
--mode MODE Mode to find mismatches in BAM: with cs tags (cs), or CIGAR and MD tags (cigar). (defalt: cs)
--model MODEL GLM model for scoring: lm, logistic. (defalt: lm)
--padding_exon PADDING_EXON
expand the range when searching exon gtf (default: 10)
--padding_gene PADDING_GENE
expand the range when searching gene gtf (default: 500)
--repeat_txt REPEAT_TXT
path of txt file of simple repeats [chromosom, start, end] (0 based)
--skip_strand_correction
skip the strand correction step to save runtime.
--snp_bcf SNP_BCF path of dbSNP bcf file
--version show program's version number and exit
Once the following files are available (see above):
First, align the long-read RNA-seq fastq file using
minimap2 with cs tag, which is
required by l-giremi
.
minimap2 -t 8 -ax splice -uf \
--secondary=no -N 5 --cs \
-o SAM_FILE GENOME_FILE FASTQ_FILE
Then, remove the unmapped and non-uniquely mapped reads from the SAM file, and sort it into an indexed BAM file.
samtools view -O BAM -F 2052 -h $SAM_FILE | \
samtools sort -O BAM -@ 7 -o $BAM_FILE -
samtools index $BAM_FILE
Next, run l-giremi
.
l-giremi \
-t 8 \
--bam_file $BAM_FILE \
--output_prefix $OUTPREFIX \
--genome_fasta $GENOME_FILE \
--snp_bcf $SNP_FILE \
--repeat_txt $REPEAT_FILE \
--annotation_gtf $GTF_FILE \
--chromosomes chr1 chr2 chr3 chr4 \
chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 \
chr13 chr14 chr15 chr16 chr17 chr18 chr19 \
chr20 chr21 chr22 chrX chrY
If only the MI calculation is needed, for the purpose of saving time, you can set the parameter to ignore the steps after the MI step.
l-giremi \
-t 8 \
--bam_file $BAM_FILE \
--output_prefix $OUTPREFIX \
--genome_fasta $GENOME_FILE \
--snp_bcf $SNP_FILE \
--repeat_txt $REPEAT_FILE \
--annotation_gtf $GTF_FILE \
--chromosomes chr1 chr2 chr3 chr4 \
chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 \
chr13 chr14 chr15 chr16 chr17 chr18 chr19 \
chr20 chr21 chr22 chrX chrY \
--mi_calculation_only
And, only the mutual information result would be output.
l-giremi
generates several output files:
$OUTPREFIX.strand.txt
.
columns:
$OUTPREFIX.mi.txt
.
columns:
$OUTPREFIX.mismatch.txt
.
columns:
$OUTPREFIX.score_performance
.
columns:
There are also some useful scripts provided with the main l-giremi code.
correct_read_strand
correct_read_strand
is used for the correction of strand of long
reads. It will use the information of splicing pattern of reads, gtf
annotations to correct the strand. The output is a txt file with 3
columns: read_name, original_read_strand, corrected_read_strand.
usage: correct_read_strand [-h] -b BAM_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX] [-t THREAD] --annotation_gtf ANNOTATION_GTF --genome_fasta GENOME_FASTA [--keep_non_spliced_read] [--mode MODE] [--padding_exon PADDING_EXON]
[--padding_gene PADDING_GENE]
Correct read strand.
optional arguments:
-h, --help show this help message and exit
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
-t THREAD, --thread THREAD
cores to be used
--annotation_gtf ANNOTATION_GTF
gtf (gz and tabix indexed) file of genome annotation (gencode)
--genome_fasta GENOME_FASTA
path of genome fasta file
--keep_non_spliced_read
Keep non spliced reads (default: without this parameter, reads without splicing would be dropped.)
--mode MODE Mode to find mismatches in BAM: with cs tags (cs), or CIGAR and MD tags (cigar). (defalt: cs)
--padding_exon PADDING_EXON
expand the range when searching exon gtf (default: 10)
--padding_gene PADDING_GENE
expand the range when searching gene gtf (default: 500)
correct_splice_site
Since the raw splice sites detected in the long-read RNA-seq data may
not be accurate, a correction procedure is needed (Wyman and
Mortazavi, 2019). correct_splice_site
adopted similar correction
strategies with TranscriptClean. The correction needs a GTF annotation
file.
usage: correct_splice_site [-h] -s SPLICE_FILE --annotation_gtf ANNOTATION_GTF
[-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX] [--window WINDOW]
Correct splice sites by gtf file
optional arguments:
-h, --help show this help message and exit
-s SPLICE_FILE, --splice_file SPLICE_FILE
input splice file [read_name, chromosome, pos, type] with column names
--annotation_gtf ANNOTATION_GTF
gtf (gz and tabix indexed) file of genome annotation (gencode)
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
--window WINDOW window to be considered as the same splice site (default: 10)
get_aei
get_aei
is used for the calculation of Alu Editing Index (AEI)
from long-read RNA-seq data (Roth et al., 2019). Calculation of
AEI neads the read strand files (output of l-giremi
), the genome
fasta file, a txt file with Alu locations, and the SNP BCF reference.
usage: get_aei [-h] -b BAM_FILE --strand_file STRAND_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX]
--genome_fasta GENOME_FASTA --snp_bcf SNP_BCF --alu_txt ALU_TXT [-t THREAD]
Calculate AEI from bam file, both in read level and Alu level
optional arguments:
-h, --help show this help message and exit
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
--strand_file STRAND_FILE
corrrect strand file for reads, generated by the L-GIREMI
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
--genome_fasta GENOME_FASTA
path of genome fasta file
--snp_bcf SNP_BCF path of dbSNP bcf file
--alu_txt ALU_TXT path of bed file of simple repeats [chromosom, start, end, name] (0 based)
-t THREAD, --thread THREAD
cores to be used
get_read_site
get_read_site
output a table that has the read name, genomic
position, and the nucleotide of the position on the read. Running the
script requires a file that provides the genomic position of sites
that are interested.
usage: get_read_site [-h] -s SITE_FILE -b BAM_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX]
[--mapq_threshold MAPQ_THRESHOLD] [-t THREAD]
Get read and site information from bam file
optional arguments:
-h, --help show this help message and exit
-s SITE_FILE, --site_file SITE_FILE
input site file: [chromosome, pos], without header, separated by tab
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
--mapq_threshold MAPQ_THRESHOLD
Min MAPQ to be considered in bam file (default: 20)
-t THREAD, --thread THREAD
cores to be used
get_read_mismatch
get_read_mismatch
is similar to get_read_site
, except that
it only outputs the read name and the mismatched sites, without matched
sites. get_read_mismatch
also requires a file that provides the
genomic position of sites that are interested.
usage: get_read_mismatch [-h] -s SITE_FILE -b BAM_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX]
[--mapq_threshold MAPQ_THRESHOLD] [-t THREAD]
Get read and mismatch site information from bam file
optional arguments:
-h, --help show this help message and exit
-s SITE_FILE, --site_file SITE_FILE
input site file: [chromosome, pos], without header, separated by tab
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
--mapq_threshold MAPQ_THRESHOLD
Min MAPQ to be considered in bam file (default: 20)
-t THREAD, --thread THREAD
cores to be used
get_read_splice
get_read_splice
outputs read names and the splicing sites
detected by minimap2 in the reads.
usage: get_read_splice [-h] -b BAM_FILE [-c [CHROMOSOMES ...]] [-o OUTPUT_PREFIX]
[--mapq_threshold MAPQ_THRESHOLD] [-t THREAD]
Get read and splice sites from bam file
optional arguments:
-h, --help show this help message and exit
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-c [CHROMOSOMES ...], --chromosomes [CHROMOSOMES ...]
chromosomes to be analyzed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
--mapq_threshold MAPQ_THRESHOLD
Min MAPQ to be considered in bam file (default: 20)
-t THREAD, --thread THREAD
cores to be used
calculate_site_splice_mi
calculate_site_splice_mi
can calculate the mutual information for
mismatched sites (RNA editing sites or SNPs) and splicing sites.
usage: calculate_site_splice_mi [-h] [-m READ_SITE] [-s READ_SPLICE] [-o OUTPUT_PREFIX]
calculate the mutual information of mismatch site and splice site pairs
optional arguments:
-h, --help show this help message and exit
-m READ_SITE, --read_site READ_SITE
read-site file: [read_name, chromosome, pos, seq]
-s READ_SPLICE, --read_splice READ_SPLICE
corrected read-splice file: [read_name, chromosome, pos, type, corrected_pos,
annotation]
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
split_bam_by_site
split_bam_by_site
can fetch reads that cover one genomic location
and save the reads into separate SAM files by the location, which can
be used in IGV ploting.
usage: split_bam_by_site [-h] -b BAM_FILE [-o OUTPUT_PREFIX] [-c CHROMOSOME] [-p POS]
Save reads that cover one position into separated SAM files by the genomic location
optional arguments:
-h, --help show this help message and exit
-b BAM_FILE, --bam_file BAM_FILE
input bam file, with cs tags, sorted and indexed
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output file
-c CHROMOSOME, --chromosome CHROMOSOME
-p POS, --pos POS