https://www.biorxiv.org/content/10.1101/2023.03.21.533716v1
Before creating the conda environment, please ensure the required packages are installed.
conda create -n LocusMasterTE python=3.6 future pyyaml cython=0.29.7 numpy=1.16.3 pandas=1.1.3 scipy=1.2.1 intervaltree=3.0.2
conda activate LocusMasterTE
conda install -c bioconda htslib
pip install pysam==0.15.2
git clone https://github.com/jasonwong-lab/LocusMasterTE.git
cd LocusMasterTE
python3 setup.py build && python3 setup.py install
LocusMasterTE bulk assign -h
LocusMasterTE provides a preprocessing wrapper script that converts long-read FASTQs into LocusMasterTE input.
First, the required tools need to be installed. Refer to each tool's website for installation instructions.
Four inputs are needed:
1. Genomic FASTA
2. Long Read FASTQ
3. Gene and TE GTF files
4. Path to Output Directory
Example command:
bash LocusMasterTE/long_read_wrapper.sh hg38.fasta long_read.fastq.gz hg38.gtf output_path
If you have short-read FASTQ files, the following command is recommended.
STAR --runThreadN 20 --genomeDir $outdir/STAR_index \
--readFilesIn $short_fq1 $short_fq2 --readFilesCommand zcat --outFileNamePrefix STAR_output \
--outSAMtype BAM Unsorted --outSAMstrandField intronMotif --outSAMattributes All --outSAMattrIHstart 0 \
--outFilterMultimapNmax 100 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --clip3pNbases 0 \
--winAnchorMultimapNmax 100 --alignEndsType EndToEnd --alignEndsProtrude 100 DiscordantPair --chimSegmentMin 250 --twopassMode Basic
# sort by read name
samtools collate -o Aligned_sort_name.out.bam --output-fmt BAM Aligned_sort.out.bam
If you have a short-read BAM file, it needs to be sorted by read name.
Run the command below. The output BAM can then be used directly as input to LocusMasterTE.
samtools collate -o Aligned_sort_name.out.bam --output-fmt BAM Aligned_sort.out.bam
bash LocusMasterTE/data/run_sample.sh
A BAM file (sample_alignment_sort.bam), an annotation (annotation.gtf) and a long read TPM file (long_read_data.txt) are included in the LocusMasterTE/data folder.
The recommended command line is written in the bash file (run_sample.sh).
The input BAM file must be sorted by READ NAME. Otherwise, LocusMasterTE does not work properly.
A coordinate-sorted BAM file is not accepted.
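A quick sanity check before running LocusMasterTE is to look at the sort order declared in the BAM header. The sketch below is an illustration only and is not part of LocusMasterTE: it parses header text (for example, the output of `samtools view -H file.bam`) and reads the `SO` and `GO` tags of the `@HD` line as defined in the SAM specification. The function names are hypothetical. The header only reports what the producing tool declared, so this can flag a coordinate-sorted file but cannot prove that reads are grouped by name.

```python
def sort_info(header_text):
    """Return the SO and GO tags from the @HD line of a SAM/BAM text header."""
    # Defaults defined by the SAM specification when the tags are absent.
    info = {"SO": "unknown", "GO": "none"}
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                tag, _, value = field.partition(":")
                if tag in info:
                    info[tag] = value
            break
    return info


def is_coordinate_sorted(header_text):
    """True when the header declares a coordinate sort."""
    return sort_info(header_text)["SO"] == "coordinate"
```

If `is_coordinate_sorted` returns True, the file should be regrouped by read name with the `samtools collate` command shown above before it is passed to LocusMasterTE.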
LocusMasterTE bulk assign
LocusMasterTE was built upon Telescope. The additional arguments are described below.
long_read
Mandatory argument.
Path to long read file composed of three columns: "Geneid", "TPM", and "subF".
"Geneid" holds the individual TE names, and "TPM" holds the corresponding TPM values.
"subF" holds the subfamily each TE belongs to, taken from the RepeatMasker database.
(default: None)
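To make the expected layout of the long read file concrete, here is a minimal sketch that loads such a table and checks that the three columns are present. It is not LocusMasterTE's own parser. It assumes a tab-delimited text file with a header row, which the description above does not state, so the `delimiter` parameter is left adjustable. The function name `load_long_read_table` is hypothetical.

```python
import csv

REQUIRED_COLUMNS = ("Geneid", "TPM", "subF")


def load_long_read_table(path, delimiter="\t"):
    """Read a long-read TPM table and return {Geneid: (TPM, subF)}.

    Assumption: the file has a header row and is tab-delimited.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError("missing column(s): " + ", ".join(missing))
        table = {}
        for row in reader:
            # TPM must be numeric; float() raises ValueError otherwise.
            table[row["Geneid"]] = (float(row["TPM"]), row["subF"])
    return table
```

Running this on the file before passing it to LocusMasterTE catches a misspelled column name or a non-numeric TPM value early.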
Run Modes:
--reassign_mode {exclude,choose,average,conf,unique,long_read}
Reassignment mode. After EM is complete, each fragment
is reassigned according to the expected value of its
membership weights. The reassignment method is the
method for resolving the "best" reassignment for
fragments that have multiple possible reassignments.
Available modes are: "exclude" - fragments with
multiple best assignments are excluded from the final
counts; "choose" - the best assignment is randomly
chosen from among the set of best assignments;
"average" - the fragment is divided evenly among the
best assignments; "conf" - only assignments that
exceed a certain threshold (see --conf_prob) are
accepted; "unique" - only uniquely aligned reads are
included; "long_read" - use long reads to determine the best hit.
NOTE: Results using all assignment modes are
included in the LocusMasterTE report by default. This
argument determines what mode will be used for the
"final counts" column. (default: exclude)
Model Parameters:
--rescue_short RESCUE_SHORT
To rescue features captured only by short reads,
a value can be assigned to features with 0 expression in the long-read data.
(default: 0)
--long_read_weight {float}
Weight on long-read information; the numeric range is not limited.
A higher value is recommended when long-read data from a matched tissue or cell type is provided.
A lower value is recommended when the long-read data comes from a different tissue.
(default: 1)
--prior_change {all,theta,none}
Integration of TPM counts from long reads.
"all" changes both pi and theta.
"theta" changes theta only, which influences only multimapping counts.
"none" is equivalent to not integrating long reads.
(default: all)
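The --reassign_mode options described under Run Modes can be illustrated with a small sketch. This is not the code used by LocusMasterTE or Telescope. It only mirrors the help text for "exclude", "choose", "average" and "conf", applied to one fragment whose membership weights are given as a dictionary. The "unique" and "long_read" modes are left out because they need alignment and long-read information that this toy input does not carry. The function name `reassign` and the threshold value 0.9 are assumptions for illustration.

```python
import random


def reassign(weights, mode="exclude", conf_prob=0.9, rng=None):
    """Return {feature: count} for one fragment given its membership weights.

    weights: {feature_name: weight}; conf_prob is an illustrative threshold.
    """
    best_value = max(weights.values())
    best = [f for f, w in weights.items() if w == best_value]
    if mode == "exclude":
        # Fragments with several equally good assignments are dropped.
        return {best[0]: 1.0} if len(best) == 1 else {}
    if mode == "choose":
        # Pick one of the best assignments at random.
        rng = rng or random.Random()
        return {rng.choice(best): 1.0}
    if mode == "average":
        # Split the fragment evenly among the best assignments.
        return {f: 1.0 / len(best) for f in best}
    if mode == "conf":
        # Keep only assignments whose weight exceeds the threshold.
        return {f: 1.0 for f, w in weights.items() if w > conf_prob}
    raise ValueError("unsupported mode: " + mode)
```

For a fragment tied between two loci, "exclude" returns nothing, "average" gives each locus half a count, and "choose" gives one of them the full count.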
LocusMasterTE has three main output files, including the transcript counts estimated via EM (LocusMasterTE-TE_counts.tsv).
The count file is the most important one for downstream analysis.