liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
272 stars 47 forks source link

Recommended database for getting started with human single-celled data #108

Closed jolespin closed 2 years ago

jolespin commented 2 years ago

I'm taking my first dive into single-celled human data with BCR/TCR sequence construction. Is there a database that you would recommend to start out with? You mentioned IMGT but there are several versions. Can you link which one you would recommend to start?

mourisl commented 2 years ago

We have provided the IMGT database sequence in the repository already (human_IMGT+C.fa), so you can start with it.

jolespin commented 2 years ago

Thank you. I'll use this to start.

jolespin commented 2 years ago

A few follow up questions:

  1. When would you use hg38_bcrtcr.fa instead of human_IMGT+C.fa?
  2. Is it typical to use the same database (e.g., human_IMGT+C.fa) for both the -f and --ref arguments?
  3. Have you experienced this error?
    [Thu Dec 23 09:31:50 2021] SYSTEM CALL: perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl test_cdr3.out test_annot.fa > test_airr.tsv
    Argument "assemble3" isn't numeric in int at /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl line 144, <FP> line 1.
mourisl commented 2 years ago
  1. For BAM file input, you have to use hg38_bcrtcr.fa as the -f input. For the fastq input, the results from using hg38_bcrtcr.fa or human_IMGT+C.fa are almost identical. Since hg38_bcrtcr.fa has fewer redundant sequences and could improve the running time, I usually use hg38_bcrtcr.fa file for -f option.
  2. Yes. If you want to use the same file on both options, you can actually just use "-f human_IMGT+C.fa" without --ref for short.
  3. Is your TRUST4 version the most recent github version? I think in the current version, the trust-airr.pl command takes test_report.tsv as input instead of test_cdr3.out.
jolespin commented 2 years ago
  1. For BAM file input, you have to use hg38_bcrtcr.fa as the -f input. For the fastq input, the results from using hg38_bcrtcr.fa or human_IMGT+C.fa are almost identical. Since hg38_bcrtcr.fa has fewer redundant sequences and could improve the running time, I usually use hg38_bcrtcr.fa file for -f option.

For the BAM file, is it the reads mapped to what was used with the --ref or -f flag?

  1. Yes. If you want to use the same file on both options, you can actually just use "-f human_IMGT+C.fa" without --ref for short.

Got it so if nothing is used for --ref then it automatically propagates the argument from -f (where -f is mandatory)?

  1. Is your TRUST4 version the most recent github version? I think in the current version, the trust-airr.pl command takes test_report.tsv as input instead of test_cdr3.out.

Here's my version:

(trust4_env) -bash-4.2$ conda list | grep "trust4"
# packages in environment at /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env:
trust4                    1.0.6                h2e03b76_0    bioconda

Also, here was my full command for a few more details. First the main script couldn't find trust-simplerep.pl in the path so I copied it over to bin.

(trust4_env) -bash-4.2$ run-trust4 -1 trimmed_repaired_1.fastq.gz -2 trimmed_repaired_2.fastq.gz -t 4 -f ../TRUST4/human_IMGT+C.fa --ref ../TRUST4/human_IMGT+C.fa -o test
[Thu Dec 23 09:25:39 2021] TRUST4 begins.
[Thu Dec 23 09:25:39 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/fastq-extractor -t 4 -f ../TRUST4/human_IMGT+C.fa -o test_toassemble  -1 trimmed_repaired_1.fastq.gz -2 trimmed_repaired_2.fastq.gz
[Thu Dec 23 09:25:39 2021] Start to extract candidate reads from read files.
[Thu Dec 23 09:26:21 2021] Finish extracting reads.
[Thu Dec 23 09:26:21 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust4  -t 4 -f ../TRUST4/human_IMGT+C.fa -o test -1 test_toassemble_1.fq -2 test_toassemble_2.fq
[Thu Dec 23 09:26:21 2021] Found 2088 reads.
[Thu Dec 23 09:26:21 2021] Finish sorting the reads.
[Thu Dec 23 09:26:21 2021] Finish rough annotations.
[Thu Dec 23 09:26:21 2021] Assembled 78 reads.
[Thu Dec 23 09:26:21 2021] Try to rescue 0 reads for assembly.
[Thu Dec 23 09:26:21 2021] Rescued 0 reads.
[Thu Dec 23 09:26:21 2021] Extend assemblies by mate pair information.
[Thu Dec 23 09:26:21 2021] Remove redundant assemblies.
[Thu Dec 23 09:26:21 2021] Finish assembly.
[Thu Dec 23 09:26:21 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/annotator -f ../TRUST4/human_IMGT+C.fa -a test_final.out -t 4 -o test  -r test_assembled_reads.fa > test_annot.fa
[Thu Dec 23 09:26:21 2021] Start to annotate assemblies.
[Thu Dec 23 09:26:21 2021] Start to realign reads for CDR3 analysis.
[Thu Dec 23 09:26:21 2021] Compute CDR3 abundance.
[Thu Dec 23 09:26:21 2021] Finish annotation.
[Thu Dec 23 09:26:21 2021] SYSTEM CALL: perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-simplerep.pl test_cdr3.out  > test_report.tsv
[Thu Dec 23 09:26:22 2021] SYSTEM CALL: perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl test_cdr3.out test_annot.fa > test_airr.tsv
Can't open perl script "/local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl": No such file or directory
system perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl test_cdr3.out test_annot.fa > test_airr.tsv failed: 512 at /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/run-trust4 line 48.

# Checking if file exists
(trust4_env) -bash-4.2$ less /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl
/local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl: No such file or directory
# Copying it over
(trust4_env) -bash-4.2$ cp trust-airr.pl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl
# Rerun it
(trust4_env) -bash-4.2$ run-trust4 -1 trimmed_repaired_1.fastq.gz -2 trimmed_repaired_2.fastq.gz -t 4 -f ../TRUST4/human_IMGT+C.fa --ref ../TRUST4/human_IMGT+C.fa -o test
[Thu Dec 23 09:31:08 2021] TRUST4 begins.
[Thu Dec 23 09:31:08 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/fastq-extractor -t 4 -f ../TRUST4/human_IMGT+C.fa -o test_toassemble  -1 trimmed_repaired_1.fastq.gz -2 trimmed_repaired_2.fastq.gz
[Thu Dec 23 09:31:08 2021] Start to extract candidate reads from read files.
[Thu Dec 23 09:31:49 2021] Finish extracting reads.
[Thu Dec 23 09:31:49 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust4  -t 4 -f ../TRUST4/human_IMGT+C.fa -o test -1 test_toassemble_1.fq -2 test_toassemble_2.fq
[Thu Dec 23 09:31:49 2021] Found 2088 reads.
[Thu Dec 23 09:31:49 2021] Finish sorting the reads.
[Thu Dec 23 09:31:49 2021] Finish rough annotations.
[Thu Dec 23 09:31:49 2021] Assembled 78 reads.
[Thu Dec 23 09:31:49 2021] Try to rescue 0 reads for assembly.
[Thu Dec 23 09:31:49 2021] Rescued 0 reads.
[Thu Dec 23 09:31:50 2021] Extend assemblies by mate pair information.
[Thu Dec 23 09:31:50 2021] Remove redundant assemblies.
[Thu Dec 23 09:31:50 2021] Finish assembly.
[Thu Dec 23 09:31:50 2021] SYSTEM CALL: /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/annotator -f ../TRUST4/human_IMGT+C.fa -a test_final.out -t 4 -o test  -r test_assembled_reads.fa > test_annot.fa
[Thu Dec 23 09:31:50 2021] Start to annotate assemblies.
[Thu Dec 23 09:31:50 2021] Start to realign reads for CDR3 analysis.
[Thu Dec 23 09:31:50 2021] Compute CDR3 abundance.
[Thu Dec 23 09:31:50 2021] Finish annotation.
[Thu Dec 23 09:31:50 2021] SYSTEM CALL: perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-simplerep.pl test_cdr3.out  > test_report.tsv
[Thu Dec 23 09:31:50 2021] SYSTEM CALL: perl /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl test_cdr3.out test_annot.fa > test_airr.tsv
Argument "assemble3" isn't numeric in int at /local/ifs2_devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/trust-airr.pl line 144, <FP> line 1.
[Thu Dec 23 09:31:50 2021] TRUST4 finishes.

Hope this helps!

mourisl commented 2 years ago
  1. The BAM input means the reads is already aligned with tools like HISAT or STAR. So to extract candidate reads, TRUST4 checks the aligned reads to see whether they are falling into the V, J, C genes, and the gene coordinate is provided in the hg38_bcrtcr.fa file. Since CDR3 is not represented on the reference genome, it is important to contain the unmapped reads in the BAM file.
  2. Exactly, -f is a mandatory optioin.
  3. This error is very strange since the v1.0.6 line 144 does not need numeric information. It feels like the script is from the current github repo. Could you please download and compile using github version with "git clone"? Thanks.
jolespin commented 2 years ago

Thank you. I just ran make right now. Which files should be copied to bin in my conda environment? (i.e., /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/)

(base) -bash-4.2$ tree .
.
├── AlignAlgo.hpp
├── alignments.hpp
├── annotator
├── Annotator.cpp
├── Annotator.o
├── bam-extractor
├── BamExtractor.cpp
├── BamExtractor.o
├── BarcodeCorrector.hpp
├── BuildDatabaseFa.pl
├── BuildImgtAnnot.pl
├── BuildImgtVquestAnnot.pl
├── Cgene.list
├── defs.h
├── example
│   ├── example_1.fq
│   ├── example_2.fq
│   ├── example.bam
│   ├── TRUST_example_annot.fa
│   ├── TRUST_example_cdr3.out
│   └── TRUST_example_report.tsv
├── fastq-extractor
├── FastqExtractor.cpp
├── FastqExtractor.o
├── FilterAnnotatedAssembly.pl
├── hg19_bcrtcr.fa
├── hg38_bcrtcr.fa
├── human_IMGT+C.fa
├── human_vdjc.list
├── KmerCode.hpp
├── KmerCount.hpp
├── KmerIndex.hpp
├── kseq.h
├── LICENSE.txt
├── main.cpp
├── main.o
├── Makefile
├── mouse
│   ├── GRCm38_bcrtcr.fa
│   └── mouse_IMGT+C.fa
├── ReadFiles.hpp
├── README.md
├── run-trust4
├── samtools-0.1.19
│   ├── AUTHORS
│   ├── bam2bcf.c
│   ├── bam2bcf.h
│   ├── bam2bcf_indel.c
│   ├── bam2bcf_indel.o
│   ├── bam2bcf.o
│   ├── bam2depth.c
│   ├── bam2depth.o
│   ├── bam_aux.c
│   ├── bam_aux.o
│   ├── bam.c
│   ├── bam_cat.c
│   ├── bam_cat.o
│   ├── bam_color.c
│   ├── bam_color.o
│   ├── bam_endian.h
│   ├── bam.h
│   ├── bam_import.c
│   ├── bam_import.o
│   ├── bam_index.c
│   ├── bam_index.o
│   ├── bam_lpileup.c
│   ├── bam_lpileup.o
│   ├── bam_mate.c
│   ├── bam_mate.o
│   ├── bam_md.c
│   ├── bam_md.o
│   ├── bam.o
│   ├── bam_pileup.c
│   ├── bam_pileup.o
│   ├── bam_plcmd.c
│   ├── bam_plcmd.o
│   ├── bam_reheader.c
│   ├── bam_reheader.o
│   ├── bam_rmdup.c
│   ├── bam_rmdup.o
│   ├── bam_rmdupse.c
│   ├── bam_rmdupse.o
│   ├── bamshuf.c
│   ├── bamshuf.o
│   ├── bam_sort.c
│   ├── bam_sort.o
│   ├── bam_stat.c
│   ├── bam_stat.o
│   ├── bamtk.c
│   ├── bamtk.o
│   ├── bam_tview.c
│   ├── bam_tview_curses.c
│   ├── bam_tview_curses.o
│   ├── bam_tview.h
│   ├── bam_tview_html.c
│   ├── bam_tview_html.o
│   ├── bam_tview.o
│   ├── bcftools
│   │   ├── bcf2qcall.c
│   │   ├── bcf2qcall.o
│   │   ├── bcf.c
│   │   ├── bcf.h
│   │   ├── bcf.o
│   │   ├── bcf.tex
│   │   ├── bcftools
│   │   ├── bcfutils.c
│   │   ├── bcfutils.o
│   │   ├── call1.c
│   │   ├── call1.o
│   │   ├── em.c
│   │   ├── em.o
│   │   ├── fet.c
│   │   ├── fet.o
│   │   ├── index.c
│   │   ├── index.o
│   │   ├── kfunc.c
│   │   ├── kfunc.o
│   │   ├── kmin.c
│   │   ├── kmin.h
│   │   ├── kmin.o
│   │   ├── libbcf.a
│   │   ├── main.c
│   │   ├── main.o
│   │   ├── Makefile
│   │   ├── mut.c
│   │   ├── mut.o
│   │   ├── prob1.c
│   │   ├── prob1.h
│   │   ├── prob1.o
│   │   ├── README
│   │   ├── vcf.c
│   │   ├── vcf.o
│   │   └── vcfutils.pl
│   ├── bedcov.c
│   ├── bedcov.o
│   ├── bedidx.c
│   ├── bedidx.o
│   ├── bgzf.c
│   ├── bgzf.h
│   ├── bgzf.o
│   ├── bgzip.c
│   ├── ChangeLog.old
│   ├── COPYING
│   ├── cut_target.c
│   ├── cut_target.o
│   ├── errmod.c
│   ├── errmod.h
│   ├── errmod.o
│   ├── examples
│   │   ├── 00README.txt
│   │   ├── bam2bed.c
│   │   ├── calDepth.c
│   │   ├── chk_indel.c
│   │   ├── ex1.fa
│   │   ├── ex1.sam.gz
│   │   ├── Makefile
│   │   ├── toy.fa
│   │   └── toy.sam
│   ├── faidx.c
│   ├── faidx.h
│   ├── faidx.o
│   ├── INSTALL
│   ├── kaln.c
│   ├── kaln.h
│   ├── kaln.o
│   ├── khash.h
│   ├── klist.h
│   ├── knetfile.c
│   ├── knetfile.h
│   ├── knetfile.o
│   ├── kprobaln.c
│   ├── kprobaln.h
│   ├── kprobaln.o
│   ├── kseq.h
│   ├── ksort.h
│   ├── kstring.c
│   ├── kstring.h
│   ├── kstring.o
│   ├── libbam.a
│   ├── Makefile
│   ├── Makefile.mingw
│   ├── misc
│   │   ├── ace2sam
│   │   ├── ace2sam.c
│   │   ├── ace2sam.o
│   │   ├── bamcheck
│   │   ├── bamcheck.c
│   │   ├── bamcheck.o
│   │   ├── blast2sam.pl
│   │   ├── bowtie2sam.pl
│   │   ├── export2sam.pl
│   │   ├── HmmGlocal.java
│   │   ├── interpolate_sam.pl
│   │   ├── Makefile
│   │   ├── maq2sam.c
│   │   ├── maq2sam-long
│   │   ├── maq2sam-short
│   │   ├── md5.c
│   │   ├── md5fa
│   │   ├── md5fa.c
│   │   ├── md5fa.o
│   │   ├── md5.h
│   │   ├── md5.o
│   │   ├── md5sum-lite
│   │   ├── md5sum-lite.o
│   │   ├── novo2sam.pl
│   │   ├── plot-bamcheck
│   │   ├── psl2sam.pl
│   │   ├── r2plot.lua
│   │   ├── sam2vcf.pl
│   │   ├── samtools.pl
│   │   ├── soap2sam.pl
│   │   ├── varfilter.py
│   │   ├── vcfutils.lua
│   │   ├── wgsim
│   │   ├── wgsim.c
│   │   ├── wgsim_eval.pl
│   │   ├── wgsim.o
│   │   └── zoom2sam.pl
│   ├── NEWS
│   ├── padding.c
│   ├── padding.o
│   ├── phase.c
│   ├── phase.o
│   ├── razf.c
│   ├── razf.h
│   ├── razf.o
│   ├── razip.c
│   ├── sam.c
│   ├── sam.h
│   ├── sam_header.c
│   ├── sam_header.h
│   ├── sam_header.o
│   ├── sam.o
│   ├── sample.c
│   ├── sample.h
│   ├── sample.o
│   ├── samtools
│   ├── samtools.1
│   ├── sam_view.c
│   ├── sam_view.o
│   └── win32
│       ├── xcurses.h
│       ├── zconf.h
│       └── zlib.h
├── scripts
│   ├── AddSequenceToCDR3File.pl
│   ├── barcoderep-filter.py
│   ├── GetFullLengthAssembly.pl
│   ├── trust-barcoderep-to-10X.pl
│   ├── trust-cluster.py
│   └── trust-stats.py
├── SeqSet.hpp
├── SimpleVector.hpp
├── trust4
├── trust-airr.pl
├── trust-barcoderep.pl
├── trust-simplerep.pl
└── trust-smartseq.pl

8 directories, 251 files
mourisl commented 2 years ago

I think you can just replace the trust-airr.pl, the other code should not affect the running results.

jolespin commented 2 years ago

I'll need to do a deep dive into this. I had trouble with some of the backend libraries after I ran make:

(trust4_env) -bash-4.2$ ./annotator
./annotator: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./annotator)
./annotator: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./annotator)
./annotator: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by ./annotator)
(trust4_env) -bash-4.2$ conda list
# packages in environment at /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2021.10.8            ha878542_0    conda-forge
htslib                    1.14                 h9093b5e_0    bioconda
krb5                      1.19.2               hcc1bbae_3    conda-forge
libcurl                   7.80.0               h2574ce0_0    conda-forge
libdeflate                1.7                  h7f98852_5    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
libnghttp2                1.43.0               h812cca2_1    conda-forge
libssh2                   1.10.0               ha56f1ee_2    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
openssl                   1.1.1l               h7f98852_0    conda-forge
perl                      5.32.1          1_h7f98852_perl5    conda-forge
samtools                  1.14                 hb421002_0    bioconda
tk                        8.6.11               h27826a3_1    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
jolespin commented 2 years ago

Got it to work.

What I did was:

  1. conda create -n trust4_env -c bioconda trust4 -y to create a new conda environment with all the dependencies
  2. It looks like the conda recipe excludes a few of the files:
    $ ls trust4-1.0.6-h2e03b76_0/bin
    BuildDatabaseFa.pl  fastq-extractor     trust-smartseq.pl
    BuildImgtAnnot.pl   run-trust4      trust4
    annotator       trust-barcoderep.pl
    bam-extractor       trust-simplerep.pl

    So I symlinked the missing files from the proper release:

# Download v1.0.6
wget https://github.com/liulab-dfci/TRUST4/archive/refs/tags/v1.0.6.zip

# Unzip it
unzip v1.0.6.zip

# Symlink files
ln -sf /usr/local/devel/ANNOTATION/jespinoz/Packages/TRUST4/TRUST4-1.0.6/*.pl /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/

ln -sf /usr/local/devel/ANNOTATION/jespinoz/Packages/TRUST4/TRUST4-1.0.6/scripts/* /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/trust4_env/bin/
mourisl commented 2 years ago

Thank you! I just realized the bioconda receipt might miss some scripts introduced in the later versions. I will update it later.

jolespin commented 2 years ago

Thanks for working through that with me! I hope it was mutually helpful.

Do you recommend any post processing pipelines or interpreting the output? Apologies in advance if this out of scope.

mourisl commented 2 years ago

In the scripts folder, we have trust-stats.py to compute some basic diversity statistics for each sample. You can also use tools like VDJTools, Immunarch for downstream analysis.

PedroMilanezAlmeida commented 2 years ago

Maybe a good idea to re-open this?

I got the same error ./annotator: /lib64/libstdc++.so.6: version 'GLIBCXX_3.4.21' not found (required by ./annotator) and got it to work after following the instructions from here: https://github.com/liulab-dfci/TRUST4/issues/108#issuecomment-1000614163.

I don't know if it is a good idea to add the instructions above to the "how to install" or to automate it (not sure if possible).

PS: thanks for making and publishing this tool!

mourisl commented 2 years ago

@PedroMilanezAlmeida This seems more of the conda environment issue. Let me see whether I can update bioconda's receipt to resolve this.