immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0

Support for bulk TCR deconvolution (TRUST4) #161

Closed Michael-Geuenich closed 11 months ago

Michael-Geuenich commented 3 years ago

Hi, I would really like to use some of the functionality in immunarch.

I have TCR sequences from bulk RNA-sequencing data that I inferred with TRUST4 (https://github.com/liulab-dfci/TRUST4). According to the authors, TRUST4 outputs a VDJtools-compatible file. However, repLoad does not recognize the input. Would it be possible to add support for this type of output?

The output from TRUST4 looks something like this (usually with many more rows):

#count  frequency   CDR3nt  CDR3aa  V   D   J   C   cid cid_full_length
152 0.2771141   TGTCAGCCGTATTTTATCCGCTCACTTTC   CQQYFATSPLTF    IGKV4-1*01  .   IGKJ4*01    IGKC    assemble4   1
116 0.212622    TGTCATCAAATATTATACTTTCACACTTTC  CQQYYSTFSLTF    IGKV4-1*01  .   IGKJ4*01    IGKC    assemble18  1
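For reference, a report in this layout can be loaded with base R alone. The sketch below assumes the file is tab-separated (the spacing shown above may be an artifact of pasting) and uses a hypothetical helper name, `read_trust4_report`, which is not part of immunarch. Setting `check.names = FALSE` stops `read.delim` from mangling the `#count` header into `X.count`, so the `#` can be stripped explicitly.

```r
# Two TRUST4-style rows copied from the example above.
# Assumption: the real file is tab-separated.
trust4_lines <- c(
  "#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length",
  "152\t0.2771141\tTGTCAGCCGTATTTTATCCGCTCACTTTC\tCQQYFATSPLTF\tIGKV4-1*01\t.\tIGKJ4*01\tIGKC\tassemble4\t1",
  "116\t0.212622\tTGTCATCAAATATTATACTTTCACACTTTC\tCQQYYSTFSLTF\tIGKV4-1*01\t.\tIGKJ4*01\tIGKC\tassemble18\t1"
)
path <- tempfile(fileext = ".tsv")
writeLines(trust4_lines, path)

# Hypothetical helper (not an immunarch function): read the report and
# drop the leading '#' from the first column name.
read_trust4_report <- function(path) {
  df <- read.delim(
    path,
    header = TRUE,
    sep = "\t",
    quote = "",
    comment.char = "",       # do not treat '#' as a comment marker
    check.names = FALSE,     # keep '#count' as-is instead of 'X.count'
    stringsAsFactors = FALSE
  )
  names(df) <- sub("^#", "", names(df))
  df
}

report <- read_trust4_report(path)
print(names(report))
```

This only covers reading the table; mapping the columns onto immunarch's internal format is a separate step.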

One issue is that immunarch does not appear to be able to deal with the leading # in the #count column name. But even if I remove that manually, another issue is that this file does not contain vend in the column names. Changing

else if (str_detect(tolower(l), "cdr3nt") && str_detect(tolower(l), "vend") && str_detect(tolower(l), "v")) {
  res_format <- "vdjtools"
}

in R/io.R to

else if (str_detect(tolower(l), "cdr3nt") && str_detect(tolower(l), "v")) {
  res_format <- "vdjtools"
}

fixes the issue for some files, but for others I get the following error:

Error: Assigned data `df[[.dstart]] - df[[.vend]] - 1` must be compatible with existing data.
x Existing data has 3 rows.
x Assigned data has 0 rows.
i Only vectors of size 1 are recycled.
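One plausible reading of this error (not confirmed in the thread) is that the columns used for the insertion calculation are simply absent from the TRUST4 report. In R, extracting a missing column with `[[` returns `NULL`, and arithmetic on `NULL` yields a zero-length vector, which cannot be assigned to a table with rows. The sketch below reproduces that behaviour on a toy table and shows one possible guard. The column names `VEnd` and `DStart` and the helper `safe_vd_insertions` are hypothetical.

```r
# Toy table with three clonotypes and no V-end / D-start columns,
# mimicking a TRUST4 report.
df <- data.frame(
  CDR3nt = c("TGTGCC", "TGTGCG", "TGTGCA"),
  stringsAsFactors = FALSE
)

# Missing columns come back as NULL; NULL - NULL - 1 has length 0.
vd_ins <- df[["DStart"]] - df[["VEnd"]] - 1
print(length(vd_ins))

# Hypothetical guard: fall back to NA when either column is absent,
# so the result always has one entry per row.
safe_vd_insertions <- function(df, vend = "VEnd", dstart = "DStart") {
  if (!all(c(vend, dstart) %in% names(df))) {
    return(rep(NA_integer_, nrow(df)))
  }
  df[[dstart]] - df[[vend]] - 1
}

df$VD.ins <- safe_vd_insertions(df)
print(df)
```

Filling with `NA` keeps the table shape intact while making it clear that insertion counts are unavailable for this format.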

I would be happy to be a beta tester for this functionality or to help implement it, if that is of interest.

vadimnazarov commented 3 years ago

Hi Michael,

We implemented support for TRUST4 in the latest pre-release version of Immunarch. Please install it and let us know if it works. Instructions: https://immunarch.com/articles/v1_introduction.html. Since this is a pre-release version, it might work inconsistently. We will be glad to fix any issues promptly to stabilize the TRUST4 parser.

Best, Vadim

Michael-Geuenich commented 3 years ago

Hi Vadim,

Thanks for the quick reply! Works like a charm for the most part. I have only two things to note:

  1. I have one sample where TRUST4 returned no TCR sequences. This results in the following warning while reading in:

    Can't determine the type of V(D)J recombination. No insertions will be presented in the resulting data table.  [!] Warning: zero clonotypes found, skipping

  2. I get the following warning for a second sample:

    Can't determine the type of V(D)J recombination. No insertions will be presented in the resulting data table.

The sample looks something like this (note that I've had to change the sequences for privacy reasons):

#count  frequency   CDR3nt  CDR3aa  V   D   J   C   cid cid_full_length
56  1.000000e+00    TGTGCGTGGAGCTGGAACCAGCTGCTGACCTTTGGTTCGGCGGACTTCTGG CAWGWNQLLTFGSADFW   IGHV3-7*01  IGHD2-2*01  IGHJ5*01    IGHA1   assemble7   0
2   1.000000e+00    TGTTTCAATTACGCTACCCCGTGGTCGTTC  CFNYATPWSF  IGKV1-39*01 .   IGKJ1*01    IGKC    assemble47  0

Not sure why the second sample is returning a warning.

The only other suggestion I have is to also create a data frame listing all the samples that were not included in the data table, along with the reason why. I have over 500 samples, so it is a bit tedious to manually scroll through all the messages. But this is a really minor thing.
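Until something like that exists, a user-side workaround could look like the sketch below: load files one at a time, capture any warnings or errors, and collect them in a per-sample log. It uses only base R. `parse_sample` is a hypothetical stand-in for whatever reader is actually used and is not an immunarch function, and the file is assumed to be tab-separated.

```r
# Hypothetical stand-in for a per-file parser (not an immunarch function).
parse_sample <- function(path) {
  df <- read.delim(path, comment.char = "", check.names = FALSE,
                   stringsAsFactors = FALSE)
  if (nrow(df) == 0) warning("zero clonotypes found, skipping")
  df
}

# Load each file, recording warnings/errors instead of printing them.
load_with_log <- function(paths) {
  loaded <- list()
  log_rows <- list()
  for (p in paths) {
    msgs <- character(0)
    df <- withCallingHandlers(
      tryCatch(parse_sample(p), error = function(e) {
        msgs <<- c(msgs, conditionMessage(e))
        NULL
      }),
      warning = function(w) {
        msgs <<- c(msgs, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
    kept <- !is.null(df) && nrow(df) > 0
    if (kept) loaded[[basename(p)]] <- df
    log_rows[[length(log_rows) + 1]] <- data.frame(
      sample = basename(p),
      kept = kept,
      messages = paste(msgs, collapse = "; "),
      stringsAsFactors = FALSE
    )
  }
  list(data = loaded, log = do.call(rbind, log_rows))
}

# Toy input: one header-only report and one report with a single row.
header <- "#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length"
row <- "2\t1.000000e+00\tTGTTTCAATTACGCTACCCCGTGGTCGTTC\tCFNYATPWSF\tIGKV1-39*01\t.\tIGKJ1*01\tIGKC\tassemble47\t0"
dir <- tempfile("trust4_")
dir.create(dir)
empty_path <- file.path(dir, "sample_empty.tsv")
full_path <- file.path(dir, "sample_ok.tsv")
writeLines(header, empty_path)
writeLines(c(header, row), full_path)

result <- load_with_log(c(empty_path, full_path))
skipped <- result$log[!result$log$kept, ]
print(skipped)
```

With several hundred samples, `skipped` can then be filtered or written to disk rather than read off the console.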

Thanks for the help! Michael