mikessh / vdjtools-doc

THIS REPOSITORY IS DEPRECATED, FOR ISSUE TRACKER GO TO
https://github.com/mikessh/vdjtools
1 stars 2 forks source link

AA annotation, usage of VJ subregion for VDJ recombinations #3

Closed rspreafico closed 8 years ago

rspreafico commented 8 years ago

Hi,

congrats on a great piece of software!

I have TCRB data (so VDJ recombination) from ImmunoSEQ. I get a lot of skipped lines when I try to use CalcCdrAaProfile with VD-junc, D-germ and DJ-junc. If I use VJ-junc instead, I don't get any skipped lines. There is a warning in the documentation that ImmunoSEQ data sometimes lacks the full J segment, but this seems to be different, more related to D segments.

From the documentation, it is not clear to me what the behavior of VJ-junc is in the presence of VDJ recombination. Does it still bin the whole V-D-J junction region, or does it take into account only the portion of DNA coming from the V and J segments (i.e. disregarding DNA coming from the D segment)? Would it be possible to edit the documentation to make this point clear?

Thank you, Roberto

mikessh commented 8 years ago

Hello and thanks!

First, I've added an ImmunoseqV3 parser in the develop format. I'm unsure whether the latest Immunoseq data is parseable with standard Immunoseq parser. The develop version has also a completely re-written CalcCdrAaProfile and Annotate routines, but the docs are only those provided in the console with -h argument.

VJ-junc option (as well as "NDN size" in basic statistics) are always computed for NTs/AAs between V segment end and J segment start, no matter whether D segment was identified or not. The V-D and D-J (and "insert size" in basic statistics) are only computed for clonotypes with identified D segment, and those without D segment are skipped.

rspreafico commented 8 years ago

Thank you! This is very helpful!

rspreafico commented 8 years ago

I downloaded 1.1.2-SNAPSHOT and noticed that there is an ImmunoSeqV3 format now. I tried the previous v1.1.1. with ImmunoSeq format, and the new one with both ImmunoSeq and ImmunoSeqV3 format. The ImmunoSeqV3 does not parse my V3 files. The ImmunoSeq format (for both 1.1.1 and 1.1.2-SNAPSHOT) does parse my V3 files, however many clonotypes get discarded.

[Mon Sep 19 17:20:06 PDT 2016 Convert] Reading sample(s)
[Mon Sep 19 17:20:06 PDT 2016 Convert] 12 sample(s) loaded
[Mon Sep 19 17:20:06 PDT 2016 SampleStreamConnection] Loading sample 4651_Day_0
[WARNING] Some of the essential fields are bad/missing for the following clonotype string (displaying first 5 warnings)
NO_CDR3AA,NO_V:
TTCACATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCCAGGGGAACACTGAAGCTTTCTTTGGACAA     138 9.228732119331519E-4    37  TCRBV03 TCRBV03             TCRBV03-01,TCRBV03-02       TCRBD01-01*01   TCRBD01 TCRBD01-01  01          TCRBJ01-01*01   TCRBJ01 TCRBJ01-01  01              4   0   4   3   0   1   44  -1  57  -1  62  138 Out VDJ null    null    null    null    null    null    null    null    null    null    null    null
NO_V:
TTGGAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGAAAAGCTGGCGCCTACAATGAGCAGTTCTTCGGGCCA CASRKAGAYNEQFF  45  3.0093691693472343E-4   42  TCRBV06 TCRBV06             TCRBV06-02,TCRBV06-03                       TCRBD01,TCRBD02 TCRBD01-01,TCRBD02-01       TCRBJ02-01*01   TCRBJ02 TCRBJ02-01  01              6   7   9   0   1   2   39  50  57  60  61  45  In  VDJ null    null    null    null    null    null    null    null    null    null    null    null
NO_CDR3AA:
TCATCCTGAGTTCTAAGAAGCTCCTTCTCAGTGACTCTGGCTTCTATCTCTGTGCCTGGCCAGAATGAAAAACTGTTTTTTGGCAGT     7   4.6812409300956976E-5   31  TCRBV30-01*01   TCRBV30 TCRBV30-01  01              TCRBD01-01*01   TCRBD01 TCRBD01-01  01          TCRBJ01-04*01   TCRBJ01 TCRBJ01-04  01              5   1   4   5   0   5   50  59  60  -1  63  7   Out VDJ null    null    null    null    null    null    null    null    null    null    null    null
NO_V:
GTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTCCGGCAGGTCGCATGAACACTGAAGCTTTCTTTGGACAA CSAPAGRMNTEAFF  54  3.6112430032166814E-4   42  TCRBV20 TCRBV20             TCRBV20-01,TCRBV20-or09_02      TCRBD01-01*01   TCRBD01 TCRBD01-01  01  TCRBJ01-01*01   TCRBJ01 TCRBJ01-01  01              5   4   4   4   5   0   39  48  52  56  61  54  In  VDJ null    null    null    null    null    null    null    null    null    null    null    null
NO_CDR3AA:
TCCTGAGTTCTAAGAAGCTCCTTCTCAGTGACTCTGGCTTCTATCTCTGTGCCTGGAGGCCCCCGCCTCGAGCAGTACTTCGGGCCG     5   3.343743521496927E-5    34  TCRBV30-01*01   TCRBV30 TCRBV30-01  01                              TCRBD01,TCRBD02 TCRBD01-01,TCRBD02-01       TCRBJ02-07*01   TCRBJ02 TCRBJ02-07  01              3   0   10  0   8   6   47  -1  58  60  68  5   Out VDJ null    null    null    null    null    null    null    null    null    null    null    null
[Mon Sep 19 17:20:09 PDT 2016 ClonotypeStreamParser] Finished parsing. 1 header and 47733 bad line(s) were skipped.
[Mon Sep 19 17:20:09 PDT 2016 SampleStreamConnection] Loaded sample 4651_Day_0 with 85785 clonotypes and 96656 cells. Memory usage: 0 of 1 GB
[Mon Sep 19 17:20:09 PDT 2016 Convert] Processed 1 sample(s).. Writing output

Funny enough, if I read the original TSV files and rewrite them out through R, and then feed those to vdjtools, no warnings anymore:

[Mon Sep 19 17:19:34 PDT 2016 Convert] Reading sample(s)
[Mon Sep 19 17:19:34 PDT 2016 Convert] 12 sample(s) loaded
[Mon Sep 19 17:19:34 PDT 2016 SampleStreamConnection] Loading sample 4651_Day_0
[Mon Sep 19 17:19:37 PDT 2016 ClonotypeStreamParser] Finished parsing. 1 header and 0 bad line(s) were skipped.
[Mon Sep 19 17:19:37 PDT 2016 SampleStreamConnection] Loaded sample 4651_Day_0 with 111757 clonotypes and 149533 cells. Memory usage: 0 of 1 GB
[Mon Sep 19 17:19:37 PDT 2016 Convert] Processed 1 sample(s).. Writing output

However, some clonotypes get still silently discarded by the parser, even in the absence of errors. For example, for this file, there are originally 134344 clonotypes. WIth the first method (ImmunoSeq format, gives some errors) I get 85786, with the second method (R, then ImmunoSeq format, no explicit errors) I get 111758. With ImmunoSeqV3, it won't parse at all.

Also, I noticed that the new AA annotation function does not allow to select the bin size of regions anymore.

Thanks,

Roberto

rspreafico commented 8 years ago

PS: the code to import/export from R that I used is the following:

library(tidyverse)
input <- list()
for(f in list.files(path=".", pattern="*tsv")) input[[f]] <- read_tsv(f) 
for(i in names(input)) input[[i]] %>% write_tsv(path = i)
rspreafico commented 8 years ago

I have to take back the statement that after reading/writing the data with R, there are missing clonotypes in the absence of errors. I hadn't realized that vdjtools internally consolidates some clonotypes - the count column total is identical for input and output of vdjtools. So it appears that having the data go through R fixes the (v2) parser errors. All R does is inserting NAs where empty cells originally were.