Open kamurani opened 3 months ago
To clarify, there is indeed a reference file accessible by default, but I am confused by its formatting and by what information within it GLIPH requires in order to work.
For example, the reference file used in my installation yields:
$ tail ~/path/to/my/gliph/gliph/db/tcrab-naive-refdb-pseudovdjfasta.fa
>TRBV9,TRBJ2-7,CSSSVDPGGPLHEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDPGGPLHEQYF;;;;;;;;;;;;;;;
CSSSVDPGGPLHEQYF
>TRBV9,TRBJ2-7,CSSSVDQGAPYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDQGAPYEQYF;;;;;;;;;;;;;;;
CSSSVDQGAPYEQYF
>TRBV9,TRBJ2-7,CSSSVGPSGSYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVGPSGSYEQYF;;;;;;;;;;;;;;;
CSSSVGPSGSYEQYF
>TRBV9,TRBJ2-7,CSSSVGQGAPLYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVGQGAPLYEQYF;;;;;;;;;;;;;;;
CSSSVGQGAPLYEQYF
>TRBV9,TRBJ2-7,CSSSVSDRGWTYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVSDRGWTYEQYF;;;;;;;;;;;;;;;
CSSSVSDRGWTYEQYF
The actual sequence component of each FASTA record appears to be a CDR3b sequence, but I am not sure whether the >header component is also parsed. Also, these records appear to be duplicates of each other. I would appreciate any clarification on this if possible.
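To make the layout easier to inspect, here is a minimal Python sketch that splits records like the ones above into header and sequence, then breaks the header on `;` and `,`. The field names (`v_gene`, `j_gene`, `header_cdr3`) are labels guessed from the sample, not names taken from GLIPH, and nothing here shows which parts GLIPH actually reads.

```python
# Two records copied from the tail output above.
SAMPLE = """\
>TRBV9,TRBJ2-7,CSSSVDPGGPLHEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDPGGPLHEQYF;;;;;;;;;;;;;;;
CSSSVDPGGPLHEQYF
>TRBV9,TRBJ2-7,CSSSVDQGAPYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDQGAPYEQYF;;;;;;;;;;;;;;;
CSSSVDQGAPYEQYF
"""


def parse_records(text):
    """Yield (header, sequence) pairs from FASTA text with one-line sequences."""
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
        elif header is not None:
            yield header, line
            header = None


def split_header(header):
    """Split a header into guessed fields (the names are assumptions)."""
    fields = header.split(";")
    # The first ';'-delimited field looks like "V gene,J gene,CDR3".
    v_gene, j_gene, cdr3 = fields[0].split(",")
    return {"v_gene": v_gene, "j_gene": j_gene, "header_cdr3": cdr3}


records = [(split_header(h), seq) for h, seq in parse_records(SAMPLE)]
for info, seq in records:
    # The last column reports whether the header CDR3 equals the sequence line.
    print(info["v_gene"], info["j_gene"], seq, info["header_cdr3"] == seq)
```

For the records shown, the CDR3 in the header is identical to the sequence line, so the header at least carries the V and J gene names in addition to the sequence itself.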
For the gliph-group-discovery command, an option can be supplied as outlined in the documentation, with the following example: --refdb=mouseDB.fa. However, no example file is provided to show what this file should contain. I am assuming it would be a set of FASTA-formatted sequences, either of whole TCRs (alpha or beta chain?) or of just the CDR3 region.
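One way to test that assumption would be to build a custom file that mirrors the layout of the default reference file. The Python sketch below does this by reusing one header from the default file as a template and swapping in the V gene, J gene and CDR3. It is only a guess at the format: the numeric values (`300 0`, `30 0`) are copied unchanged because their meaning is unknown, `build_record` is a hypothetical helper, and whether GLIPH accepts such a file is unverified.

```python
# Header copied verbatim from the first record of the default file shown above.
TEMPLATE_HEADER = (
    ">TRBV9,TRBJ2-7,CSSSVDPGGPLHEQYF;TRBV9 300 0;;TRBJ2-7 30 0;"
    "CSSSVDPGGPLHEQYF;;;;;;;;;;;;;;;"
)
TEMPLATE_V = "TRBV9"
TEMPLATE_J = "TRBJ2-7"


def build_record(v_gene, j_gene, cdr3):
    """Return one FASTA record laid out like the template (hypothetical helper)."""
    fields = TEMPLATE_HEADER[1:].split(";")
    fields[0] = ",".join((v_gene, j_gene, cdr3))
    # Numeric values after the gene names are kept as-is; their meaning is unknown.
    fields[1] = fields[1].replace(TEMPLATE_V, v_gene)
    fields[3] = fields[3].replace(TEMPLATE_J, j_gene)
    fields[4] = cdr3
    return ">" + ";".join(fields) + "\n" + cdr3 + "\n"


# Rebuild two of the records shown earlier from (V gene, J gene, CDR3) tuples.
entries = [
    ("TRBV9", "TRBJ2-7", "CSSSVDPGGPLHEQYF"),
    ("TRBV9", "TRBJ2-7", "CSSSVDQGAPYEQYF"),
]
refdb_text = "".join(build_record(*entry) for entry in entries)
print(refdb_text, end="")
```

Writing `refdb_text` to a file and passing that file via --refdb would show whether this layout is what the command expects, or whether a plainer FASTA of CDR3 sequences is enough.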
The documentation needs clearer examples here, and it would be helpful to see the original file that GLIPH uses by default (for the human "background repertoire").