immunoengineer / gliph

Grouping of Lymphocyte Interactions by Paratope Hotspots
GNU General Public License v3.0

Clarification on `--refdb` needed #17

Open kamurani opened 3 months ago

kamurani commented 3 months ago

For the `gliph-group-discovery` command, a reference file can be supplied with the `--refdb` option, as outlined in the documentation with the following example: `--refdb=mouseDB.fa`.

However, no example file is provided to show what this file should contain. I am assuming it would be a set of FASTA-formatted sequences, either whole TCRs (alpha or beta chain?) or just the CDR3 region.

The documentation needs clearer examples here, and it would be helpful to see the original file that GLIPH uses by default (for the human "background repertoire").

kamurani commented 3 months ago

To clarify, there is indeed a reference file accessible by default, but I am confused by the formatting of this file and what information within it is required for gliph to work.

For example, the reference file used in my installation yields:

```
$ tail ~/path/to/my/gliph/gliph/db/tcrab-naive-refdb-pseudovdjfasta.fa
>TRBV9,TRBJ2-7,CSSSVDPGGPLHEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDPGGPLHEQYF;;;;;;;;;;;;;;;
CSSSVDPGGPLHEQYF
>TRBV9,TRBJ2-7,CSSSVDQGAPYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDQGAPYEQYF;;;;;;;;;;;;;;;
CSSSVDQGAPYEQYF
>TRBV9,TRBJ2-7,CSSSVGPSGSYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVGPSGSYEQYF;;;;;;;;;;;;;;;
CSSSVGPSGSYEQYF
>TRBV9,TRBJ2-7,CSSSVGQGAPLYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVGQGAPLYEQYF;;;;;;;;;;;;;;;
CSSSVGQGAPLYEQYF
>TRBV9,TRBJ2-7,CSSSVSDRGWTYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVSDRGWTYEQYF;;;;;;;;;;;;;;;
CSSSVSDRGWTYEQYF
```

The sequence line of each FASTA record appears to be a CDR3b sequence, but I am not sure whether the `>` header line is also parsed. Also, these records appear to be duplicates of each other. I would appreciate any clarification on this if possible.
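
To make the question concrete, here is a minimal Python sketch of how I am inspecting the file. It says nothing about how GLIPH itself reads `--refdb`. It only checks two things on records copied from the output above: whether the CDR3 embedded in the header matches the sequence line, and whether any sequence occurs more than once. The header layout it assumes (`V,J,CDR3` before the first `;`) is my guess from the sample, and `read_fasta` and `split_header` are hypothetical helper names, not part of GLIPH.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_header(header):
    """Return (V gene, J gene, CDR3) from the text before the first ';'.

    Assumed layout, inferred from the sample records only.
    """
    first_field = header.split(";", 1)[0]
    v_gene, j_gene, cdr3 = first_field.split(",")
    return v_gene, j_gene, cdr3


# Two records copied from the tail output of the default reference file.
sample = """\
>TRBV9,TRBJ2-7,CSSSVDPGGPLHEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDPGGPLHEQYF;;;;;;;;;;;;;;;
CSSSVDPGGPLHEQYF
>TRBV9,TRBJ2-7,CSSSVDQGAPYEQYF;TRBV9 300 0;;TRBJ2-7 30 0;CSSSVDQGAPYEQYF;;;;;;;;;;;;;;;
CSSSVDQGAPYEQYF
""".splitlines()

records = list(read_fasta(sample))
parsed = [(split_header(header), seq) for header, seq in records]

# Records whose header CDR3 differs from the sequence line.
mismatches = [(fields, seq) for fields, seq in parsed if fields[2] != seq]

# Sequences that occur in more than one record.
counts = Counter(seq for _, seq in records)
duplicates = {seq: n for seq, n in counts.items() if n > 1}

print("records:", len(records))
print("header/sequence mismatches:", len(mismatches))
print("duplicated sequences:", len(duplicates))
```

To run this on a whole file, replace `sample` with an open file handle, since `read_fasta` accepts any iterable of lines.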