qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/
GNU General Public License v3.0

Remove superfluous IMGT information in TCR beta model parms file #72

Open zacmon opened 1 year ago

zacmon commented 1 year ago

The default TCR beta model_parms.txt contains extraneous IMGT information where, ideally, only the allele name should appear. Compare this with the model_parms.txt files for IGL, IGK, IGH, and TCR alpha. To my knowledge this extra information doesn't cause a problem for IGoR itself, but it has consequential downstream effects. In particular, OLGA (and therefore SONIA and soNNia) requires that only the allele name precede the allele sequence in its model_params.txt, which is taken from the final_parms.txt file of a custom-trained IGoR model. Notably, the default TCR beta OLGA model doesn't contain this extra IMGT information.

Training a TCR beta model without supplying a model_parms.txt would result in the custom model's final_parms.txt being roughly identical to the default model_parms.txt, with the extra IMGT information still present (but with a different error rate). Currently, OLGA does not raise an exception if the allele name is not the only piece of information preceding the allele sequence, so a user with a custom TCR beta model from IGoR would not know what the problem is. While there are fixes to be made in OLGA so that users are told when parsing or input-file errors occur, it would set everyone up for success if the superfluous IMGT information were removed from the default TCR beta model_parms.txt.
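Until such a check exists in OLGA, a user could scan a parms file for the problem before passing it on. The snippet below is only a rough sketch and is not part of IGoR or OLGA. It assumes that allele entries are lines beginning with `%`, that the name is the text before the first `;`, and that leftover IMGT information shows up as `|`-separated fields in that name. The function name `find_suspect_allele_names` and the sample lines are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_suspect_allele_names(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, name) for '%' entries whose name contains '|'.

    Assumptions (not confirmed by the IGoR or OLGA docs):
      * entries of interest start with '%',
      * the name field ends at the first ';',
      * a '|' in the name indicates leftover IMGT header fields.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("%"):
            continue  # section headers and other lines are ignored
        name = line[1:].split(";", 1)[0]
        if "|" in name:
            suspects.append((number, name))
    return suspects


if __name__ == "__main__":
    # Hypothetical excerpt; the sequences and header fields are placeholders.
    sample = [
        "@Event_list",
        "#GeneChoice;V_gene;Undefined_side;7;v_choice",
        "%K02545|TRBV1*01|Homo sapiens;GATACT;0",
        "%TRBV2*01;GAACCT;1",
    ]
    for number, name in find_suspect_allele_names(sample):
        print(f"line {number}: name field is not a bare allele name: {name}")
```

An empty result would suggest, under the assumptions above, that the name fields hold bare allele names only.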

I've attached what I believe should be the default TCR beta model_parms.txt, with the IMGT information removed for the alleles: model_parms.txt.
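For anyone who needs to clean an existing final_parms.txt from a custom model, the following is a minimal sketch of one way to do it. It is not how the attached file was necessarily produced. It assumes that allele entries are `%`-prefixed lines with `;`-separated fields, and that the extra IMGT information takes the form of a `|`-delimited header in which the allele name is the second field. Both points should be checked against the actual file before relying on this. The function names are hypothetical.

```python
def clean_allele_name(name: str) -> str:
    """Return the bare allele name from a '|'-delimited header.

    Assumption: the allele name is the second '|' field,
    e.g. 'K02545|TRBV1*01|Homo sapiens|...' -> 'TRBV1*01'.
    Names without '|' are returned unchanged.
    """
    fields = name.split("|")
    return fields[1] if len(fields) > 1 else name


def clean_parms_line(line: str) -> str:
    """Strip extra header fields from the name of one '%' entry.

    Lines that do not start with '%' or have no ';' are left untouched,
    so section headers and other records pass through as-is.
    """
    if not line.startswith("%"):
        return line
    name, sep, rest = line[1:].partition(";")
    if not sep:
        return line
    return "%" + clean_allele_name(name) + sep + rest


def clean_parms_file(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with cleaned allele names to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(clean_parms_line(line))
```

A call such as `clean_parms_file("final_parms.txt", "final_parms_clean.txt")` writes to a new file rather than overwriting the original, so the two can be compared before the cleaned version is handed to OLGA.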

Thanks and take good care, Zach