General guidelines for Pgen

jeremycfd commented 6 years ago

Hi Quentin,

I've been playing around with IGoR a bit after having read the paper and I'm wondering if you can give some suggestions about how to go through the process of estimating Pgen for small datasets:

1) Say for instance we have anywhere from 50 to a few hundred human single-chain TCR sequences that are also epitope-specific. Would you recommend simply using the model that comes with IGoR to estimate Pgen, or do you anticipate any improvement in first using -infer to update the model, even though there are relatively few sequences and they are not representative of random selection from the repertoire?

2) I recall from the paper that you recommend considering at least 50 scenarios for each somatic recombination event. But when I set --scenarios to any value, I can't see evidence in the logs or the output that the number of scenarios I specified is actually being used. Perhaps I'm looking in the wrong place. Can you advise? Or perhaps estimating Pgen doesn't benefit from considering more than the 10 most likely scenarios?

Thanks!

qmarcou commented 6 years ago

Hi Jeremy, sorry for the late reply. These are two interesting questions; see my answers below:

1) From what we have observed, only gene usage and/or allele sequences vary between the recombination machinery of different individuals. If you had a few hundred out-of-frame sequences, I would recommend re-learning only the gene usage distributions. In your case, the best option is probably to use the provided model as is. In any case, gene usage variability is not what controls most of the variation in Pgen.

2) There are two different things here: the number of scenarios explored by IGoR and the number of scenarios that IGoR outputs. Even with --scenarios 50, IGoR will explore many more of them; however, only 50 of them will be written to file in the output directory. What controls the number of scenarios IGoR explores during an Expectation-Maximization step is the --P_ratio_thresh and/or --MLSO options. In theory, the more scenarios are explored the better; in practice there is a trade-off with runtime, but the probability ratio threshold should not be set too high.
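To make the explored-versus-written distinction concrete, here is a toy Python sketch. It is not IGoR's code: the function names and the scenario probabilities are made up, and it assumes a simplified pruning rule in which a scenario is kept when its probability is at least the threshold times that of the best scenario. The point is only that the threshold changes the summed probability, while the number of scenarios written out does not.

```python
def explore(scenario_probs, p_ratio_thresh):
    """Hypothetical pruning rule: keep scenarios whose probability is at
    least p_ratio_thresh times the probability of the best scenario."""
    best = max(scenario_probs)
    return [p for p in scenario_probs if p >= p_ratio_thresh * best]


def summarize(scenario_probs, p_ratio_thresh, n_output):
    """Return (summed probability, number explored, scenarios written)."""
    explored = explore(scenario_probs, p_ratio_thresh)
    # The estimate sums over everything explored, not only what is written.
    pgen_estimate = sum(explored)
    # Only the n_output most probable explored scenarios are "written".
    written = sorted(explored, reverse=True)[:n_output]
    return pgen_estimate, len(explored), written


# Made-up scenario probabilities for one sequence (each half the previous).
probs = [1e-9 * 0.5**k for k in range(40)]

for thresh in (1e-1, 1e-5, 0.0):
    pgen, n_explored, written = summarize(probs, thresh, n_output=10)
    print(f"thresh={thresh:g}: explored={n_explored}, "
          f"written={len(written)}, summed probability={pgen:.6e}")
```

In this toy setting, lowering the threshold explores more scenarios and nudges the summed probability upward, whereas changing `n_output` only changes how many scenarios are reported.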

Hope this answers your questions

jeremycfd commented 6 years ago

Thanks for the explanation! I find that setting --P_ratio_thresh to 0.0 causes issues (every Pgen comes back as nan) but I can set it to extremely small values (e.g., 1E-10) without issue. What is the default P_ratio_thresh? (Perhaps that could be added to man igor).

Cheers.

qmarcou commented 6 years ago

Hmm, that is odd: as explained here, setting it to 0.0 should explore all possible scenarios (making execution very slow), and offhand I don't see why this would return nan. Could you attach a sample of the pgen and inference_logs files for debugging purposes? The default value for this parameter is 10^{-5}. I thought it was in the README; it will be added. Thanks for pointing this out!
