matsengrp / sumrep

Summary statistics for repertoires
16 stars 6 forks source link

Comparing Substitution and Targeting Models #6

Closed BrandenOlson closed 5 years ago

BrandenOlson commented 7 years ago

Substitution and targeting models for motifs are implemented in the shazam R package. Our idea is to infer substitution, mutability, and targeting models for two repertoires and then compare them by a sum (or average) of absolute differences of the rates over all motifs (or motif, target-base pairs).

The createSubstitutionMatrix function is particular in how it expects the input data to be formatted. For instance, it wants the sequence and germline columns to be "IMGT-gapped". Is there an R function which does this gapping programmatically, given the lists of naive and mature sequences, and perhaps also a list of inferred V-genes? If not, does anyone know of a clean way to do this?

javh commented 7 years ago

There isn't an R function I'm aware of, but there is a python one: changeo.Parsers.gapV

If you have the V call, the alignment start position in the reference, and the query alignment start position, then you can walk through the IMGT-gapped reference sequence and add the corresponding gaps to the query sequence.

The biggest obstacle is insertions. You'd need to recognize them prior to adding the IMGT gaps and delete those positions from the query sequence. Otherwise the IMGT numering will be off.

matsen commented 7 years ago

Thanks a bunch for the help, @javh ! We're still sorting out our approach.

BrandenOlson commented 6 years ago

According to our spreadsheet, we want to examine and compare statistics using a repertoire's inferred substitution matrix (obtained via shazam::createSubstitutionMatrix) as well as the targeting matrix (obtained via shazam::createTargetingMatrix). It seems that createTargetingMatrix requires a substitution model as well as a mutability model. The mutability model is obtained via shazam::createMutabilityMatrix, which also requires a substitution model as input.

It seems that for each example I have looked at, the mutability model returned by createMutabilityMatrix depends of course on the dataset input, but yields the same model regardless of the supplied substitution model. This seems strange since the substitution model is a mandatory argument, and the source code does seem to incorporate the model in its result. Nonetheless I have yet to find a counterexample using datasets of various sizes, and was hoping for an explanation on either why I am mistaken, or why the model is not incorporated. Thanks!

javh commented 6 years ago

I don't know. That seems wrong. I'll ask around and see if anyone here knows what's going on.

BrandenOlson commented 6 years ago

Much appreciated, @javh!

javh commented 6 years ago

@julianqz just a did a test using different substitution matrices as arguments to createMutabilityMatrix and couldn't reproduce the problem. He got different models as output.

Do you have an example you can send us that causes this to happen?

julianqz commented 6 years ago

I attached the script (should be self-explanatory with the comments) and mock data I used for testing in the zip file below:

compare.zip

BrandenOlson commented 6 years ago

Thanks, folks - in the meantime, here is an example of the problem on my end:

compare_BJO.zip

(The package version is ‘0.1.8’)

javh commented 6 years ago

@julianqz checked the data @BrandenOlson posted and got the same results wherein the substitution model didn't impact the output. He also noted that only 17 of the sequences appeared mutated in the data set. We still don't know what's going on, but that seems suspicious.

BrandenOlson commented 6 years ago

@javh and @julianqz : I just did a quick check and it appears to me that there are 17 sequences that are unchanged from naive to mature, but 4,891 that are different in some way. These are the commands I executed on the example I sent:

> dat_1[dat_1$naive_seq != dat_1$mature_seq, ] %>% nrow
[1] 4891
> dat_1[dat_1$naive_seq == dat_1$mature_seq, ] %>% nrow
[1] 17

In the meantime, since createMutabilityMatrix really should depend on the substitution model, I will go ahead and incorporate this in the sumrep functions.

julianqz commented 6 years ago

Oops! Sorry, my bad! Must have confused == with != somehow.

BrandenOlson commented 6 years ago

@javh / @julianqz - any updates on the createMutabilityMatrix input issue?

javh commented 6 years ago

Not yet. Talked to Steve about it and he said the substitution model would only be used when there are replacement mutations. By default, createMutabilityMatrix should consider both replacement and silent mutations (model="RS"), so that seems unlikely given the number of sequences.

javh commented 6 years ago

So I took a look at this and it is due to differences in the algorithm for the model="RS" and model="S", but I had them backwards before. The substitution model isn't used for model="RS". I got different answers using model="S". I tracked it down to the background counts being the same with model="RS" and asked Gur about it. He said:

There is no need for a substitution model to create a mutability model in the "RS" mode. There are some delicate issues when counting possibilities for stop codons, but putting aside this relatively minor issue, there is no need for a substitution model for this task. We need it in the "S" mode as there are many situations in which only a certain mutation will lead to an S mutation and others not. For example, UUA can mutate to UUG and remains Leucine but if it mutates to UUT it will encode Phenylalanine. This is why we need to take into account in the background only the relevant options of the substitution matrix, and in the foreground (count) we count the substitution rate that has occurred. On the other hand, in the "RS" mode, any mutation is valid (once again, stop codons are currently overlooked) and this is why we can simply use integers there.

I'll update the documentation to reflect this.

BrandenOlson commented 5 years ago

Thanks for clearing that up, @javh! (sorry for the delayed response)