gstricker / GenoGAM

http://bioconductor.org/packages/devel/bioc/html/GenoGAM.html

missing value error when calling genogam() #4

Closed RichardJActon closed 3 years ago

RichardJActon commented 4 years ago

Hi, I'm getting an error that I'm struggling to interpret. I invoked the genogam() function on a GenoGAM data object that I built earlier, in the following fashion:

library(GenoGAM)
settings <- GenoGAMSettings(
    optimMethod = "L-BFGS-B", optimControl = list(maxit = 100L),
    estimControl = list(eps = 1e-10, maxiter = 500L)
)
gdd <- readRDS("GenoGamDataObjectPath.Rds")
BiocParallel::register(BiocParallel::MulticoreParam(workers = 4))
res <- GenoGAM::genogam(gdd, intervalSize = 200)

and got this output:

INFO [2020-06-26 15:39:35] Initializing the model
INFO [2020-06-26 15:39:35] Done
INFO [2020-06-26 15:39:35] Estimating hyperparameters
INFO [2020-06-26 15:39:38] The design matrix has the same number of samples and coefficients. The top regions for cross-validation will be determined by the highest sum of counts.
  Nelder-Mead direct search function minimizer
function value for initial parameters = 72397.681842
  Scaled convergence tolerance is 0.00107881
Stepsize computed as 0.639693
BUILD              3 72497.354963 69931.476276
EXTENSION          5 72397.681842 69127.143323
Error in while (lldiff > epsilon & normGrad > epsilon & k <= control$maxiter) { :
  missing value where TRUE/FALSE needed

I was trying to figure out what lldiff, epsilon, normGrad, k, and control$maxiter were, to see if I had failed to set a value for them somewhere. I took a look at the source to try and figure out what might be causing this and found the relevant line in two places:

https://github.com/gstricker/GenoGAM/blob/0ac7b967b6926bc6810909bec3c6926a91128c2e/R/misc.R#L130

https://github.com/gstricker/GenoGAM/blob/0ac7b967b6926bc6810909bec3c6926a91128c2e/R/estimation.R#L170

but I haven't been able to figure out why any of these values might be missing. Do you have any suggestions?

gstricker commented 4 years ago

Hi, I'll take a look over the weekend and see what can be done. There is a call to a C++ function that actually runs the inner optimization here: https://github.com/gstricker/GenoGAM/blob/0ac7b967b6926bc6810909bec3c6926a91128c2e/R/estimation.R#L180. If that came out NA, then lldiff might resolve to NA.
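
As a minimal sketch of why that produces exactly this error (plain R, not the actual GenoGAM internals): a while() condition cannot be evaluated on NA, and an NA log-likelihood propagates into the comparison.

lldiff <- NA_real_   # e.g. newLik - oldLik, where the inner C++ call returned NA
epsilon <- 1e-10
normGrad <- 0.5
k <- 1
maxiter <- 500L
while (lldiff > epsilon & normGrad > epsilon & k <= maxiter) {
    k <- k + 1       # never reached
}
## Error in while (...) : missing value where TRUE/FALSE needed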

The log reads as though Nelder-Mead could have picked a very unstable parameter combination for the next iteration, given the optimization parameters. For a quick check, you could try successively increasing eps in the estimControl settings and see if that helps.

gstricker commented 4 years ago

Looking at the way you have constructed gdd above, it seems you are loading a previous R object which contains the GenoGAMDataSet. Thus, your settings have no effect, as all settings have to be given as a GenoGAMSettings object at construction time (see the README) and then modified directly on the object (in this case gdd).

Could you please show me how you have constructed the initial object? Also, could you please switch to debugger logging (set flog.threshold() to DEBUG) and run the analysis again? Thanks!

RichardJActon commented 4 years ago

Ah OK - let me try it with the settings supplied to the objects at the time of their construction - I did not set them when I built the data objects. I assumed that the genogam() function would pull the settings object from the global environment, but that does not track with a closer reading of the documentation. I assumed this because it made sense to me that you would set model-fitting parameters at the time you call the model-fitting function, not at data-object construction. Does this mean that you have to rebuild the data objects if you want to change model-fitting parameters, or are there methods for altering the settings on existing data objects? Thanks for taking the time to help me out.

RichardJActon commented 4 years ago

OK so my analysis now looks like this:

library(GenoGAM)
futile.logger::flog.threshold("DEBUG")
settings <- GenoGAM::GenoGAMSettings(
    optimMethod = "L-BFGS-B", optimControl = list(maxit = 100L),
    estimControl = list(eps = 1e-5, maxiter = 500L)
)
BiocParallel::register(BiocParallel::MulticoreParam(workers = 4))
gdd <- GenoGAM::GenoGAMDataSet(
    desmat, design = formula,
    directory = pathToData,
    chunkSize = 50000, overhangSize = 1000,
    settings = settings
)
gdd <- GenoGAM::computeSizeFactors(gdd)
res <- GenoGAM::genogam(gdd, intervalSize = 200)

The object building appears to work fine and the settings changes are reflected in the printed GenoGAMSettings, but when I run the model fitting I eventually get output like the lines below. So I'm assuming you are right that an unstable parameter combination is being chosen.

DEBUG [2020-06-29 23:20:56] ||gr|| = NaN
DEBUG [2020-06-29 23:20:56] Likelihood = Inf

I changed eps to 1e-5 when I built the initial GenoGAM data object. Is there any reason why changing the eps parameter directly on the data object like this would be an issue? That would save me rebuilding the data object to try different eps values:

gdd@settings@estimControl$eps <- 1e-1

Also, do you have any other advice for finding stable parameters? Thanks again.

[Update]: I tried gdd@settings@estimControl$eps <- 1e-1 and also eventually hit a NaN result

gstricker commented 4 years ago

There are a couple of things going on under the hood, but the problem is usually rooted in low count regions and the underlying negative-binomial distribution.

The joint estimation of the overdispersion parameter theta and the smoothness parameter lambda is done in two steps, because the objective can't be easily expressed analytically and thus no derivatives exist. The estimation procedure maximizes the joint likelihood over a set of lambda and theta values, but via a cross-validation scheme rather than derivatives. A very small eps can make it difficult to converge to the maximum and can send the search shooting off in the wrong direction on the lambda-theta plane, that is, picking combinations of both parameters which, in low-count regions and given the negative-binomial distribution, quickly become unstable.

Generally, the defaults should work fine, as they are chosen for stability. Nelder-Mead as an optimization method does not require a derivative and should be slow but robust. L-BFGS-B can be faster but is less stable. According to the documentation it works without derivatives, although I'm not sure what heuristic it uses, since L-BFGS generally does require a derivative.
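
As a rough illustration (assuming the hyperparameter search runs through stats::optim, which the optimMethod/optimControl names suggest): when no gradient is supplied, optim's L-BFGS-B falls back to finite-difference gradient approximations, which can misbehave on a noisy objective such as a cross-validated likelihood, whereas Nelder-Mead only ever compares function values.

# stand-in for the CV-based joint likelihood: a smooth trend plus noise
noisyObjective <- function(p) {
    sum(p^2) + rnorm(1, sd = 0.01)
}
optim(c(1, 1), noisyObjective, method = "Nelder-Mead")  # derivative-free, robust
optim(c(1, 1), noisyObjective, method = "L-BFGS-B")     # finite-difference gradients, can go astray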

Possible solutions are:

- switch optimMethod back to the default Nelder-Mead, which is slower but more robust
- keep the default estimControl settings, or increase eps rather than decreasing it
- try a different intervalSize for the cross-validation regions

Hope that helps. What dataset and which design matrix are you working with?

RichardJActon commented 4 years ago

Interesting. I tried a range of interval sizes from 20 to 200 with similar results. I'm working with C. elegans data for PolII and a non-standard histone variant (paired-end). Maybe I'm missing something that should be obvious in my experimental design. I have 3 replicates for my input background and 3 for my exposure, as below:

The design formula for my test case is: ~ s(x) + s(x, by = input) and this is the design matrix:

ID file paired genotype input polII HTZ treatment
Input_WT1_UV K002000093_53939_sorted.bam TRUE 0 1 0 0 1
WT1_UV_RNAPII_ChIP K002000093_53947_sorted.bam TRUE 0 0 1 0 1
Input_WT2_UV K002000093_53963_sorted.bam TRUE 0 1 0 0 1
WT2_UV_RNAPII_ChIP K002000093_53971_sorted.bam TRUE 0 0 1 0 1
Input_WT3_UV K002000093_53987_sorted.bam TRUE 0 1 0 0 1
WT3_UV_RNAPII_ChIP K002000093_53995_sorted.bam TRUE 0 0 1 0 1

There are some extra columns as I have some more complex designs to do subsequently. I was testing by just fitting a model for the RNAPII levels relative to my input background.

I'll give Nelder-Mead a shot

Thanks

gstricker commented 4 years ago

Ok, I think I have the answer to the current problem but have to fix some functionality down the line.

As of now, the design matrix and the formula model the data as a combination of a baseline and input, which will probably result in a pretty much flat signal. To model the RNAPII levels relative to the input background, you just need one additional column, the polII column, and the formula ~s(x) + s(x, by = polII). This will produce two smooth signals, one for the background (s(x)) and one for the RNAPII levels relative to the background (s(x, by = polII)). See the config file of the example dataset shipped with the package: https://github.com/gstricker/GenoGAM/blob/d4c45e45cbfbb2c4c8b1806fe4e9b3e10b94e0ed/inst/extdata/Set1/experimentDesign.txt

Thus, could you please try it with a modified config file? You would only have 4 columns: the three mandatory ones and a fourth, your polII column. The rest should be deleted. I am fairly certain this should work. Btw, you don't need a column with all 1s; this is handled automatically internally.
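
For concreteness, a sketch of what the simplified setup could look like (the config file name is a placeholder, and I'm assuming the config file path is passed to GenoGAMDataSet directly, as in the README):

library(GenoGAM)
# config with only the three mandatory columns (ID, file, paired) plus polII
gdd <- GenoGAMDataSet(
    "experimentDesign_polII.txt",
    design = ~ s(x) + s(x, by = polII),
    directory = pathToData,
    chunkSize = 50000, overhangSize = 1000
)
gdd <- computeSizeFactors(gdd)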

If that is the case, it means that the functions that handle subsetting of the data according to the formula get confused for some reason when there are more columns present in the design matrix than expected by the formula. I would need to look into that. Let me know if it works with my suggested modified config file/design matrix, and I will know where to look to fix it.

Thanks!

RichardJActon commented 4 years ago

Right - so I was/am clearly a little confused about how to specify the design. For some reason I thought you had to explicitly specify which samples were the input controls - clearly, this can be inferred from the samples marked 0 once I give it the polII column. I'll run an example by you to see if I've got it now:

So, taking the slightly more complex question 'What are the RNAPII levels relative to the combined background and HTZ levels?' with this matrix:

ID file paired polII HTZ
Input_WT1_UV K002000093_53939_sorted.bam TRUE 0 0
WT1_UV_RNAPII_ChIP K002000093_53947_sorted.bam TRUE 1 0
WT1_UV_HTZ_ChIP K002000093_53955_sorted.bam TRUE 0 1
Input_WT2_UV K002000093_53963_sorted.bam TRUE 0 0
WT2_UV_RNAPII_ChIP K002000093_53971_sorted.bam TRUE 1 0
WT2_UV_HTZ_ChIP K002000093_53979_sorted.bam TRUE 0 1
Input_WT3_UV K002000093_53987_sorted.bam TRUE 0 0
WT3_UV_RNAPII_ChIP K002000093_53995_sorted.bam TRUE 1 0
WT3_UV_HTZ_ChIP K002000093_54003_sorted.bam TRUE 0 1

Would I use a design like this: ~s(x, by = HTZ) + s(x, by = polII) ?

Would you recommend fitting models together or separately? If I were interested in both polII and HTZ relative to background, would it be best to use ~s(x) + s(x, by = polII) + s(x, by = HTZ), or to fit ~s(x) + s(x, by = polII) and ~s(x) + s(x, by = HTZ) separately? Or does it not particularly matter?

I'll try fitting the ~s(x) + s(x, by = polII) model both with the minimal design matrix without the extra columns and with my whole matrix, just specifying the desired column, to make sure that this works fine and there is no subsetting issue.

Thank you for your patience!

gstricker commented 4 years ago

That's a valid question. The best way to understand it is to look at Figure 1b in the original paper.

In simple terms, the model interprets all the given data as a combination of different "signals". By specifying the formula, you are giving it the number of "signals" you wish the data to be "decomposed" into. So, for example, given the formula ~s(x) + s(x, by = polII) + s(x, by = HTZ), the model will take the data and see how much of the total data can be attributed to s(x), how much to s(x, by = polII), etc. The total data is specified through the design matrix, so that the model knows which data contributes to a given "signal" and which doesn't. For example, let's take the first two lines of your design matrix without HTZ:

ID file paired polII
Input_WT1_UV K002000093_53939_sorted.bam TRUE 0
WT1_UV_RNAPII_ChIP K002000093_53947_sorted.bam TRUE 1

Here the model will take any count abundance or depletion found in the RNAPII data compared to the Input data and attribute it to the polII signal, since you have specified with the 1 in the design matrix that this particular dataset contributes to the polII signal. Now, if we add another row for HTZ, as in your previous example:

ID file paired polII HTZ
Input_WT1_UV K002000093_53939_sorted.bam TRUE 0 0
WT1_UV_RNAPII_ChIP K002000093_53947_sorted.bam TRUE 1 0
WT1_UV_HTZ_ChIP K002000093_53955_sorted.bam TRUE 0 1

what we get is this: any count difference in the HTZ data compared to Input will be attributed to the HTZ signal. Likewise, any count difference in RNAPII compared to Input will be attributed to the polII signal. That is, the signal you get for RNAPII here should be the same as above without HTZ. In other words, the model looks at the three datasets and sees that a certain number of counts is present in all three, so it attributes those to the background signal (implied by the formula but not present in the design matrix). It then looks at the remaining counts that couldn't be attributed to the background and attributes them to RNAPII or HTZ, based on the dataset it is looking at. So if, let's say, there is an equally big bump in counts at a certain position in both the HTZ and RNAPII datasets, you will see that bump in the polII as well as the HTZ signal.

However, if we set a 1 in the WT1_UV_RNAPII_ChIP - HTZ cell, like this:

ID file paired polII HTZ
Input_WT1_UV K002000093_53939_sorted.bam TRUE 0 0
WT1_UV_RNAPII_ChIP K002000093_53947_sorted.bam TRUE 1 1
WT1_UV_HTZ_ChIP K002000093_53955_sorted.bam TRUE 0 1

then the bump would be attributed to the HTZ signal and the RNAPII signal would be flat. This is because both datasets, RNAPII and HTZ, contribute to the HTZ signal, whereas only the RNAPII dataset contributes to the polII signal. Thus, with the same bump in counts in both datasets, it is more likely to be coming from the HTZ signal, as that signal takes both datasets into account.

Back to your original question ('What are the RNAPII levels relative to the combined background and HTZ levels?'): not quite. The design matrix above works for the separate case that you mention at the end. The combined case requires you to set a 1 in the RNAPII rows of the HTZ column.
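
In code, the design matrix for the combined case would look like this (first replicate only, as a sketch based on your table):

desmat <- data.frame(
    ID     = c("Input_WT1_UV", "WT1_UV_RNAPII_ChIP", "WT1_UV_HTZ_ChIP"),
    file   = c("K002000093_53939_sorted.bam",
               "K002000093_53947_sorted.bam",
               "K002000093_53955_sorted.bam"),
    paired = TRUE,
    polII  = c(0, 1, 0),
    HTZ    = c(0, 1, 1)  # the RNAPII dataset also contributes to the HTZ signal
)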

Hope that helps!

RichardJActon commented 3 years ago

Thanks for your help @gstricker, your explanation was clarifying. I have determined that the failure to converge is most likely due to coverage issues. I only have a mean of about 4.9 million properly mapped paired reads per sample (E. coli contamination issues). Given the expected distribution of the RNAPII and HTZ marks and the total size of the features these marks are likely to cover in the C. elegans genome, I would probably need more like 15 million reads to achieve sufficient coverage (based on: https://doi.org/10.1093/nar/gku178).