standardize throws an error because of package lmfit

japhir commented 3 years ago

after having solved the first output errors in #5, I immediately ran into another issue when calling mydata.standardize():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/py2vPmgk", line 3, in <module>
  File "/tmp/babel-UvbDtD/python-jGCwX1", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1625, in standardize
    params.add(f'D{self._4x}_{pf(sample)}', value = 0.5)
  File "/home/japhir/.local/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/home/japhir/.local/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'D47_AU002_(2)' is not a valid Parameters name"

Note that the lines referring to /tmp/... etc. are there because I'm running python from an orgmode file in emacs.

I'm not sure how to debug this one, so I'll leave it as an issue here! I'm sure it must be something in my dataset (cannot share unfortunately, latest measurements) because I didn't get this error when running the code on your example tiny dataset. My dataset is structured as follows:

UID,Session,Sample,d45,d46,d47,d48,d49
21401,2021-05-12,ETH-2,-7.442475902551424,-5.885893603532355,-13.981637618936443,-12.401052771990098,-19.14472530378544
...

I've made sure there are no NAs in there, so I'm a bit confused.

I'm running python 3.9.6 with D47crunch 2.0.2, which in my case relies on lmfit 1.0.2, matplotlib 3.4.3, scipy 1.7.1 and numpy 1.21.2

mdaeron commented 3 years ago

KeyError: "'D47_AU002_(2)' is not a valid Parameters name"

That is an error from lmfit, because lmfit model parameter names cannot use a number of special characters (because it clashes with the expression of mathematical constraints such as param_a = 1/(param_b+1)). From here:

the name must match [a-z_][a-z0-9_]* and cannot be a Python reserved word.

D47crunch uses the (crude) function pf() to solve common sample name problems.

option 1: edit pf() to handle parentheses. Hacky but easy.
option 2: instead of parameter names, use a well-chosen hash of the sample name. Safe (no collisions), but now when an error is raised it will be annoying to look up which sample is the problem.

Any preference between these two options?

japhir commented 3 years ago

Thanks for the quick reply! Ah, I guess an informative error would suffice here, really. I think it makes total sense not to allow parentheses or other weird characters in the UIDs. I'll just get that stuff out in R and then try again.

japhir commented 3 years ago

Hmm after trying to resolve this, I guess forcing the sample id's to not start with a number might be limiting. I got this to work by replacing all (, ), [, ], and + with _, but there are also many samples that have ^[a-z0-9_]*. Furthermore, it seems like all ETH-1, ETH-2's etc. with their - are also a problem. EDIT: hmm that cannot be the problem, as it's now explicitly asking for an ETH-1, ETH-2, and ETH-3.

mdaeron commented 3 years ago

Could you test whether ed04198dd73368668aa4a1faa389cc085753b55a works for you? It uses a 6-digit hash followed by a sanitized version of the original sample name. I expect it to work well with any arbitrary sample name.

japhir commented 3 years ago

This seems to work! The standardize step is pretty slow (several minutes for a 552 samples and 1959 anchors) but I have no idea how fast it should be, and computing session ETFs with fancy maths is also very slow in my R version. It didn't throw an immediate error about the wrong Sample names this time and seems to have worked out okay!

mdaeron / D47crunch

standardize throws an error because of package lmfit #6