mdaeron / D47crunch

Python library for processing and standardizing carbonate clumped-isotope analyses, from low-level data out of a dual-inlet mass spectrometer to final, “absolute” Δ47, Δ48, and Δ49 values with fully propagated analytical error estimates.

crunch throws an error with failed d45 type #5

Closed: japhir closed this issue 2 years ago

japhir commented 2 years ago

Hi Mathieu!

I've been trying out D47crunch for our data!

The first issue I had was that our UIDs and Sessions were integers and datetimes respectively, and that I had to cast them to character to get mydata.crunch() to work. Otherwise it would throw this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pygzCIH1", line 9, in <module>
  File "<string>", line 1, in <module>
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1274, in crunch
    self.compute_bulk_and_clumping_deltas(r)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1349, in compute_bulk_and_clumping_deltas
    R45 = (1 + r['d45'] / 1000) * R45_wg
TypeError: unsupported operand type(s) for /: 'str' and 'int'

I think I managed to fix this by first casting them to characters and then running it again.
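(A minimal sketch of that workaround on the Python side, assuming D47crunch's internal list-of-dicts layout; the rows below are made-up stand-ins, not a real D47data object:)

```python
# Stand-in for the list of dictionaries that D47crunch builds after read();
# the values here are invented for illustration only.
mydata_rows = [
    {'UID': 101, 'Session': 20210101, 'd45': 5.79},
    {'UID': 102, 'Session': 20210101, 'd45': 5.80},
]

# Cast UID and Session to str, leaving the numeric isotope columns alone:
for r in mydata_rows:
    r['UID'] = str(r['UID'])
    r['Session'] = str(r['Session'])

print(type(mydata_rows[0]['UID']))  # <class 'str'>
```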

mdaeron commented 2 years ago

Good point about the documentation.

We could use str() when reading the Session field, with the following results:

from datetime import date, datetime
import arrow # another popular option

str(date.today())   # returns '2021-09-10'
str(datetime.now()) # returns '2021-09-10 12:03:37.940927'
str(arrow.now())    # returns '2021-09-10T12:04:02.268544+02:00'

Would that work for your use case?

japhir commented 2 years ago

Am I correct in that arrow includes the timezone and datetime relies on UTC?
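(Checking with the standard library: datetime.now() actually returns a naive local time with no timezone attached, rather than UTC; an aware UTC timestamp requires an explicit tzinfo:)

```python
from datetime import datetime, timezone

naive = datetime.now()               # local wall-clock time, tzinfo is None
aware = datetime.now(timezone.utc)   # timezone-aware UTC timestamp

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC
```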

I am an absolute beginner in python, so I have no idea what packages are out there, what they do, or how to modify the examples you listed in the docs to my liking ;-). I basically used R to generate the csv in the desired structure and then ran the following in sequence:

  import D47crunch
  mydata = D47crunch.D47data()
  mydata.read('/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/rawdata.csv')
  mydata.wg()  # as an aside, where do I tell it the d13C and d18O of the working gas?
  mydata.plot_distribution_of_analyses() # gives an unusable plot if you have many measurements, in my case 1779 measurements only (that's a tiny subset of everything since may 2021)
  mydata.crunch() # this gave me the error if I didn't make sure UID and Session were characters
  mydata.standardize() # this still gives me the error in #6 
  mydata.summary(verbose = True, save_to_file = False) # can only reach this step with the simulated data

The #6 in the code comment above doesn't create a link, so here it is.

mdaeron commented 2 years ago
  1. Going back to your original first message above, the error you report has nothing to do (I think) with the Session field. Instead, it's complaining that r['d45'] is a string. After mydata.read(), what does type(mydata[0]['d45']) return?
  2. mydata.wg() is only used if you want to compute the WG composition based on your δ13C and δ18Ο standards. If you want to specify arbitrary values for the WG, add fields d13Cwg_VPDB and d18Owg_VSMOW to your csv file. This used to be included in the documentation but it was dropped at some point; I'll put it back in. I personally advise to use D47data.wg() rather than nominal WG values, but YMMV of course.
  3. plot_distribution_of_analyses(): maybe open another issue with an example of the output, and suggestions re: how to improve the plot. I've only used it once, for the data in Fiebig et al. (2021), but the plot could be heavily customized using function keywords if needed.
  4. crunch(): I don't understand how UID and Session end up not being strings, unless you used the 'session' keyword in D47data.read()? In that case, yes, the Session field needs to be a string. That doesn't explain why UID is not a string. The relevant code is below:
data = [{k: v if k in ['UID', 'Session', 'Sample'] else smart_type(v) for k,v in zip(txt[0], l) if v != ''} for l in txt[1:]]

if session != '':
    for r in data:
        r['Session'] = session
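(As a sketch of point 2, a raw-data file specifying the WG composition directly might look like the following; all values and sample names are hypothetical, and only the d13Cwg_VPDB and d18Owg_VSMOW field names come from the explanation above:)

```python
import csv, io

# Hypothetical raw-data csv with explicit WG fields (values made up):
raw = """UID,Session,Sample,d45,d46,d47,d13Cwg_VPDB,d18Owg_VSMOW
A01,Session01,ETH-1,5.795,11.627,16.893,-4.20,24.92
A02,Session01,SMP-01,-4.024,4.935,0.642,-4.20,24.92
"""

# DictReader maps each data row to a dict keyed by the header fields:
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]['d13Cwg_VPDB'])  # -4.20
```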
japhir commented 2 years ago
  1. I was able to get the code running only if I took a small subset of my whole dataset. I thought there were some rows forcing the whole d45 column to be a string; however, for one of the subsets that breaks (n = 1000) I still get the same error even though type(mydata[0]['d45']) returns <class 'float'>.
  2. thanks for the explanation! I'll have a look at whether the wg computed value is very different from the one specified by our gas provider.
  3. OK, I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.
  4. I think I'm mixing issues here, sorry about that. If I filter to rows where the above error doesn't take place, this error pops up with
    >>>  mydata.standardize()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/tmp/pyPjNI3W", line 3, in <module>
    File "/tmp/babel-yqvP9s/python-GXTFNT", line 1, in <module>
    mydata.standardize()
    File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
    File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1618, in standardize
    params.add(f'a_{s}', value = 0.9)
    File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
    File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
    KeyError: "'a_2020_01_03T00:00:00Z' is not a valid Parameters name"

    If I then re-apply my manual fix to convert Session to character before exporting to csv and re-run, I get the other issue #6, where the Sample names cause problems:

    >>>   mydata.standardize()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/tmp/pykIhOHv", line 3, in <module>
    File "/tmp/babel-yqvP9s/python-EUrohX", line 1, in <module>
    mydata.standardize()
    File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
    File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1625, in standardize
    params.add(f'D{self._4x}_{pf(sample)}', value = 0.5)
    File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
    File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
    KeyError: "'D47_AU002_(2)' is not a valid Parameters name"
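(Both KeyErrors above come from lmfit, which rejects parameter names that are not valid Python identifiers, so the colons in the session name and the parentheses in the sample name both trip it up. D47crunch has its own pf() helper for this, visible in the traceback; the sketch below is a hypothetical stand-alone equivalent, not the library's actual implementation:)

```python
import re

def sanitize(name):
    # Replace every character that is not a letter, digit or underscore,
    # and prefix names that would start with a digit, so the result is a
    # valid Python identifier (hypothetical stand-in for D47crunch's pf()):
    s = re.sub(r'\W', '_', str(name))
    return s if s.isidentifier() else '_' + s

print(sanitize('2020_01_03T00:00:00Z'))  # _2020_01_03T00_00_00Z
print(sanitize('AU002_(2)'))             # AU002__2_
```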
mdaeron commented 2 years ago
  1. I was able to get the code running only if I took a small subset of my whole dataset. I thought there were some rows forcing the whole d45 column to be a string; however, for one of the subsets that breaks (n = 1000) I still get the same error even though type(mydata[0]['d45']) returns <class 'float'>.

mydata[0] is only the first line. To find the offending lines, you can try:

for k,r in enumerate(mydata):
    try:
        r['d45'] / 1000
    except:
        print(k, r['d45'], type(r['d45']))

This will print out the index, r['d45'] and type for the offending lines.

mdaeron commented 2 years ago
  1. OK, I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.

Agreed in principle, but such a plot could be pretty useful for anybody, there's nothing specific to Jens' study here. Again, if it works poorly for your use case, let me know and we can improve it if you feel it's worth it.

japhir commented 2 years ago

Aargh, now it's breaking again, but this time type(mydata[0]['d45']) is <class 'str'>.

I think it's because the Sample names in the subset that is causing issues are things like "ETM-2_690_5-83,85(Ntr)", which needs to be quoted, while some others are not quoted.

I've looked into how my R csv export decides to quote and/or escape things. By default it quotes "as needed", meaning that the above one with parentheses etc. gets quoted, but a simple ETH-3 doesn't. Maybe the offender is "ETM-2_690_4-138,145(Ntr)[2]"? With quotes, parentheses AND square brackets this one is a winner in messing up computers ;-). I found this blogpost about reading/writing csv files with edge cases. For now I will simply transform these names manually?

After specifically filtering out any Sample with (, ), [, ] I think it's running smoothly now! filter(!str_detect(identifier_1, "\\(|\\)|\\[|\\]")) %>%

mydata[0] is only the first line. To find the offending lines, you can try:

wait, so python's rows are by default lists that can contain different types for each row? o.O

mdaeron commented 2 years ago

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default so that inserts spurious columns in your data. Using tabs or semicolons instead should help a lot.

wait, so python's rows are by default lists that can contain different types for each row? o.O

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int or str), with each key associated to an arbitrary object (which could be a list or another dictionary, among other things). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (= analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses, and test for the presence of these fields using the following code:

for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')
japhir commented 2 years ago

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default so that inserts spurious columns in your data. Using tabs or semicolons instead should help a lot.

ah yes, that must be it. Is there any escape syntax that you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax right? The link I shared before has some python csv package code that specifies how to handle quote syntax etc.

japhir commented 2 years ago

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int or str), with each key associated to an arbitrary object (which could be a list or another dictionary, among other things). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (= analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses, and test for the presence of these fields using the following code:

for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')

Thanks for the elaboration :)

mdaeron commented 2 years ago

ah yes, that must be it. Is there any escape syntax that you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax right? The link I shared before has some python csv package code that specifies how to handle quote syntax etc.

I confess that I never took the time to use a real csv parser as you suggest. I'll add this as a todo item for the dev version.
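(For reference, Python's built-in csv module already handles quoted fields; a quick sketch of the difference against naive comma-splitting, using one of the sample names from this thread:)

```python
import csv, io

# One data line with a quoted Sample name containing a comma:
line = '1,Session01,"ETM-2_690_5-83,85(Ntr)",0.123'

naive = line.split(',')                       # breaks the quoted field apart
parsed = next(csv.reader(io.StringIO(line)))  # respects the quotes

print(len(naive))   # 5 (one spurious extra column)
print(len(parsed))  # 4
print(parsed[2])    # ETM-2_690_5-83,85(Ntr)
```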