nzhagen / jcamp

A set of Python utilities for reading JCAMP-DX files.
MIT License
54 stars 31 forks source link

XY checks and calculate X-values from metadata #37

Closed sdollinger closed 12 months ago

sdollinger commented 1 year ago

Related to PR-36 about ASDF decompression, x-values were not appropriately processed but should now be resolved with this PR.

During troubleshooting, it has been discovered that x-values were found to be not identical with device's actual x-values. To add, numbers of y-values do not match the number of x-values if we ignore the Y-Checks according to JCAMP-DX specification (https://iupac.org/wp-content/uploads/2021/08/JCAMP-DX_IR_1988.pdf).

Three insights from this specification:

  1. To generate a JCAMP-DX file out of IR spectra data, x-values are compressed into LASTX, FIRSTX, NPOINTS and XFACTOR. These metdata are mandatory, so, instead of from xy table, aren't x-values supposed to be extracted from this metadata?
  2. X-checks for each line in xy tables are implemented to ensure integrity of number of y-values. These are not necessilary used for extracting actual x-values.
  3. Y-checks for each line in xy tables are implemented to ensure integrity of DIF/DUP compression. This applies only if ASDF compression has been used.

This PR contains changes based on these insights. Also added a IR spectrum in ASDF format. Please let me know if there are any concerns!

nzhagen commented 1 year ago

@sdollinger: Thanks for the patch! Before checking it in, I tried running the testing script, but I got an error. I suspect that the error was already there before the patch, but I need to backtrack the repo and check. My current work has me busy for the next couple weeks, so I hope you can be patient while I find some time...

sdollinger commented 1 year ago

Sure, take your time to do it right. Actually, this patch also requires the changes in PR36. In the sake of understanding better, I decided to split those issues into PR36 and PR37. Maybe it helps to ensure that the changes in PR36 are also implemented in this PR?

Otherwise feel free to let me know the issue, maybe I can be of help.

nzhagen commented 1 year ago

I'll be glad to work on it once I get a little more time, but a short summary is as follows. In the repo, a previous contributor built the script "tests.py" to run when checking code changes. At the time it ran without errors, but it has been quite some time since we checked that file into the repo. Now, when I run the file, it gets an error when checking the file /data/mass_spectra/CH4_CI.jdx because that file does not have the now-required "FIRSTX" and "LASTX" metadata. Those are easy to add, and that fixes that error. There is another error, and also a test fail, in the file "/data/infrared_spectra/sulphur hexafluoride.jdx". The error message is TypeError: TestJcamp.test_xy_minmax() missing 1 required positional argument: 'testdict' However, in the code, every call to "test_xy_minmax()" includes the positional argument, so I haven't made any sense out of that yet. The test failure indicates that the number of data elements in "x" is not the same as in "y". I checked and saw that the number of elements in "y" is correct, at 56417. However, "x" somehow acquires 90,354 elements. This is something new occurring from PR#37 -- I haven't figured out what is going on yet. I have already checked in PR#36, and that alone does not give this error. Here is the output of running tests.py with PR#37 enabled:

====================================================================== ERROR: test_xy_minmax (main.TestJcamp.test_xy_minmax)

TypeError: TestJcamp.test_xy_minmax() missing 1 required positional argument: 'testdict'

====================================================================== FAIL: test_read_IR (main.TestJcamp.test_read_IR)

Traceback (most recent call last): File "/home/nh/repos/jcamp/tests.py", line 42, in test_read_IR self.test_xy_minmax(jcamp_dict) File "/home/nh/repos/jcamp/tests.py", line 18, in test_xy_minmax self.assertEqual(len(testdict['x']), len(testdict['y'])) AssertionError: 90354 != 56417


Ran 10 tests in 0.284s

nzhagen commented 1 year ago

Here are the first 10 elements of "x" and "y" before enabling PR#37: [575.05 575.1 575.15 575.2 575.25 575.3 575.35 575.41 575.47 575.53] [ 2.77053097e-05 -1.32697990e-05 1.01114290e-05 1.82973704e-05 -3.99391405e-05 -2.82098479e-05 -9.96777413e-06 3.82630699e-05 1.61336416e-05 -1.92046430e-05]

Here are the first 10 elements of "x" and "y" after enabling PR#37:

[575.049 575.08662925 575.12425851 575.16188776 575.19951702 575.23714627 575.27477552 575.31240478 575.35003403 575.38766329] [ 2.77053097e-05 -1.32697990e-05 1.01114290e-05 1.82973704e-05 -3.99391405e-05 -2.82098479e-05 -9.96777413e-06 3.82630699e-05 1.61336416e-05 -1.92046430e-05]

So the y-values remain the same, but there are more x-values.

sdollinger commented 1 year ago

Agreed that there is a mismatch of x-values in this file "sulphur hexafluoride.jdx".

Looking at this file in detail, the NPOINTS metadata (90354) of the file do not match the expected number of x-values in the XY table: 9402 lines each 6 y-values and last line with 5 values results in 56417 y-data points). Also DELTAX 0.0625 do not match with the expected value of 0.0602 = (3974.965 - 575.049) / 56417. So it seems that there is an inconsistency in the file itself... unless there is some other compression to reduce from 90354 xy data points to 56417 xy data points, which I can't explain when I look at the file alone.

This issue could be resolved by adjusting NPOINTS, so how about setting this metadata in that file to 56417? Then reading this file works smoothly with PR #36 as well as this PR.

bryanhanson commented 1 year ago

Just a lurker here, but using my R language readJDX package on "sulphur hexafluoride.jdx" I see the same problem where NPOINTS and the number of y values don't match, so that's two approaches suggesting there is a problem with the file.

https://github.com/bryanhanson/readJDX

nzhagen commented 1 year ago

@sdollinger: yes, fixing the NPOINTS metadata entry seems to fix the issue. What surprises me is that when I rerun the tests.py script after fixing that one, it comes up with new errors that it did not point out before. Most likely, your patch has revealed inconsistencies in several JDX files that I pulled from the literature to use as examples. These include the files /data/infrared_spectra/ethanol2.jdx /data/infrared_spectra/sulphur hexafluoride.jdx However, I also found that after fixing those I also got an error in /data/infrared_spectra/isopropanol_ASDF.jdx These errors all turned out to be the same type: changing NPOINTS fixes the error. I had to change NPOINTS inside "isopropanol_ASDF.jdx" from 9541 to 9504 to fix the issue. Does that sound correct to you? If that seems okay, then I will implement the file changes and merge the patch.

sdollinger commented 1 year ago

@nzhagen - great, thank you for this investigation! Looking closely at the decompressed values of this file /data/infrared_spectra/ethanol2.jdx, I notice that Y-checks are also implemented, i.e. the last y-value of current line and first y-value of next line must match. If we were to ignore these 206 Y-checks for each line, we would end up with 1764 y-values. I assume that NPOINTS was inaccurately set to 1970 (= 1764 + 206), perhaps because the editor did not know about these Y-checks at the time.

The error in the file isopropanol_ASDF.jdx is due to the lack of the DUP-related changes in PR #36, it seems to me that it's the only file with DUP digits in the xy-table. Nevertheless, I've updated added the same PR #36 change and also adjusted NPOINTS in both files mentioned above as you've suggested. Testing all infrared spectra files now seems to work smoothly.

nzhagen commented 12 months ago

@sdollinger: thanks for the patch! I'm glad to keep improving the code. And the new code revealed flaws in the example files, so that's another improvement. Thanks also to @bryanhanson for the error confirmation.

I still see that the /data/mass spectra/CH4_CI.jdx file is missing required metadata, but that file is something of a monster, and so I will leave that correction for a later date.