jjhelmus / nmrglue

A module for working with NMR data in Python
BSD 3-Clause "New" or "Revised" License
208 stars 85 forks source link

Only consider relevant DATATYPEs (NMR SPECTRUM / FID) when reading JCAMP-DX #120

Closed JLVarjo closed 4 years ago

JLVarjo commented 4 years ago

Quite often there's all kinds of metadata embedded in JCAMP-DX files. This PR changes the parser behavior to consider only the relevant DATATYPE sections, namely NMR SPECTRUM and NMR FID. Error messaging is also improved for cases where no data is found after all.

In addition, this PR fixes a bug in recognition of JCAMP pseudodigit format.

kaustubhmote commented 4 years ago

@JLVarjo, This looks good!

I have only one small suggestion. It seems like as it stands currently, only NMRSPECTRUM and NMRFID are considered valid. I think some of the other parameters such as NMRPEAKASSIGMMENTS might also be useful. Although the parsing for these datatypes is not currently not ideal, one can still get the data then manually parse these datasets.

Ideally, one can make a allowed_datatype list and just check whether the datatype is in this list before adding it. Something like:

allowed_datatype = ["NMRSPECTRUM", "NMRFID", "NMRPEAKTABLE", "NMRPEAKASSIGNMENTS" , "LINK"]
if datatype.strip().upper().replace(" ", "") in allowed_datatype:
...

This way, if there are other datatypes that are considered valid later, we could just add them to this list. What do you think?

JLVarjo commented 4 years ago

The actual problem behind this PR was that sometimes the different DATATYPE sections have the same data tags, which messed up things if they end up to one dict. Which is why I decided here to just dump the irrelevant ones. Your idea to store other data as well is feasible though. What if I make _readrawdic to return a dict of dicts instead, i.e. the key to outer dict is the DATATYPE? The actual parser could then continue with the spectrum/fid but all the other data is still returned to user.

kaustubhmote commented 4 years ago

I see. In that case, what do you think about returning a dict of dicts, but also unpacking NMRFID and NMRSPECTRUM parts so that the default behaviour is not changed?

JLVarjo commented 4 years ago

I assume you want the main read() function to return only (dic, data) tuple as all other parsers? So how about the following dic structure: the base level contains the data entries of the main datatype (NMRSPECTRUM or NMRFID) as currently, and all the data from other datatypes are in sub-dicts, for example with keys such as _datatype=LINK_<n>, _datatype=NMRPEAKASSIGNMENTS_<n> etc., in which n is a running index to the section in the file, in the case of multiple similar ones?

kaustubhmote commented 4 years ago

Yes, that sounds perfect. This way, none of the existing codes need to change. Will it be possible to have this as a list instead of running index? For example, could this be refactored so that the other datatypes are accessed by dic["LINK"][0], dic["LINK"][1] and so on?

JLVarjo commented 4 years ago

Even better! Will make the changes soon.

JLVarjo commented 4 years ago

Changed the behavior as discussed. I think we have to use some prefix on the subdict keys, or they may clash with the DATATYPE tag of the actual data section itself. Chose to use _datatype_ here.

There is also a dummy test case to check that the dictionary structure is correct, please find it attached here: dicstructure.zip

kaustubhmote commented 4 years ago

@JLVarjo This looks good to me. I am merging this now. Thanks, especially for the additional test. I think it really helps understand what the function does.

I have an additional suggestion here. I think the function _readrawdic is probably ripe for a refactoring into separate reading and a parsing function. Currently, I think it is quite long and seems quite a lot, and can potentially become hard to maintain as we add new functionality. This is not urgent of course, but something to keep in mind.

JLVarjo commented 4 years ago

Yes, can agree with you on that. Thanks for the pull!