lter / soilHarmonization

Homogenize LTER Soil Organic Matter Working Group data and notes
https://lter.github.io/soilHarmonization/

non-unique var ids in profile tab #14

Closed srearl closed 5 years ago

srearl commented 5 years ago

In configuring the key file version 2 update, I just noticed that many var names in the profile tab are repeated within and across the layer and fraction subcomponents. This could be a huge problem for units standardization.

srearl commented 5 years ago

could we add Level (e.g., layer, profile, fraction) to the units conversion files and join on that also?
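A minimal sketch of what joining on Level as well as var could look like. This is an illustration in Python, not the package's own code; the table layout, unit labels, and conversion factors are all assumptions made up for the example.

```python
# Hypothetical units-conversion table keyed on (var, level) instead of var alone.
# Unit labels and factors are illustrative only.
UNITS_CONVERSION = {
    ("c_tot", "layer"): {"given_unit": "g kg-1", "factor": 0.1},
    ("c_tot", "fraction"): {"given_unit": "percent", "factor": 1.0},
}


def convert(records, conversions):
    """Left-join records to conversions on (var, level) and scale each value."""
    out = []
    for rec in records:
        rule = conversions.get((rec["var"], rec["level"]))
        # Records with no matching rule pass through unchanged.
        factor = rule["factor"] if rule else 1.0
        out.append({**rec, "value": rec["value"] * factor})
    return out


records = [
    {"var": "c_tot", "level": "layer", "value": 25.0},
    {"var": "c_tot", "level": "fraction", "value": 2.5},
]
converted = convert(records, UNITS_CONVERSION)
```

With var alone as the key, the two c_tot rows would collide and one conversion rule would be applied to both; adding level to the key keeps them apart.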

wwieder commented 5 years ago

I'm not sure I completely follow your issue here, Stevan, but is this also a potentially large issue with homog? In the ISRAD / Powell Center database they are using the variable name (e.g., soc_lyr, soc_frac, etc.) to communicate whether a value is a bulk layer or a fraction measurement.

Our database has relatively little fraction data, so can we append a '_frac' suffix to the 'var' name coming out of homog if they are fraction data?

On Sun, Dec 9, 2018 at 2:48 PM StevanEarl notifications@github.com wrote:

could we add Level (e.g., layer, profile, fraction) to the units conversion files and join on that also?



srearl commented 5 years ago

Hi Will - Sorry to trouble you with this, I mostly meant this as a reminder for myself. But, yeah, that is the issue. It struck me when going through the key file update that a lot of variables have the same var name, e.g., there are two c_tot - one for bulk or general, and one for fraction. But there are many, many duplicates (or more than duplicates). I cannot recall for sure, but I do not remember accounting for this when standardizing units, so this may be a problem. And, yes, I am thinking of a solution similar to what you suggest, where I would append the level name (layer, fraction, or whatever) to the var name.

srearl commented 5 years ago

This is a fundamental problem that far exceeds just the units conversion. The duplicate names are completely botching the units conversion but also, since all investigator-supplied header names get translated to the respective var name for that entity, there is no way of keeping track of, for example, the c_tot for layer and c_tot for fraction either within a data set or across data sets. var must absolutely be unique. I think I am going to have to go back and greatly expand the key file version 2 script to change any duplicate var names so that they are unique. A lot of the duplicate names are related to fractions (e.g., there are n=6 (15n)|(13c)|(14c)|(fraction_modern)) - a question for you is whether we are getting any fraction data? If none or very little, perhaps we can consider killing the fraction data altogether.

Obviously, we will have to postpone re-keying until we get this resolved.

@wwieder @piersond
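A rough sketch of how the duplicate var names could be listed before re-keying. It is written in Python for illustration only and is not the key file version 2 script; the row layout and the `lat` var name are assumptions.

```python
from collections import defaultdict


def find_duplicate_vars(key_rows):
    """Return {var: [levels]} for every var name used on more than one key row."""
    levels_by_var = defaultdict(list)
    for row in key_rows:
        levels_by_var[row["var"]].append(row["level"])
    return {var: levels for var, levels in levels_by_var.items() if len(levels) > 1}


# Hypothetical key rows: c_tot appears at two levels, lat only once.
key_rows = [
    {"var": "c_tot", "level": "layer"},
    {"var": "c_tot", "level": "fraction"},
    {"var": "lat", "level": "location"},
]
duplicates = find_duplicate_vars(key_rows)
```

Only the vars returned here would need new names; everything else could stay as it is.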

piersond commented 5 years ago

@srearl We are not getting much fraction data from our providers thus far, but I'd argue it's one of the more important database components to include for future research. I'm probably a bit biased as soil fractions are a big part of my research, but nonetheless I think those variables are worth keeping.

Stevan, as I understand from above, your plan is to add on to the key V2 script so that it renames the redundant variables. If there's a place to help out in that work let me know. Would it be helpful if I or @wwieder put together a key with the new, unique var names?

srearl commented 5 years ago

Thanks, @piersond. Okay on keeping the fraction-related vars. Yes, if you want to come up with unique names for the duplicates, that would be awesome - and far more meaningful as you have a better sense for intuitive names. You would only need to do this for the duplicates so should not be too much work. Please be sure that the new names do not have spaces or special characters.

piersond commented 5 years ago

@srearl Sure thing, I'll tackle that in just a bit.

piersond commented 5 years ago

@srearl @wwieder As I'm renaming many of the variables to identify if they come from the location, layer or fraction, I wonder if we should standardize such naming throughout the entire dataset? e.g. put either a "loc_", "lyr_", or "frc_" in front of every variable name.

This would ensure data users know what level of data they're working with. I know we have a "Level" column for this already, but from what I observed of data use at our last get together, I'm not sure that the Level column was being used or interpreted by the data users. Thoughts?

Plus, given that we already have that Level column in the keykey, I think such a naming change would be fairly easy to implement.
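A small sketch of deriving the prefixed name from the existing Level column, under the assumption that Level holds the values "location", "layer", and "fraction". The function name and mapping are hypothetical, and the code is Python for illustration rather than anything from the package.

```python
# Assumed mapping from Level values to name prefixes.
LEVEL_PREFIX = {"location": "loc", "layer": "lyr", "fraction": "frc"}


def prefixed_var(var, level):
    """Return var with its level prefix, leaving already-prefixed names alone."""
    prefix = LEVEL_PREFIX[level]  # raises KeyError for an unknown level
    if var.startswith(prefix + "_"):
        return var
    return f"{prefix}_{var}"


layer_name = prefixed_var("c_tot", "layer")
fraction_name = prefixed_var("c_tot", "fraction")
```

Because the prefix comes straight from Level, the two c_tot entries end up with different names without anyone having to invent names by hand.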

piersond commented 5 years ago

[Link to GDrive sheet for keykey var name changes](https://docs.google.com/spreadsheets/d/1HnM1wSguzFe7HEBMsOYrB2vyjs7RTRpfRaI4RxnLdsA/edit?usp=sharing)

srearl commented 5 years ago

awesome @piersond - yeah, I like the approach of incorporating level into the name. right, as you suggest, I do not think we are otherwise including that in the output.

wwieder commented 5 years ago

Apologies for my radio silence yesterday. Reentry from time away is always challenging.

I made a few modifications to @piersond 's suggestions and think this looks good.

One question that may not be worth considering: would a more flexible approach be to add a loc, pro, lyr, or frc prefix before the homogenized name, based on the level already assigned in the key-key? I'm not sure whether this would make scripting and generating the key-key_v2 pages any simpler.

srearl commented 5 years ago

key version 2 utility now includes functionality to change duplicate var names to new, unique names provided by @piersond

see 0208c079d7086fc4eef1d0cb9e17496f02c1fcab
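For readers who do not want to dig through the commit, a lookup-based rename of this kind might look roughly like the sketch below. It is Python written for illustration and does not reproduce the code in the commit; the rename table and its single entry are hypothetical.

```python
# Hypothetical rename table: (old var, level) -> new unique var name.
RENAMES = {("c_tot", "fraction"): "frc_c_tot"}


def rename_vars(key_rows, renames):
    """Apply renames by (var, level) and fail if any var name is still repeated."""
    renamed = [
        {**row, "var": renames.get((row["var"], row["level"]), row["var"])}
        for row in key_rows
    ]
    names = [row["var"] for row in renamed]
    if len(names) != len(set(names)):
        raise ValueError("var names are still not unique after renaming")
    return renamed


key_rows = [
    {"var": "c_tot", "level": "layer"},
    {"var": "c_tot", "level": "fraction"},
]
unique_rows = rename_vars(key_rows, RENAMES)
```

The uniqueness check at the end is the part that matters: it turns a missed duplicate into an immediate error instead of a silent units-conversion problem later.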