add tarball column to filter data by "control" = Y/N

piersond commented 5 years ago

Existing function may be useful. However, it may be simpler to work directly on the tarball: use if statements to look for the ctl value in the levels; if it is found in any of them, update the "Control" column to YES, else NO. Wondering if I had a reason not to do it this way originally?

srearl commented 5 years ago

Requirements/considerations:

- there can be multiple control ids
- studies without treatments will not have a control id, so all data points are effectively controls
- a control id may be attributed to ANY treatment OR experiment level
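A minimal sketch of how these three requirements could be combined, written in Python purely for illustration (the project's own scripts are not shown in this thread). The function name `flag_control`, the row-as-dict layout, the level column names in `LEVEL_COLUMNS`, and the values `ctl_a`, `ctl_b`, and `trt` are all hypothetical; only the YES/NO output follows the comment above.

```python
# Hypothetical sketch: each row is a dict, level column names are assumed.
LEVEL_COLUMNS = ("L1", "L2", "tx_L1", "tx_L2")  # assumed experiment/treatment levels


def flag_control(row, control_ids, has_treatments):
    """Return "YES" if the row counts as a control under OR matching, else "NO"."""
    # Studies without treatments have no control id: every data point is a control.
    if not has_treatments:
        return "YES"
    # A control id may appear in ANY treatment or experiment level column,
    # and there may be several control ids; one match is enough (OR).
    levels = {row.get(col) for col in LEVEL_COLUMNS}
    return "YES" if levels & set(control_ids) else "NO"


rows = [
    {"L1": "ctl_a", "tx_L2": "trt"},   # control id at an experiment level
    {"L1": "trt", "tx_L2": "ctl_b"},   # control id at a treatment level
    {"L1": "trt", "tx_L2": "trt"},     # no control id anywhere
]
flags = [flag_control(r, ["ctl_a", "ctl_b"], has_treatments=True) for r in rows]
no_treatment_flag = flag_control({"L1": "trt"}, [], has_treatments=False)
```

Here `flags` marks the first two rows as controls and the third as not, while a study without treatments is flagged as control regardless of its level values.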

srearl commented 5 years ago

sibling of: https://github.com/srearl/soilHarmonization/issues/34

srearl commented 5 years ago

Looking into the mismatch between the control samples that I identified and those that @piersond identified: @piersond, I did not look into why, but your script is not consistently digging down to the sub-levels (e.g., below L1), nor is it flagging all non-experiment studies as controls.

However, as with just about everything with this project, it is not that straightforward, and those mismatches raise the question of what the appropriate matching is. A good example of this is the NWT snow fence study. Attached are two files: one with the control samples that I identified (_se) and one with the control samples that @piersond identified (_dp). It is a two-factor study looking at the presence of snow and nutrient treatments (there is also plant type, but that is another, separate issue). The experiment details are in two different columns: tx_L1 (snow/no_snow) and tx_L2 (different nutrient treatments). no_snow and CC are identified as the control identifiers. The trick is that it is a nested design with multiple nutrient treatments, including no treatment (control, CC), within both the snow and no_snow treatments. Given the design, I think the only true controls are those characterized by both no_snow AND CC. Derek's approach and mine both misidentify the control samples: mine identifies all samples with either no_snow or CC in the row as controls (thus snow + CC samples are controls), and Derek's identifies any no_snow sample, including those that received nutrient treatments, as a control.

Ugh. This is a case, I think, where samples should be controls if identifiers X AND Y are present, but it seems that most studies are going to be the case where samples are controls if identifiers X OR Y are present. I think we have to go with OR as there is no programmatic way that I can think of to distinguish whether n>1 control identifiers should be OR or AND, then address oddities like this one on a case-by-case basis. Thoughts?
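To make the OR-versus-AND distinction concrete, here is a small Python sketch under stated assumptions: `is_control` and its `require_all` switch are hypothetical names, and `NUT` is a made-up stand-in for any non-control nutrient treatment code. The column names `tx_L1` and `tx_L2` and the control identifiers `no_snow` and `CC` come from the NWT description above.

```python
def is_control(row, control_ids, level_columns, require_all=False):
    """OR matching by default; AND matching when require_all is True."""
    present = {row.get(col) for col in level_columns}
    matches = [cid in present for cid in control_ids]
    return all(matches) if require_all else any(matches)


LEVELS = ("tx_L1", "tx_L2")
CONTROL_IDS = ("no_snow", "CC")

# The four combinations in the nested design ("NUT" is a hypothetical code).
samples = [
    {"tx_L1": "no_snow", "tx_L2": "CC"},
    {"tx_L1": "no_snow", "tx_L2": "NUT"},
    {"tx_L1": "snow", "tx_L2": "CC"},
    {"tx_L1": "snow", "tx_L2": "NUT"},
]

or_flags = [is_control(s, CONTROL_IDS, LEVELS) for s in samples]
and_flags = [is_control(s, CONTROL_IDS, LEVELS, require_all=True) for s in samples]
```

Under OR, three of the four combinations are flagged (including snow + CC); under AND, only no_snow + CC is. A per-study override could then set `require_all=True` for the oddities handled case by case.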

nwt_saddferb_dp.txt nwt_saddferb_se.txt

@wwieder

srearl commented 5 years ago

Now added to the tarball processing script. Though I am not certain, it appears that only the NWT snow fence data use a combination of variables to denote control samples; the script operates under this assumption.

som_multiple_control_ids.txt

wwieder commented 5 years ago

Thanks Stevan. How close did the scripts that you and Derek made end up on the number of 'control' samples?

On Thu, Oct 24, 2019 at 5:05 PM StevanEarl wrote:

> now added to tarball processing script; though not certain, it appears that only NWT snow fence data has a combination of vars to denote control samples, script is operating under this assumption.
>
> som_multiple_control_ids.txt https://github.com/lter/lterwg-som/files/3769766/som_multiple_control_ids.txt


srearl commented 5 years ago

delta = 6,687

wwieder commented 5 years ago

Who's getting more? I'm assuming Derek, because his only uses an OR filter, not an AND?

srearl commented 5 years ago

@piersond and I both take an OR approach; I only applied AND to those NWT data where it was clear that AND for the multiple IDs was warranted. Otherwise, the reason for the discrepancies varies. I have not dug into the mechanics of Derek's script, but it does not consistently span variables. A good example of this is the phys_chem_bio_forest_lawn_csv.csv (BES) data, where Derek's script identifies samples as controls when forest (a control id) is in the L1 column but not when forest is in the tx_L2 column. Not sure why, since it does do this for some data. There is also some discrepancy with experiments. 366_Healy_Soil_C_and_N_inventory_b.csv is a good example of this: I am classifying everything as a control since experiments == NO, but Derek's script only classifies samples with the control id flag as controls. Not sure there is a right or wrong here, just more to think about.

| data\_file | se | dp |
| --- | --- | --- |
| 366\_Healy\_Soil\_C\_and\_N\_inventory\_b.csv | 159 | 50 |
| Copy of hf007-10-soil-prop-1995.csv | 45 | 18 |
| Copy of hf007-11-soil-prop-2000.csv | 63 | 36 |
| Copy of saddferb\_1.ts.data | 47 | 190 |
| e133\_Litter biomass | 6618 | 3282 |
| e133\_Plant aboveground biomass data | 408 | 192 |
| e133\_Root biomass data | 3570 | 1680 |
| e133\_Root ingrowth biomass | 2368 | 1314 |
| e133\_Soil net N mineralization over five incubation periods | 80 | 32 |
| e133\_Soil percent carbon and nitrogen | 42 | 24 |
| e133\_Soil pH | 303 | 143 |
| JRN\_368003\_ant\_nest\_organic\_matter\_data | 27 | 10 |
| phys\_chem\_bio\_forest\_lawn\_csv.csv | 244 | 64 |
| WW\_Copy\_soils\_pH\_2015.csv | 24 | 96 |
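For anyone wanting to rank files by how far the two scripts diverge, a short Python sketch follows. It uses three rows copied from the table above (se count, dp count); the variable names are hypothetical, and the same dictionary could be extended to every row.

```python
# (se, dp) control-sample counts for three files taken from the table above.
counts = {
    "366_Healy_Soil_C_and_N_inventory_b.csv": (159, 50),
    "Copy of saddferb_1.ts.data": (47, 190),
    "phys_chem_bio_forest_lawn_csv.csv": (244, 64),
}

# Positive values: se flags more controls; negative values: dp flags more.
diffs = {name: se - dp for name, (se, dp) in counts.items()}

# Largest absolute disagreement first.
ranked = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
for name, diff in ranked:
    print(f"{name}: {diff:+d}")
```

The sign makes it easy to see which files (such as the saddferb data) run against the general pattern of se flagging more controls than dp.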

srearl commented 4 years ago

I think we can close this, but we should keep in mind that there are some ambiguities about what counts as control data, and that it is possible to have overlooked (and to overlook) particular cases.

lter / lterwg-som

add tarball column to filter data by "control" = Y/N #68