Closed piersond closed 4 years ago
req's/considerations:
Looking into the mismatch between the control samples that I identified and those that @piersond identified: @piersond - I did not look into why but your script is not consistently digging down to the sub-levels (e.g., below L1) nor flagging all non-experiment studies as controls.
However, as with just about everything with this project, it is not that straightforward, and those mismatches raise the question of what is the appropriate matching. A good example of this is the NWT snow fence study. Attached are two files: one with the control samples that I identified (_se) and a file with the control samples that @piersond (_dp) identified. It is a two-factor study looking at the presence of snow and nutrient treatments (there is also plant type but that is another, separate issue). The experiment details are in two different columns: `tx_L1` (`snow`/`no_snow`), and `tx_L2` (different nutrient treatments). `no_snow` and `CC` are identified as the control identifiers. The trick is that it is a nested design with multiple nutrient treatments, including no treatment (control (`CC`)), within both `snow` and `no_snow` treatments. Given the design, I think the only true controls are those characterized by both `no_snow` AND `CC`. Derek's approach and mine both misidentify the control samples: mine because it identifies all samples with either `no_snow` or `CC` in the row as controls (thus `snow` + `CC` are controls), and Derek's because it identifies any `no_snow`, including those that received nutrient treatments, as controls.
Ugh. This is a case, I think, where samples should be controls if identifiers X AND Y are present, but it seems that most studies are going to be the case where samples are controls if identifiers X OR Y are present. I think we have to go with OR as there is no programmatic way that I can think of to distinguish whether n>1 control identifiers should be OR or AND, then address oddities like this one on a case-by-case basis. Thoughts?
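To make the OR-versus-AND distinction concrete, here is a minimal sketch of what "OR by default, AND on a case-by-case basis" could look like. It is not either of the existing scripts: the function name `is_control`, the `mode` argument, and the nutrient code `NN` are hypothetical, and only `tx_L1`, `tx_L2`, `snow`, `no_snow`, and `CC` come from the discussion above.

```python
# Sketch only: is_control, mode, and the "NN" nutrient code are hypothetical.
LEVEL_COLUMNS = ["tx_L1", "tx_L2"]  # treatment-level columns named in this thread


def is_control(row, control_ids, mode="OR", level_columns=LEVEL_COLUMNS):
    """Return True if the row's treatment levels mark it as a control.

    mode="OR":  any control identifier present in the level columns.
    mode="AND": every control identifier present in the level columns.
    """
    if not control_ids:
        return False
    levels = {str(row.get(col, "")).strip() for col in level_columns}
    hits = [cid in levels for cid in control_ids]
    return all(hits) if mode == "AND" else any(hits)


# Nested design as described for the NWT snow fence study.
rows = [
    {"tx_L1": "snow", "tx_L2": "CC"},
    {"tx_L1": "no_snow", "tx_L2": "CC"},
    {"tx_L1": "no_snow", "tx_L2": "NN"},  # "NN" stands in for a nutrient treatment
    {"tx_L1": "snow", "tx_L2": "NN"},
]
control_ids = ["no_snow", "CC"]

or_flags = [is_control(r, control_ids, mode="OR") for r in rows]
and_flags = [is_control(r, control_ids, mode="AND") for r in rows]
print(or_flags)
print(and_flags)
```

Under OR, three of the four rows are flagged (including `snow` + `CC`); under AND, only the `no_snow` + `CC` row is. Which data sets get `mode="AND"` would still have to be decided by hand, as suggested above.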
nwt_saddferb_dp.txt nwt_saddferb_se.txt
@wwieder
Now added to the tarball processing script. Though not certain, it appears that only the NWT snow fence data has a combination of vars to denote control samples; the script is operating under this assumption.
Thanks Stevan. How close did the scripts you and Derek made end up on the number of 'control' samples?
On Thu, Oct 24, 2019 at 5:05 PM StevanEarl notifications@github.com wrote:
som_multiple_control_ids.txt https://github.com/lter/lterwg-som/files/3769766/som_multiple_control_ids.txt
-- Will Wieder Project Scientist CGD, NCAR 303-497-1352
delta = 6,687
Who's getting more? I'm assuming Derek, because his only uses an or filter, not an and?
@piersond and I both take an OR approach; I only applied AND to those NWT data where it was clear that AND for the multiple IDs was warranted. Otherwise, the reason for the discrepancies varies. I have not dug into the mechanics of Derek's script but it does not consistently span variables. A good example of this is the `phys_chem_bio_forest_lawn_csv.csv` (BES) data, where Derek's script identifies samples as control when `forest` (a control id) is in the L1 column but not when `forest` is in the tx_L2 column. Not sure why, since it does do this for some data. There is also some discrepancy with experiments. `366_Healy_Soil_C_and_N_inventory_b.csv` is a good example of this, where I am classifying everything as a control since `experiments == NO` but Derek's script only classifies samples with the control id flag as controls. Not sure there is a right or wrong here, just more to think about.
data\_file | control samples (se) | control samples (dp) |
---|---|---|
366\_Healy\_Soil\_C\_and\_N\_inventory\_b.csv | 159 | 50 |
Copy of hf007-10-soil-prop-1995.csv | 45 | 18 |
Copy of hf007-11-soil-prop-2000.csv | 63 | 36 |
Copy of saddferb\_1.ts.data | 47 | 190 |
e133\_Litter biomass | 6618 | 3282 |
e133\_Plant aboveground biomass data | 408 | 192 |
e133\_Root biomass data | 3570 | 1680 |
e133\_Root ingrowth biomass | 2368 | 1314 |
e133\_Soil net N mineralization over five incubation periods | 80 | 32 |
e133\_Soil percent carbon and nitrogen | 42 | 24 |
e133\_Soil pH | 303 | 143 |
JRN\_368003\_ant\_nest\_organic\_matter\_data | 27 | 10 |
phys\_chem\_bio\_forest\_lawn\_csv.csv | 244 | 64 |
WW\_Copy\_soils\_pH\_2015.csv | 24 | 96 |
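As a rough way to see which script flags more controls per file, the counts can be differenced directly. The sketch below copies four rows from the table above; the helper name `compare_counts` is hypothetical and not part of either script.

```python
# (se, dp) control-sample counts copied from four rows of the table above.
counts = {
    "366_Healy_Soil_C_and_N_inventory_b.csv": (159, 50),
    "Copy of saddferb_1.ts.data": (47, 190),
    "phys_chem_bio_forest_lawn_csv.csv": (244, 64),
    "WW_Copy_soils_pH_2015.csv": (24, 96),
}


def compare_counts(counts):
    """Return (file, se - dp) pairs, largest absolute difference first."""
    diffs = [(name, se - dp) for name, (se, dp) in counts.items()]
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)


for name, diff in compare_counts(counts):
    more = "se" if diff > 0 else "dp" if diff < 0 else "neither"
    print(f"{name}: se - dp = {diff:+d} ({more} flags more)")
```

For these four rows the direction is mixed: se flags more in the Healy and BES files, while dp flags more in the saddferb and `WW_Copy_soils_pH_2015.csv` files.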
I think we can close this, but we should keep in mind that there are some ambiguities about what are control data and that it is possible to have (and to overlook) particular cases.
Existing function may be useful. However, it may be simpler to work directly on the tarball --> use if statements to look for the ctl value in the levels; if found in any level, update the "Control" column to YES, else NO. Wondering if I had a reason not to do it this way originally?
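A minimal sketch of that direct pass, assuming the tarball rows are available as dictionaries. The function name `flag_controls`, the row layout, and the `lawn` value are assumptions for illustration; the `Control` column, the YES/NO values, `tx_L1`/`tx_L2`, and the `forest` control id come from this thread.

```python
# Sketch only: flag_controls and the row layout are hypothetical.
def flag_controls(rows, control_value, level_columns):
    """Set row["Control"] to "YES" if control_value is in any level column, else "NO".

    Rows are modified in place and also returned for convenience.
    """
    for row in rows:
        found = any(
            str(row.get(col, "")).strip() == control_value
            for col in level_columns
        )
        row["Control"] = "YES" if found else "NO"
    return rows


# The control id may sit in either level column, as in the BES example.
rows = [
    {"tx_L1": "forest", "tx_L2": ""},
    {"tx_L1": "", "tx_L2": "forest"},
    {"tx_L1": "lawn", "tx_L2": ""},  # "lawn" is an assumed non-control level
]
flag_controls(rows, control_value="forest", level_columns=["tx_L1", "tx_L2"])
print([row["Control"] for row in rows])
```

Checking every level column in one pass avoids the case noted earlier where a control id in `tx_L2` is missed. It does not handle non-experiment studies or the AND case for NWT; those would need separate rules.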