mdaeron / D47crunch

Python library for processing and standardizing carbonate clumped-isotope analyses, from low-level data out of a dual-inlet mass spectrometer to final, “absolute” Δ47, Δ48, and Δ49 values with fully propagated analytical error estimates.

Improve `plot_distribution_of_analyses` or throw warnings in case there are many sessions/samples #7

Closed: japhir closed this issue 2 years ago

japhir commented 2 years ago
  1. Ok I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.

Agreed in principle, but such a plot could be pretty useful for anybody; there's nothing specific to Jens' study here. Again, if it works poorly for your use case, let me know and we can improve it if you feel it's worth it.

Originally posted by @mdaeron in https://github.com/mdaeron/D47crunch/issues/5#issuecomment-916969626

I think the main issue with my outputs is that while I have only 3 sessions, I ran the script for ALL measurements ever, so the number of unique Samples was huge. The figure ended up with so many rows that it was illegible and basically a rectangle full of blue points.

mdaeron commented 2 years ago

In that case a quick and dirty improvement would be to let users arbitrarily redefine the figure dimensions. Could you play with 6e71f67c9be521442dcfa46d6702d330fbc064f9 and let me know if it helps?
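Something along these lines should then be possible (a sketch only: figsize is the assumed name of the new dimension argument, so check the commit for the actual keyword, and mydata stands for an already-standardized D47data instance):

# request a much taller figure so that hundreds of sample rows stay legible;
# the keyword name (figsize) and the dimensions below are assumptions to adapt
mydata.plot_distribution_of_analyses(figsize = (8, 30))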

japhir commented 2 years ago

That helps! And works.

But I think I misdiagnosed the problem: I was thinking of "Samples" as replicate measurements of the exact same sample (e.g. IODP Leg 208 Site 1264 Hole A 28H2 96 cm–97.5 cm), but for clumped isotopes we need to calculate averages over longer time scales to get good reproducibility, so a "sample" is now something else: a period of time over which you assume the material to be homogeneous enough to chuck it all together.

Obviously we're not at a stage yet where we need 100s of rows in this figure, because that would be a crazy well replicated study!

I guess I should have re-read your paper prior to messing around with the code and complaining that it wasn't working ;-)

mdaeron commented 2 years ago

FTR:

japhir commented 2 years ago

Thanks, just to clarify further: If I specify UID to mean one acid reaction and Sample to mean the things that I want to calculate averages for, is there a way to account for other "block" effects, in this case the unique label that was sieved/picked/cleaned separately but may have multiple analyses? Or, say, two benthic species picked from the same sample that we want to average together but might display some block effects, especially on d13C and d18O?

mdaeron commented 2 years ago

Just in case, beware that by defining a Sample (in D47crunch terminology), you postulate that it is isotopically homogeneous, and this assumption will play into your estimates of analytical repeatability. So you should not assign identical sample names to materials that you expect to have different isotopic compositions just for the sake of estimating their average Δ47 value.

Instead, even if you treat materials of different isotopic compositions as different samples, you may still average their Δ47 values after the standardization step, using D4xdata.sample_average():

# compute the unweighted average of Δ47(sample1), Δ47(sample2), and Δ47(sample3)
# and the corresponding standard error:
mydata.sample_average(samples = ['sample1', 'sample2', 'sample3'])

# compute the weighted average of Δ47(sample1), Δ47(sample2), and Δ47(sample3)
# with respective weights of 1, 2, and 3, and the corresponding SE:
mydata.sample_average(
    samples = ['sample1', 'sample2', 'sample3'],
    weights = [1, 2, 3])

# compute the difference between Δ47(sample2) and Δ47(sample1)
# and the corresponding SE:
mydata.sample_average(
    samples = ['sample1', 'sample2'],
    weights = [-1, 1])

Another option is to use D4xdata.combine_samples(), which is specifically designed to compute average Δ47 values for bins of samples. This function also returns the full covariance matrix for these binned Δ47 values. For example, if samples foram_10, foram_11, and foram_12 are all Holocene and samples foram_18 and foram_19 correspond to the LGM, calling combine_samples(dict(Holocene = ['foram_10', 'foram_11', 'foram_12'], LGM = ['foram_18', 'foram_19'])) will return the average Δ47 value for the Holocene group of samples, the average value for the LGM group, and the 2x2 covariance matrix of these two averages. This is how I would personally approach your benthic species example.
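In code, that example would look something like this (assuming mydata is your already-standardized D47data instance):

# bin the foram samples into a Holocene group and an LGM group
groups = dict(
    Holocene = ['foram_10', 'foram_11', 'foram_12'],
    LGM = ['foram_18', 'foram_19'],
    )

# returns the group names, their average Δ47 values,
# and the 2x2 covariance matrix of these averages
group_names, D47_avg, D47_CM = mydata.combine_samples(groups)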

japhir commented 2 years ago

Hi again Mathieu!

After all this time I'm taking another look at what you proposed here. I've now managed to calculate clumped values for each 2-cm sample separately. To combine them, I had to create a massive dict that contains all 1335 samples. I first tried to read in a csv that links each sample name to its bin, but struggled too much, and in the end made a massive dict definition just using some search/replace on a csv export.

Now I managed to use combine_samples() with this dict, and it returns, as it says on the tin:

  • the list of group names
  • an array of the corresponding Δ4x values
  • the corresponding (co)variance matrix

but now I'm not sure how to combine these again so that I get a table of:

| bin id | D47 average | D47 95% CI |

Basically I don't know how to calculate the 95% CI from an N×N covariance matrix, or how to export the results to a csv.

I also noticed that the averages are different from the ones that I get in my R script, but this may be because of the ETF pooling that you do, or the offset correction that we do. Will have to look into that part later.

Any help would be appreciated! :)

mdaeron commented 2 years ago

To combine them, I had to create a massive dict that contains all 1335 samples. I first tried to read in a csv that links each sample name to its bin, but struggled too much, and in the end made a massive dict definition just using some search/replace on a csv export.

For the record, one simple way to do this kind of thing is:

csv_contents = '''
sample1, bin1
sample2, bin1
sample3, bin1
sample4, bin2
sample5, bin2
sample6, bin2'''[1:]

bindata = [l.split(', ') for l in csv_contents.split('\n')]
bins = {_[1] for _ in bindata} # NB this is a set, not a dict
bindict = {b: [sample for sample, bin in bindata if bin == b] for b in bins}

# this results in:
# 
# bindict = {
#   'bin2': ['sample4', 'sample5', 'sample6'],
#   'bin1': ['sample1', 'sample2', 'sample3'],
#   }
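If the sample-to-bin mapping already lives in a csv file on disk, the same dict can be built with the standard csv module, something like this (bins.csv is just an assumed file name, with one "sample, bin" pair per row):

import csv

# build {bin: [samples]} directly from a two-column csv file
bindict = {}
with open('bins.csv') as fid:
    for row in csv.reader(fid, skipinitialspace = True):
        if not row:
            continue # skip blank lines
        sample, b = row
        bindict.setdefault(b, []).append(sample)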

Now I managed to use combine_samples() with this dict, and it returns, as it says on the tin:

  • the list of group names
  • an array of the corresponding Δ4x values
  • the corresponding (co)variance matrix

but now I'm not sure how to combine these again so that I get a table of:

| bin id | D47 average | D47 95% CI |

Basically I don't know how to calculate the 95% CI from an N×N covariance matrix, or how to export the results to a csv.

By default I estimate 95% confidence limits by multiplying the standard error for each bin (the square root of the corresponding diagonal element of the covariance matrix) by the relevant Student's t factor for the number of degrees of freedom in the standardization model. This t factor is conveniently stored in the D47data object's t95 property. So if your D47data instance is mydata, and mydata.combine_samples() returns mybins, D47_avg, D47_CM, you could do:

with open('binned_output.csv', 'w') as fid:
    fid.write('bin,D47,D47_SE,D47_95CL')
    for k in range(len(mybins)):
        SE = D47_CM[k,k]**0.5
        fid.write(f'\n{mybins[k]},{D47_avg[k]},{SE},{mydata.t95 * SE}')

I also noticed that the averages are different from the ones that I get in my R script, but this may be because of the ETF pooling that you do, or the offset correction that we do. Will have to look into that part later.

Also, remember that binned samples are weighted by number of replicate analyses, which is a reasonable default behavior but is not always optimal. If you want to use different weights, look at the code; it should be simple to add an option to weight samples equally. If this is something you think is necessary, feel free to let me know / open an issue / submit a PR.
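In the meantime, a possible workaround (just a sketch, not a tested recipe) is to average each bin yourself with sample_average(), whose default weights are equal across samples; the trade-off is that you lose the between-bin covariances that combine_samples() gives you:

# equal-weight average for each bin, assuming sample_average() returns
# the average and its standard error as in the examples above
binned = {b: mydata.sample_average(samples = bindict[b]) for b in bindict}

for b, (avg, se) in binned.items():
    print(f'{b}: {avg:.4f} ± {mydata.t95 * se:.4f} (95% CL)')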

Any help would be appreciated! :)

Hope this helps. I'm just happy someone is using this library apart from myself!