rfl-urbaniak / MRbook


accuracy, imprecision_philosophical_paper2.Rmd revised #10

Open rfl-urbaniak opened 11 months ago

rfl-urbaniak commented 11 months ago

Resolves #11 .

Doesn't compile (the old natbib problem). Nikodem will have to check compilation and use it downstream for the quarto version.

rfl-urbaniak commented 10 months ago

@marcellodibello @Niklewa

The revised version is now up for a batch of Marcello's modifications, and potentially a small PR from Nikodem if he's interested.

  1. Precise v. imprecise probabilism: good, just minor things
    • The description of belief inertia is not completely clear to me. Explain basic intuition (a few lines) or just add a reference.

REPLY: added some explanation, marked with \todo, check.

    • The Rinard example is intriguing but I also do not understand it. Maybe add a few lines about the key intuitions?

REPLY: revised the explanation, revision marked with a margin note, check at some point.

    • Is the objection to imprecise probabilism about compatibility a novel objection or did others already formulate it? Just make that clear. The objection is otherwise very interesting.

REPLY: expanded on this, check.

  2. Higher order probabilism: good, just some minor things
    • In talking about expectation (p. 7), why do you say that x has to be the "objectively appropriate" degree of belief? Why not simply say that x is a degree of belief, whether appropriate or not? Why do you need that qualification? It seems controversial and I am not sure it is necessary at this point.

REPLY: added some explanation about this.

    • For readability's sake, it'd be good to show that the objections against imprecise probabilism do not apply to higher order probabilism following the same order in which the objections were presented earlier. It'd be good to briefly recap the objections first and then show that they do not apply here. This is just a minor expository tweak.

!REPLY:

    • How does higher order probabilism address the "compatibility objection" against imprecise probabilism?

REPLY: added a take on this, marked with a margin note, check

    • I am not too impressed by the last two paragraphs of this section. Let's keep them for now, but I need to think about them.

REPLY: yeah, marked on a margin for now, we'll get back to this at some point.

  3. One more question: maybe it is helpful to show why imprecise probabilism does not run into the same potential issues that higher order probabilism runs into. Let me explain. Higher order probabilism needs to explain how we conceptualize higher order probabilities (p. 9). Instead, imprecise probabilism can simply say that an agent can entertain many (compatible) probabilities. Is this why the problem does not arise for imprecise probabilism? Is this an advantage of imprecise probabilism?

REPLY: tried to add some explanation of this, marked on the margin in the paper.

  4. The two sections about the legal examples do not work well with the rest of the paper. I am not sure what to do with them. I need to think about them. They break the flow of the argument. We need to think of how to integrate them or remove them. To be discussed. DISCUSSION: redress as data combination challenge against both probabilism and imprecise probabilism

  5. Accuracy in the second order section: great section, but some issues below

    • This section can come right after the section on higher order probabilism

REPLY: moved


    • The structure of the section is not completely clear to me, a bit more signposting would help, need to discuss
    • The last paragraph of the section should come a bit earlier, probably before the example
    • There is a lot of formal stuff here (needed), but adding intuitive explanations here and there will help the reader

I propose Nikodem takes a stab at this in a new PR off this one. I'll discuss this with him.


Discussion

    • I am not sure about the hierarchical model discussion. Need to think about this more.

REPLY: Sure.

    • The penultimate paragraph (discussing Bradley) is very interesting. I am wondering if this should go somewhere else.

REPLY: feel free to move it as appropriate when doing your round of revisions.

    • The last paragraph about semantics is also very interesting. I wonder if all the philosophical challenges about the higher order approach (semantics, conceptualization of higher order probabilities etc.) should form a separate section.

REPLY: I'm inclined to say no, as this paper is already very long, and we don't have anything new to say other than: hey, if you're worried, take a look at this piece of literature.

  6. I have an idea on how to fix the section on the two items of match evidence. The general line of argument in the paper is that higher order probabilism can address problems that imprecise probabilism fails to address (belief inertia, proper scoring rules, etc.). Now, the section on the two items of match evidence can (a) formulate a new challenge for imprecise probabilism and (b) show that this challenge can be addressed by higher order probabilism. (a) The new challenge for imprecise probabilism is how to do updating given two pieces of evidence. Would point-wise updating on all the probability functions within a set do? It seems there will be problems with that. So the section should describe these problems. Currently the section talks about the problems with using intervals, but not with sets of probabilities. So this needs to be reworked, but it should not be too difficult to do. (b) The next step is then to show that higher order probabilism fares better along the lines already described in the section. So that part should be quick to do, just using existing materials.

REPLY: modified the section.
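
For concreteness, a quick illustration of what the point-wise updating mentioned in (a) would look like (a minimal R sketch with made-up prior set and likelihood ratios; not the paper's code):

```r
# Point-wise updating of a set of priors by two items of match evidence (illustrative values).
priors <- seq(0.1, 0.9, by = 0.1)   # a credal set of priors for the hypothesis
lr1 <- 10                           # likelihood ratio of the first match
lr2 <- 20                           # likelihood ratio of the second match

# Bayes' rule in odds form, applied to each member of the set separately
update <- function(prior, lr) {
  odds <- prior / (1 - prior) * lr
  odds / (1 + odds)
}

post1 <- update(priors, lr1)        # set after the first item
post2 <- update(post1, lr2)         # set after both items
range(post2)                        # the resulting (still imprecise) posterior set
```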

marcellodibello commented 10 months ago

I had a look at the changes. The main issue now is readability, structure, and cohesiveness of the different parts of the paper. I think that some sections should be shortened significantly and made more coherent with the rest of the paper (e.g., the section on Bayesian networks and the section on two items of evidence).

Here are more specific comments:

  1. The accuracy section is still messy and hard to follow. This is what Nikodem is assigned to fix, right?

  2. The section about the two items of evidence is now, to some degree, better integrated with the rest of the paper, but it is still rough and disjoint from the rest. In particular, the section discusses imprecise probabilism only very briefly. It mostly criticizes the interval approach (using upper or lower values), but the interval approach is not talked about in the preceding sections. So there is still a lack of coherence between this section and previous sections.

  3. The section about higher order Bayesian networks could probably be shortened significantly. The issue with this section is, again, how does it fit with the rest of the paper? Perhaps we can say that -- once again -- the imprecise probabilist would not be able to use Bayesian networks. If the imprecise probabilist cannot update on two items of evidence, a fortiori it will be unable to update with multiple items of evidence and keep track of complex bodies of evidence. So this would show that imprecise probabilism is really an unworkable theory. Then, we can show that, instead, higher order probabilism can use Bayesian networks. I think this point can be made quickly and might not need a self-standing section. (There is no need to have a fully worked out example.)

marcellodibello commented 10 months ago

One addition to my earlier comments. I found a (very formal!) paper that discusses imprecise Bayesian networks: https://arxiv.org/abs/2302.09656 I have not read it, but it looks like imprecise Bayesian networks cannot be easily dismissed... See also this, early on: https://www.sciencedirect.com/science/article/pii/S0004370296000215

marcellodibello commented 10 months ago

I revised section 2 (precise versus imprecise probabilism) of the higher order probability paper. I will work on section 3 on higher order probabilism this week.

A few more comments (see in particular question in point 3 below and also point 4):

1) The discussion of the compatibility objection in section 2 has not been revised. I think it might go somewhere else. I still need to think about where the best place to have that discussion is.

2) Having removed for now the bit about incompatibility, section 2 only makes three points against imprecise probabilism: (a) imprecise probabilism cannot handle uneven bias; (b) it suffers from belief inertia; and (c) there are no proper scoring rules for imprecise credences. I think this is already a strong case against imprecise probabilism. The goal of section 3 will be to address these three specific points using higher order probabilism. Working on section 3 this week.

3) I have been thinking about Joyce's response to belief inertia. He says that if you start with knowing nothing, then you learn nothing. I think this response as formulated is ambiguous between two readings: you know nothing at all, or you know nothing about a specific proposition but know a lot about other things. So I wonder which of the following is true: (*) if you know nothing at all about anything, then you cannot learn anything (this seems true to me; does higher order probabilism capture this intuition? what about imprecise probabilism?); (**) if you know nothing at all about proposition A (but know something about other propositions), then you will never learn anything about A (this seems a problem; is this what is meant by belief inertia? I suppose imprecise probabilism falls into this problem, but higher order probabilism does not, is this right?). Might be good to make that distinction in response to Joyce in section 2 when we talk about belief inertia. @rfl-urbaniak feel free to add this kind of response if it makes sense in section 2.

4) On the issue of incompatibility, I think that the work of Sarah Moss on imprecise probabilism might need to be cited. Can @rfl-urbaniak or @Niklewa take a look at Sarah Moss's work on imprecise probabilism and see if it is relevant for our work? I am referring to this paper: Global Constraints on Imprecise Credences: Solving Reflection Violations, Belief Inertia, and Other Puzzles by Sarah Moss.

marcellodibello commented 9 months ago

So I made further revisions. Here are the main changes:

Moving forward, I think the paper will just need another section on "evidence aggregation" (which will deal with aggregation of multiple items of evidence and complex bodies of evidence, and then mention higher order Bayesian networks). We should probably keep all this fairly general without getting into too specific legal examples, since this is a paper for a philosophy journal.

marcellodibello commented 9 months ago

Started working on the section on evidence aggregation. There is a lot of work to do with this section. It needs to fit with the argument of the paper, and currently it does not fit well with the rest. It was written for another purpose, so it needs to be rethought substantially. I am thinking that we should make a few distinct points:

1) precise probabilism can handle simple aggregation (independent items of evidence), but is not expressive enough.

2) imprecise probabilism cannot handle simple evidence aggregation. Why not? Is the problem that the uncertainty will get larger and larger as we update on more evidence?

3) higher order probabilism can handle aggregation.

Then, I guess, we should repeat the same three points for complex cases of evidence aggregation using Bayesian networks.

marcellodibello commented 9 months ago

Made further changes to the section on evidence aggregation. I have cut a lot of it and am trying to make it fit with the rest of the paper, though I am having trouble. I am wondering whether that section really raises a new challenge for imprecise probabilism or simply illustrates, with a real example, problems that were discussed in earlier sections. I will work some more on it, but I am feeling a bit discouraged.

rfl-urbaniak commented 9 months ago

Hey @marcellodibello, please don't get discouraged. Nikodem and I have been fighting an external sprint deadline @Basis (check out this minimal viable product here; you need to access it from a computer, it doesn't work on a phone yet:

beacon.basis.ai

)

But this is almost over and I don't plan longer travels till January. My plan is to block some book time and stick to it, making steps forward!

rfl-urbaniak commented 9 months ago

I now even set up a dedicated book laptop to keep me focused!

1) Re: dropping compatibility

I think it is an important objection, as it suggests that imprecise probabilists, deference aside, have no clear story about how beliefs as represented by sets of probabilities are to be formed in light of evidence.

2) Re: Joyce's reply: but you don't know nothing. You know there is a single latent variable, you know it's Bernoulli trials, etc.

Will pay attention when reading Section 2.

3) Re: compatibility. Will read Moss's paper and we'll discuss this.

4) Re: aggregation, I think you're right that the early formulation and the stuff later are in some sense one big problem.

Anyway, I'll read all up to the aggregation section and report.

marcellodibello commented 9 months ago

You are right, @rfl-urbaniak, I should not get discouraged! Your plan sounds good. I've accepted a later meeting time moving forward so it is not so early for you.

On compatibility---yes, it is back in the paper. I agree it is important.

rfl-urbaniak commented 9 months ago

My comments to the most recent version:

https://www.dropbox.com/scl/fi/sj8wi36ftyyew7h0mfsad/imp_philosophical2023-11-24.pdf?rlkey=pvnl10scep5dqw4pmjhsrntuo&dl=0

marcellodibello commented 9 months ago

Addressed all the comments by Rafal in the Dropbox file up to section 4 (higher order probabilism). I am now working on section 6 on evidence aggregation. I noticed @Niklewa is working on the accuracy section 5 on another branch, so I will not touch that section for now.

marcellodibello commented 9 months ago

Comments on the accuracy section. The section is much improved but I still don't quite follow the argument, probably because of my own limitations.

Here is a general suggestion on structure:

Start with the Brier score: the average squared distance between a true outcome (1 or 0) and a probability (forecast) over a number of prediction instances. This seems easy. Might be good to have a statement of the Brier score as a starting point. Might also be good to have a short argument that the Brier score is a proper scoring rule, so the reader understands what it means for a score to be proper.

Next, show the following: there is a proper scoring rule for higher order probabilism. I think this claim should be made quickly and upfront. Start by stating the Brier score and showing that it is a proper score. Then, state formally the proposed scoring rule for higher order probabilism, highlighting analogies and differences with the Brier score, and finally show that it is a proper scoring rule (informally, and refer to the appendix for details).
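
A minimal R sketch of the kind of statement and propriety check suggested here (illustrative only, binary case; not the paper's code):

```r
# Brier score for a single binary forecast, and a numerical check that reporting
# your actual credence minimizes the expected Brier score (i.e. the score is proper).
brier <- function(forecast, outcome) (outcome - forecast)^2

# Expected Brier score of reporting r when the agent's actual credence is p
expected_brier <- function(r, p) p * brier(r, 1) + (1 - p) * brier(r, 0)

p <- 0.7
r_grid <- seq(0, 1, by = 0.01)
r_grid[which.min(sapply(r_grid, expected_brier, p = p))]   # returns 0.7: the true credence
```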

  1. Lots of the section is devoted to making another point, other than that there is a proper scoring rule for higher order probabilism. The section devotes quite a bit of time to trying to show that we can vindicate the intuition that a certain distribution in the two coin example is better than others (faithful bimodal is better than wide bimodal). This is a related, but separate, point from the one about proper scoring rules, and I would keep the two clearly separate to avoid confusing the reader.

  2. On CRPS. I know this is a preliminary to formulating the proper scoring rule in higher order probabilism, but I wonder whether this discussion is distracting. In case you want to keep it, I have a question. How does CRPS differ from the Brier score? Is the difference ONLY that for the Brier score we have a discrete probability distribution, while for CRPS we have a continuous distribution? Is this the only difference? (Since later we use a discretized grid approximation, it looks like we are back to the case of a discrete distribution, right? So are we then effectively using the Brier score again?)
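
For what it's worth, a minimal sketch of CRPS on a discretized grid (illustrative distributions and grid; not the paper's code). The contrast with the Brier score is that the squared distance is taken between cumulative distributions, so it applies to forecasts over a continuum (here, coin biases):

```r
grid <- seq(0, 1, length.out = 1000)   # possible coin biases
dx   <- diff(grid)[1]

crps <- function(density, outcome) {
  forecast_cdf <- cumsum(density) * dx          # assumes density is normalized on the grid
  outcome_cdf  <- as.numeric(grid >= outcome)   # omniscient step CDF at the true value
  sum((forecast_cdf - outcome_cdf)^2) * dx
}

forecast <- dnorm(grid, mean = 0.5, sd = 0.1)
forecast <- forecast / (sum(forecast) * dx)     # renormalize on the grid
crps(forecast, outcome = 0.7)                   # inaccuracy if the true bias is 0.7
```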

  3. The use of KL divergence is introduced very suddenly with no explanation. It seems quite different from CRPS, so it needs some motivation for why we are using it.

  4. Next we move to the example with the two coins whose distributions are centered at .3 and .5. It is not clear at this point what we are trying to do. Note that here it is not clear how this relates to the issue of proper scoring rules, which seems to be the core of the section. There are a number of distributions that could represent the two coins -- okay -- and we are trying to see which of those distributions is the "best". But are we measuring the distance between these distributions and a series of true outcomes? Are we imagining tossing these coins multiple times and seeing how close those distributions come to the series of T and H that we actually get? Is this what we are doing? If so, do we need an argument that one of those pairs of distributions is more accurate than the others? We simply assume that faithful bimodal is more accurate, but how do we know that?

  5. Related question: how do we tie the discussion of the two coins back to the case of the Brier score (which measures the distance between predicted probability and true outcome)?

  6. We say there are six inaccuracy scores -- is this because there are six possible true outcomes H1H2, H1T2, T1H2, T1T2? So we need to apply CRPS six times? This seems right. Suddenly, however, we are not measuring the distance between predicted probabilities and true outcomes, but we are looking at expected inaccuracies? This seems plausible, but can we say something to motivate this move to expected inaccuracy? How does it compare to the standard procedure used for the Brier score in precise probabilism? Are we using expected inaccuracies there too? Wouldn't it be good to show the continuity a bit more clearly, emphasizing the role of expected inaccuracy in precise probabilism as well?

  7. I do not completely follow the formula at the bottom of page 7. If there are 6 CRPS scores -- depending on the six possible true outcomes with two coins -- shouldn't we be adding six terms in that formula, where the expected values are the values for H1H2, H1T2, T1H2, T1T2? Why are we only adding two terms, with expected values for H and T only? This would work for the case of one coin only, but not for the case of two coins we are considering. Am I missing something?

  8. I do not fully understand "our proposal". It says: "Rather than measuring inaccuracy in relation to "true states of the world" conceptualized as two omniscient credences that peak at either 0 or 1 and then averaging using expected values, we should instead utilize a set of N potential true probability hypotheses". Are we measuring the distance between a given probability distribution and "N potential true probability hypotheses"? Can this be explained a bit more? What are these true probability hypotheses? What do they represent? Do they represent true chance hypotheses (say, hypotheses about what the actual bias of the coin is)?

  9. It'd be good to have everything stated in the fully continuous version (and then, only in a footnote, have the grid approximation version). I do not think the reader should be distracted with grid approximation in the main text.

  10. Looks like, in the end, we go with a version of KL divergence, not CRPS, right? But where is the statement of the proper scoring rule using KL divergence? The section only gives the statement using CRPS (p. 12).

rfl-urbaniak commented 9 months ago

add a paragraph in the beginning explaining the dialectics

add a footnote about Brier score

explain why CRPS is defined in terms of cumulative distributions

add a few more words of motivation for KL

be clearer about I being a placeholder and about what KL and CRPS are

pump up the factory story; make the distinction between the objective generative process and the subjective distribution you want to match it

add a footnote ideally about the continuous case

marcellodibello commented 9 months ago

Revised section 6 on evidence aggregation (the simple case) and then paired it with section 7, also on evidence aggregation (the complex case, higher order Bayesian networks). I mostly worked on section 6.

So you can have a look at section 6. I have a question about the overestimation and underestimation objection, see comments on the margin on pp. 17 and 18. Have a look and see what you think.

Niklewa commented 9 months ago

Revised and then worked with Rafal on the section on accuracy and the appendix on accuracy; feel free to reread them. The key points listed in the discussion above have been implemented; please double-check whether your head hurts less when you read these sections now.

marcellodibello commented 8 months ago

I had a look at the appendix and the accuracy section. They have been significantly improved and are close to being done as far as I can tell.

The appendix looks good to me. I do not understand it--my fault--but the structure and explicit point about what is common knowledge and what is novel is good. I am going to try to understand it, but I think it looks good.

The section on accuracy could benefit from some more explanations, but it is getting there I think. Specific suggestions:

  1. Give a quick, toy-like example of the calculations of the two distance measures. For example, suppose the true generative mechanism is a normal centered at 1 (say this is p) and you think the mechanism is a normal centered at 0.5 (this is q). How is the difference between the two computed? Or, maybe more relevant to what is discussed, suppose the true generative mechanism is a two-peaked distribution, but you think it is a normal centered at .4. How is the difference between the two computed? Something like that.

  2. Before getting to expected inaccuracy and proper scores, I have a preliminary question. If one goal here is to say that faithful bimodal is close to the true generative process (in fact it is identical to it), then this should simply follow by applying any distance measure between distributions. If the bimodal is the true generative process, then the bimodal will be close to it. Any distribution is the closest to itself, no? So what is all the fuss? I understand the point about proper scores (which is another, much less trivial thing), but the point about faithful bimodal being more accurate seems trivially true, no? Unless you are arguing that faithful bimodal is more accurate relative to some other distribution, say a distribution of outcomes 1 and 0 or something like that? What is the claim exactly?

  3. Expected inaccuracy: looks like the subscript is off. Minor thing!

  4. Expected inaccuracy, p. 10, needs more intuitive explanation. The assumption is that some distribution, say p, is the distribution that putatively represents the agent's credal uncertainty about something, while q is the true objective distribution, right? So you are measuring the distance between p and q relative to the true q distribution? Am I getting it right? Or is it the case that sometimes you want to measure the accuracy of p relative to p itself (especially when you test expected inaccuracy as a proper score)? Is this right? So you can measure the expected inaccuracy of p relative to q or relative to p itself? If this is right, it might be good to make this point, also for the discussion that follows about proper scores. If this is wrong, uhm, then I don't completely follow the argument... What does it mean to measure the accuracy of p relative to itself? It is the expected inaccuracy of p relative to p, right?

  5. Now, on expected inaccuracy, the binary example. One thing that is not clear to me is why the expectation E also occurs on the right hand side of the equality sign. I am looking at the general definition of expected inaccuracy on the page before: should it be just q(heads) and q(tails) instead of Eq(heads) and Eq(tails)? This would be the natural way to apply the more general definition, no? But see next comment below.

  6. You say that taking the expectations Eq(heads) and Eq(tails) could be simpler, and then you go on to give an elaborate argument to that effect. But this seems convoluted to me, possibly confusing. What you end up doing is taking the expected inaccuracy of distribution p relative to p itself -- that is, the flattened-out version in terms of the expected values. So you are essentially measuring the expected inaccuracy of p relative to the expected values of p. Is this correct? Unless someone is arguing for using such a measure, I would not give it too much prominence. I would cut some of pages 11, 12 and the top of 13.

  7. I would go straight to the application of the expected inaccuracy measure -- "our proposal" in the middle of page 13 -- using the Schoenfield example.

  8. The part that is really key here is this:

    '"we should instead utilize a set of n potential true probability hypotheses (ideally, going continuous, but we’re working with a discrete grid of n = 1000 possible coin biases in this paper). We then compute all the inaccuracies with respect to each of these𝑛values represented by “omniscient” distributions (or true chance hypotheses) and determine the expected inaccuracy scores using the entire distributions rather than relying solely on the expected values of the distributions"

This paragraph...I do not fully understand. I have read it a few times, and I am getting a sense of where you are going and it seems right, but can you give an example of how this works? Just give the intuitions. A few lines should be enough.

  9. More specific question on the same paragraph. Since we are measuring the expected inaccuracy of p relative to q, what is our p and what is our q here? You say there are 1000 possible coin biases. Will these 1000 biases take the place of the true/false w (=1 or 0) in the definition of expected inaccuracy at the bottom of page 10? What would q(w) be in this new approach? Would that be p(w), since we are measuring the expected inaccuracy of p relative to p? So we are just looking at the probabilities that p itself assigns to the various possible 1000 biases? I am confused... I think this just needs to be explained more clearly.

rfl-urbaniak commented 8 months ago
  1. Give a quick, toy-like example of the calculations of the two distance measures. For example, suppose the true generative mechanism is a normal centered at 1 (say this is p) and you think the mechanism is a normal centered at 0.5 (this is q). How is the difference between the two computed? Or, maybe more relevant to what is discussed, suppose the true generative mechanism is a two-peaked distribution, but you think it is a normal centered at .4. How is the difference between the two computed? Something like that.

TODO: NIKODEM, MAKE SURE THIS CAN BE USED IN AN ONGOING EXAMPLE
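
A possible starting point for that running example (my sketch, grid-discretized KL; distributions as requested in the quoted comment). An analogous CRPS computation would compare the two cumulative distributions instead:

```r
grid <- seq(-3, 3, length.out = 2001)
dx   <- diff(grid)[1]
normalize <- function(d) d / (sum(d) * dx)

# KL divergence on the grid (in bits), with the 0 * log 0 = 0 convention
kl_div <- function(p, q) { nz <- p > 0; sum(p[nz] * log2(p[nz] / q[nz])) * dx }

# True mechanism a normal centered at 1, the agent's guess a normal centered at 0.5
p <- normalize(dnorm(grid, mean = 1,   sd = 0.5))
q <- normalize(dnorm(grid, mean = 0.5, sd = 0.5))
kl_div(p, q)

# Two-peaked true mechanism vs a unimodal guess centered at 0.4
p2 <- normalize(0.5 * dnorm(grid, 0.2, 0.05) + 0.5 * dnorm(grid, 0.6, 0.05))
q2 <- normalize(dnorm(grid, 0.4, 0.2))
kl_div(p2, q2)
```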

  2. Before getting to expected inaccuracy and proper scores, I have a preliminary question. If one goal here is to say that faithful bimodal is close to the true generative process (in fact it is identical to it), then this should simply follow by applying any distance measure between distributions. If the bimodal is the true generative process, then the bimodal will be close to it. Any distribution is the closest to itself, no? So what is all the fuss? I understand the point about proper scores (which is another, much less trivial thing), but the point about faithful bimodal being more accurate seems trivially true, no? Unless you are arguing that faithful bimodal is more accurate relative to some other distribution, say a distribution of outcomes 1 and 0 or something like that? What is the claim exactly?

TODO: emphasize how accuracy is wrt the true outcome. There is only one coin with a single bias. What's your notion of accuracy wrt such things? MARCELLO LATER?

  3. Expected inaccuracy: looks like the subscript is off. Minor thing!

TODO: Nikodem fix this please.

  4. Expected inaccuracy, p. 10, needs more intuitive explanation. The assumption is that some distribution, say p, is the distribution that putatively represents the agent's credal uncertainty about something, while q is the true objective distribution, right? So you are measuring the distance between p and q relative to the true q distribution? Am I getting it right? Or is it the case that sometimes you want to measure the accuracy of p relative to p itself (especially when you test expected inaccuracy as a proper score)? Is this right? So you can measure the expected inaccuracy of p relative to q or relative to p itself? If this is right, it might be good to make this point, also for the discussion that follows about proper scores. If this is wrong, uhm, then I don't completely follow the argument... What does it mean to measure the accuracy of p relative to itself? It is the expected inaccuracy of p relative to p, right?

TODO: NIKODEM CONTINUES THE RUNNING EXAMPLE

  5. Now, on expected inaccuracy, the binary example. One thing that is not clear to me is why the expectation E also occurs on the right hand side of the equality sign. I am looking at the general definition of expected inaccuracy on the page before: should it be just q(heads) and q(tails) instead of Eq(heads) and Eq(tails)? This would be the natural way to apply the more general definition, no? But see next comment below.

  6. You say that taking the expectations Eq(heads) and Eq(tails) could be simpler, and then you go on to give an elaborate argument to that effect. But this seems convoluted to me, possibly confusing. What you end up doing is taking the expected inaccuracy of distribution p relative to p itself -- that is, the flattened-out version in terms of the expected values. So you are essentially measuring the expected inaccuracy of p relative to the expected values of p. Is this correct? Unless someone is arguing for using such a measure, I would not give it too much prominence. I would cut some of pages 11, 12 and the top of 13.

TODO RE 5, 6: explain how this is an example of being stubborn and sticking to a simple inaccuracy measure; how this simplification goes too far and how it fits into the structure

  8. The part that is really key here is this:

    '"we should instead utilize a set of n potential true probability hypotheses (ideally, going continuous, but we’re working with a discrete grid of n = 1000 possible coin biases in this paper). We then compute all the inaccuracies with respect to each of these𝑛values represented by “omniscient” distributions (or true chance hypotheses) and determine the expected inaccuracy scores using the entire distributions rather than relying solely on the expected values of the distributions"

This paragraph...I do not fully understand. I have read it a few times, and I am getting a sense of where you are going and it seems right, but can you give an example of how this works? Just give the intuitions. A few lines should be enough.

TODO: Nikodem.:P

  9. More specific question on the same paragraph. Since we are measuring the expected inaccuracy of p relative to q, what is our p and what is our q here? You say there are 1000 possible coin biases. Will these 1000 biases take the place of the true/false w (=1 or 0) in the definition of expected inaccuracy at the bottom of page 10? What would q(w) be in this new approach? Would that be p(w), since we are measuring the expected inaccuracy of p relative to p? So we are just looking at the probabilities that p itself assigns to the various possible 1000 biases? I am confused... I think this just needs to be explained more clearly.

TODO: NIKODEM example one-pager for MARCELLO, first run by RAFAL

marcellodibello commented 8 months ago

Added further comments in the margins to the accuracy section. The section has improved but I think it is difficult for the reader to follow what is going on. I read the proof in the appendix and it seems much clearer and to the point.

Looks like the accuracy section should show that:

1) there is a proper scoring rule for higher order probabilism. (This is done in the appendix quite well, but it is not clear to me what the section really does, except showing that there is a mistaken way to think of a scoring rule that is not proper. Maybe this is an objection, but why should it cover the whole section?)

2) the other task of the section -- which seems important but I do not know where this is done -- is to show that accuracy sometimes recommends an imprecise credence. This accuracy point is missing. Where is it discussed?

Another point:

3) I think the section on accuracy should proceed as follows. First, outline the scoring rule for higher order probabilism using the KL divergence, define expected inaccuracy, and then sketch why this scoring rule is proper. Use the coin example from Schoenfield to illustrate why the rule is proper, using the numbers in the table. Next, explore an alternative way to define the scoring rule---one that assumes only two outcomes---and show that this is improper. Currently, the improper scoring rule takes over the whole section and it is confusing for the reader why they should engage with something that is ultimately wrong.

You could start with the improper scoring rule but you need to make it more compelling to the reader why they should take it seriously.

marcellodibello commented 8 months ago

Reviewed the section on evidence aggregation; it seems more or less fine to me, although it might need some more work.

there are two things that need addressing:

1) at the start we say that going higher order is recommended by accuracy considerations, but we never say why that is in other parts of the paper. So this needs filling in.

2) towards the end -- see note in the margin -- the section says that aggregating evidence using precise probabilism will either overestimate or underestimate the value of the evidence. This seems right, but we do not really have a strong argument to show that. I was wondering: if aggregating two items of evidence using their precise LRs leads to a certain combined LR, does aggregating the distributions and then taking the mean of the joint distribution lead to a different LR, or is it the same LR in both cases? Anyway, to be discussed.

rfl-urbaniak commented 8 months ago

Reviewed the section on evidence aggregation; it seems more or less fine to me, although it might need some more work.

there are two things that need addressing:

1. at the start we say that going higher order is recommended by accuracy considerations, but we never say why that is in other parts of the paper. So this needs filling in.

Hm, perhaps we should rephrase: it is recommended by its being the proper way to make the framework evidence-responsive and honest about uncertainty. E.g. in cases where the median is high enough but uncertainty is wide (such as Sally Clark, if I remember well), if we were to be guided by point estimates, we would convict, but clearly this is not sufficiently evidence-responsive.

2. towards the end -- see note in the margin -- the section says that aggregating evidence using precise probabilism will either overestimate or underestimate the value of the evidence. This seems right, but we do not really have a strong argument to show that. I was wondering: if aggregating two items of evidence using their precise LRs leads to a certain combined LR, does aggregating the distributions and then taking the mean of the joint distribution lead to a different LR, or is it the same LR in both cases? Anyway, to be discussed.

I definitely do not want to make the paper even more convoluted by some additional LR-in-multiple-approaches calculations. I wonder if what I said in reply to your previous comment couldn't be made more visible.

rfl-urbaniak commented 8 months ago

Read the aggregation section.

marcellodibello commented 7 months ago

Made significant revisions to section 6 on evidence aggregation (simple case). The section seems to be fine, more or less; perhaps it needs to be shortened. But I have two requests:

  1. I think we need a general formula that shows how two items of independent match evidence (and their higher-order uncertainties) are aggregated. In precise probabilism, you multiply the LRs; in higher order probabilism, what do you do? We have a figure that shows a joint higher order distribution, but what is the formula that gives that distribution? Just multiply LRs whose terms are densities and not precise probabilities? Is that it?

  2. I want to know whether, for evidence assessment, higher order probabilism and precise probabilism will come into conflict. Is it the case that the combined LR of two items of match evidence (LR1 x LR2) in precise probabilism is different from the mean of the joint density of the two matches?
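
On request 2, a small Monte Carlo sketch (all numbers and distributions made up; not the paper's code) of when the two quantities can come apart: if the higher-order uncertainties about the two LRs are independent, the product of the means matches the mean of the product; if they share a source of uncertainty, the two diverge:

```r
set.seed(1)
n <- 1e5
lr1 <- rlnorm(n, meanlog = log(10), sdlog = 0.5)   # uncertainty about the LR of item 1
lr2 <- rlnorm(n, meanlog = log(20), sdlog = 0.5)   # uncertainty about the LR of item 2

mean(lr1) * mean(lr2)   # precise aggregation: product of the two point LRs (the means)
mean(lr1 * lr2)         # mean of the combined LR; the same up to noise when independent

# With a shared source of uncertainty the two quantities come apart
common <- rnorm(n, 0, 0.5)
lr1c <- exp(log(10) + common)
lr2c <- exp(log(20) + common)
mean(lr1c) * mean(lr2c)
mean(lr1c * lr2c)       # larger: multiplying point LRs understates the combined value
```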

I still need to work on the final section 7: evidence aggregation (complex case).

marcellodibello commented 7 months ago

I am looking at the accuracy section and the appendix. Before I revise, there are some conceptual issues I need to get clear on. The section is still hard to follow. I can make the revisions, but I need to understand what is going on with the argument.

  1. First a brief recap: the inaccuracy measure being proposed reduces (p. 27, appendix) to log(1/p_k), i.e. -log(p_k), where p_k is the value that the higher-order probability measure p assigns to the true outcome, say the probability value assigned to the true chance hypothesis or bias of the coin. If the true bias is 0.7, then p_k would be the probability assigned to the bias being 0.7. So, if p1 assigns probability 0.2 to the true bias and another measure p2 assigns a higher probability to the true bias, then p2 would be less inaccurate than p1. This should all be straightforward, but let me know if I am wrong.

  2. The account in point 1 rests on a crucial assumption, namely that the true bias (or more generally, the true chance hypothesis) must have a probability of one. In other words, the true distribution is assumed to be a point distribution. This is a necessary step in the proof in the appendix (see middle of page 27) that allows the divergence measure to be reduced to just -log(p_k). But what if the true generating mechanism is actually a bimodal distribution, or a distribution that does not peak at a certain value? So suppose t(x) is the true distribution over certain chance hypotheses (not a point distribution, but multimodal) and p(x) is the evidence-based higher-order distribution available to us about the same range of chance hypotheses (also possibly not a point distribution). In this case, I do not think that the inaccuracy of p(x) would reduce to -log(p_k), simply because there is no single true outcome to begin with. So I am wondering about the justification of the assumption that the true distribution must be pointed.

  3. I am also hesitant about the proposed inaccuracy measure of -log(p_k). Suppose one higher-order distribution p1 assigns 0.7 to the true chance hypothesis (say, the coin is fair) and the rest of its mass is on another chance hypothesis (say, the coin comes up only heads). Compare this with another higher-order distribution p2 which also assigns 0.7 to the true chance hypothesis, but whose mass spreads evenly everywhere else. The distributions p1 and p2 are quite different, yet -log(p1_k) and -log(p2_k) would be the same, right? Isn't this counterintuitive? After all, p1 and p2 have different shapes.

  4. In the main section, towards the end, there is this sentence:

"To make sure that this favorable outcome isn’t due to not using pointed credences, we can redo the calculations using the pointed version. In the pointed version, all the focus is on 0.4, or the weight is evenly divided between 0.3 and 0.5, or between 0.2 and 0.6. As anticipated, when we consider inaccuracy, both of these setups recommend the bimodal version (Table4)"

I do not understand what is going on here. What is the "pointed version"? As I understand it, you first calculated the inaccuracy score of the faithful bimodal distribution relative to two true chance hypotheses, H3 for which the true chance is .3 and H5 for which the true chance is .5, right? Given either of those, the faithful bimodal is more accurate. What then is the stuff about the "pointed version"? What does that mean?

marcellodibello commented 7 months ago

Also, I think there is something odd in the proof in the appendix, page 27. The probability p_k is sometimes used as a fixed value, and sometimes as a variable. The quantifiers are not completely clear to me. The formula for the entropy usually is - \sum p_i log(p_i), because you are summing over all the possible values of p_i. If you write instead - \sum p_i log(p_k), it seems as though p_k is a fixed value, and then the formula would reduce to just -log(p_k), which I don't think is what you want. So maybe the notation has to be made clearer.

marcellodibello commented 6 months ago

I have been giving some thought to the accuracy section. I have a few worries about the propriety of the higher-order scoring rule that is defended in the section:

  1. I am not sure in what way the proposed scoring rule---which boils down to -log(p) (see my earlier comments)---is a "higher-order" scoring rule. Suppose I assign a .7 probability to the outcome rain and it does in fact rain. Then, I could take -log(.7) to be my score. This would not be a higher-order score. Or I could take the Brier score (1-.7)^2, again not a higher-order score. And both scores would be proper, right?

  2. Perhaps -log(p) can become a higher-order score depending on the particular p we select. We can proceed as follows. First, determine the true chance hypothesis or true probability, call it \theta. Next, see what probability is assigned to \theta; call this probability p_theta. Finally, take -log(p_theta) as the score. So the probability p_theta is actually a higher-order probability, because it is the probability assigned to \theta (basically, our first-order probability). So is -log(p_theta) higher order in this sense? And is it correct to say that any other -log(p) score will necessarily not be a higher-order score if, for example, p is the probability assigned to the first-order outcome "rain"?

  3. If point 2 is correct, I wonder why we cannot use the Brier score as a higher-order score as well. Say \theta is the true probability, the true state of the world to which we assign value 1. And let p_theta be the probability we assign to this true state of the world \theta. Then, we take (1-p_theta)^2 to be our score. Would that count as a higher-order Brier score? I don't see why not, or am I missing something? Would (1-p_theta)^2 be a proper score? It seems it would, right? So what, exactly, are we gaining by using -log(p_theta)? (See the numerical sketch after this list.)

  4. One of the requirements in the literature is that an accuracy score satisfies extensionality, namely, that the score is a function of two things: first, the true state of the world, and second, the belief state/credences. Clearly, both the Brier score and the log-style score do satisfy this condition. But if we use them as higher-order scores, they do not seem to satisfy this condition any longer, because they no longer depend on the (first-order) state of the world. They only depend on the parameter or true probability \theta. Now, one might argue that \theta is also a true state of the world (the true chance hypothesis or something like that), but if that is true, then in what way are the scores we are using higher-order scores? Perhaps what we are doing is simply dropping extensionality?

  5. How does the proposed higher-order score relate to the first-order score? Do we need to combine the two to assess the accuracy of the agent's credence, or will the higher-order score be enough? Does the first-order score fall out of the second-order score, or are the two independent?

  6. I am not clear about the claim that imprecise scores are improper. The Seidenfeld et al. paper where this claim is made only argues that imprecise scores using intervals are not proper, but the paper also proposes an alternative, lexicographic imprecise score that is proper. It seems that other papers published after that relax certain assumptions and also provide proper scores for imprecise probabilities. So I am worried that reviewers will object to the claim about the impossibility of proper scores for imprecise probabilities---that claim just isn't true because it seems too generic.
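
Regarding points 2 and 3 above, for what it's worth, a quick numerical check (my sketch, not the paper's argument): with more than two chance hypotheses, scoring only the probability assigned to the true hypothesis behaves differently under the quadratic rule and the log rule.

```r
# Is reporting your true credence optimal in expectation under (i) the "local" quadratic
# score (1 - p_theta)^2 and (ii) the log score -log(p_theta)? Three chance hypotheses.
q <- c(0.5, 0.3, 0.2)                        # the agent's actual higher-order credence

expected_score <- function(p, score) sum(q * score(p))
quad <- function(p) (1 - p)^2                # scores only the probability given to the truth
logs <- function(p) -log(p)

set.seed(1)                                  # random search over the 3-simplex
cand <- matrix(rexp(3 * 1e5), ncol = 3)
cand <- cand / rowSums(cand)

round(cand[which.min(apply(cand, 1, expected_score, score = quad)), ], 2)  # not q
round(cand[which.min(apply(cand, 1, expected_score, score = logs)), ], 2)  # roughly q itself
```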

Niklewa commented 6 months ago

For the interval 0.015 - 0.0317:

"Posterior interval (starting with 1:1 prior odds) was initially `r (1/.015)/(1+1/.015) - (1/.037)/(1+1/.037)`"

marcellodibello commented 6 months ago

Revised the entire accuracy section, made substantive changes. Things to look for:

marcellodibello commented 6 months ago

Made revisions all throughout the imprecision paper. Substantive revisions to the section on evidence aggregation. Getting close to a final version.

rfl-urbaniak commented 6 months ago

The March comments:

https://www.dropbox.com/scl/fi/2dx94mz4rd3vyw8at41h5/imp_philosophical_march2024.pdf?rlkey=i2ncx8m3lrax8m6s7hj62x19l&dl=0

Niklewa commented 6 months ago

I have conducted a small experiment to assess whether, in a coin scenario, the bimodal distribution is indeed the most profitable choice. I simulated a bag of coins and sampled 1000 coins from it, measuring the distances between each coin's bias and the candidate distributions. It turns out that the bimodal distribution wins in two settings, while the wide bimodal wins in the other two. The code is in a separate R script named practicallExperiment.R, located in the folder with the Quarto files. The matrix of successes (where a distribution has the lowest inaccuracy) looks like this:

Method bimodal bimodalWide centered
cvm 545 155 300
kld 0 1000 0
cvm_cumulative 10 990 0
kld_cumulative 533 22 445
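
A rough sketch of the kind of simulation described above (the actual code is in practicallExperiment.R; the grid, candidate distributions and the two scores below are illustrative stand-ins):

```r
set.seed(42)
grid <- seq(0.01, 0.99, by = 0.01)
normalize <- function(w) w / sum(w)

cands <- list(
  bimodal     = normalize(dnorm(grid, 0.3, 0.03) + dnorm(grid, 0.5, 0.03)),
  bimodalWide = normalize(dnorm(grid, 0.3, 0.10) + dnorm(grid, 0.5, 0.10)),
  centered    = normalize(dnorm(grid, 0.4, 0.05))
)

kld <- function(p, bias) -log2(p[which.min(abs(grid - bias))])          # KL to the point truth
cvm <- function(p, bias) sum((cumsum(p) - as.numeric(grid >= bias))^2)  # CRPS/CvM-style, on CDFs

biases <- sample(c(0.3, 0.5), 1000, replace = TRUE)   # the "bag of coins"

wins <- function(score) {
  s <- sapply(cands, function(p) sapply(biases, function(b) score(p, b)))
  table(factor(names(cands)[apply(s, 1, which.min)], levels = names(cands)))
}
wins(kld)   # how often each candidate has the lowest inaccuracy under each score
wins(cvm)
```
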
marcellodibello commented 6 months ago

Small changes made to the Sarah Moss discussion of belief inertia.

marcellodibello commented 6 months ago

Was thinking about simplifying further the final section on multiple events and evidence aggregation. I started to play around with things. What we should do perhaps is this:

* discuss the joint probability of the event "coin A AND coin B come up, say, heads" and ask this question for different coins (fair coin, even bias coin and uneven bias coin and all possible combinations).

* show that precise, imprecise and higher order probabilism will give different assessments of the joint probabilities and show that higher order probabilism gives better assessments

* this way there will be continuity with coin examples of the earlier section

PRELIMINARY PROBLEM: started to play around with things, but this is what I found. Odd outputs of R code computations for the coin densities for fair, even and uneven coins:

mean(TwoUnbalancedDensity) [1] 0.000999001

mean(TwoDensity) [1] 0.000999001

What seems strange to me is that the mean of the densities is 0.000999 and should be 0.5, no? Am I missing something?

We do get some clear divergence (in expectation) between precise and higher order probabilism, as this below shows:

mean(TwoDensity)*mean(TwoDensity) [1] 9.98003e-07

mean(TwoDensity*TwoDensity) [1] 0.0004995005

mean(TwoUnbalancedDensity)*mean(TwoDensity) [1] 9.98003e-07

mean(TwoUnbalancedDensity*TwoDensity) [1] 0.0004995005

marcellodibello commented 6 months ago

I have conducted a small experiment to assess whether, in a coin scenario, the bimodal distribution is indeed the most profitable choice. I simulated a bag of coins and sampled 1000 coins from it, measuring the distances between each coin's bias and the candidate distributions. It turns out that the bimodal distribution wins in two settings, while the wide bimodal wins in the other two.

This is interesting -- what are we supposed to make of this result? Is it an objection against higher-order probabilism?

marcellodibello commented 5 months ago

@Niklewa -- I do not see changes in the notation in the accuracy section, only minor changes in the appendix. Maybe you forgot to push the changes you made?

rfl-urbaniak commented 5 months ago

Was thinking about simplifying further the final section on multiple events and evidence aggregation. I started to play around with things. What we should do perhaps is this:

* discuss the joint probability of the event "coin A AND coin B come up, say, heads" and ask this question for different coins (fair coin, even bias coin and uneven bias coin and all possible combinations).

* show that precise, imprecise and higher order probabilism will give different assessments of the joint probabilities and show that higher order probabilism gives better assessments

* this way there will be continuity with coin examples of the earlier section

PRELIMINARY PROBLEM: started to play around with things, but this is what I found. Odd outputs of R code computations for the coin densities for fair, even and uneven coins:

mean(TwoUnbalancedDensity) [1] 0.000999001

mean(TwoDensity) [1] 0.000999001

What seems strange to me is that the mean of the densities is 0.000999 and should be 0.5, no? Am I missing something?

We do get some clear divergence (in expectation) between precise and higher order probabilism, as this below shows:

mean(TwoDensity)*mean(TwoDensity) [1] 9.98003e-07

mean(TwoDensity*TwoDensity) [1] 0.0004995005

mean(TwoUnbalancedDensity)*mean(TwoDensity) [1] 9.98003e-07

mean(TwoUnbalancedDensity*TwoDensity) [1] 0.0004995005

I think restricting to coins makes this somehow less engaging and somehow less realistic, which I don't like. As for your calculations, I can't tell from the comment how exactly these were calculated. Happy to go over the code and discuss tomorrow.
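
One possible source of the puzzling 0.000999001 above (an assumption on my part about how TwoDensity is built): if it is a probability vector normalized over a grid of 1001 bias values, then mean() of that vector is 1/1001 = 0.000999001 no matter what the coin is; the distribution's own mean needs the grid values as weights. A minimal sketch:

```r
grid    <- seq(0, 1, length.out = 1001)   # 1001 bias values
density <- dnorm(grid, mean = 0.5, sd = 0.1)
density <- density / sum(density)         # normalized probability vector over the grid

mean(density)         # 1/1001 = 0.000999001, for any normalized vector of this length
sum(grid * density)   # the mean of the distribution itself, here roughly 0.5
```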

rfl-urbaniak commented 5 months ago

The bits needed to revise the appendix:

\noindent Now, let's think about expected values. First, what is the inaccuracy of a distribution $q$ as expected by $p$, $\mathit{EI}_{\text{KL}}(p,q)$?

\begin{align}
\mathit{EI}_{\text{KL}}(p,q) & = \sum_{k=1}^n p_k \, I_{\text{KL}}(q, \theta_k) \\
& = \sum_{k=1}^n p_k \sum_{i=1}^n Ind^k_i \left( \log_2 Ind^k_i - \log_2 q_i \right) \\
& = \sum_{k=1}^n p_k \, Ind^k_k \left( \log_2 Ind^k_k - \log_2 q_k \right) \\
& = \sum_{k=1}^n p_k \left( - \log_2 q_k \right) \\
& = - \sum_{k=1}^n p_k \log_2 q_k = H(p,q)
\end{align}

Now what happens if $p$ is $q$?

\begin{align}
\mathit{EI}_{\text{KL}}(p,p) & = - \sum_{k=1}^n p_k \log_2 p_k = H(p)
\end{align}

rfl-urbaniak commented 5 months ago

the higher-order scoring rule we propose is based on a well-known measure of divergence between probability distributions, the Kullback-Leibler (KL) divergence, which is defined as follows: $$ D_{\text{KL}}(p \,||\, q) = \sum_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right) $$

\noindent where $x$ ranges over all hypotheses under consideration (i.e. the elements of the sample space). This is a standard information-theoretic measure of divergence of $p$ from $q$ from the perspective of $p$.\footnote{In the continuous case, we would need to use the so-called differential KL divergence.} In our particular case, we will be dealing with a finite array of evenly spaced hypotheses $\theta_1, \dots, \theta_n$ about what the true chance is (this is our discretization). Each particular hypothesis $\theta_k \in [0, 1]$ is associated with an omniscient distribution tracking the true state of the world (i.e. the true chance being $\theta_k$). We denote it by $Ind^k(\cdot)$ so that $Ind^k(\theta_i)$ is 1 if $i=k$ and $0$ otherwise. We write $Ind^k_i$ instead of $Ind^k(\theta_i)$.

With this notation we can now formulate our notion of inaccuracy of $q$ if the true hypothesis is $\theta_k$. It's the KL divergence between $q$ and $Ind^k$.

\begin{align}
I_{\text{KL}}(q, \theta_k) = D_{\text{KL}}(Ind^k \,||\, q)
\end{align}
\noindent As shown in the appendix, this boils down to $-\log_2 q_k$.

For instance, imagine the possible outcomes being the chance hypotheses $\theta_1, \theta_2, \dots, \theta_n$ about the true bias of a coin. If, for example, the true bias of the coin is $.6$ and the higher-order distribution $p$ assigns $.8$ to this bias, the higher-order inaccuracy score of $p$ would be $-\log .8$. Notice that, on this approach, two distributions $p$ and $q$ which assign the same probability to the true chance hypothesis will have the same inaccuracy score even though they might differ in the probabilities they assign to other chance hypotheses. So the shape of the distribution does not matter for the inaccuracy score; it does matter for expected inaccuracy, as we will soon see.

We will now give a higher-level description of the proof of strict propriety of the scoring rule $I_{KL}$. The first step is to define the score's expected inaccuracy, as follows:

\begin{align}
\mathit{EI}_{\text{KL}}(p,q) & = \sum_{k=1}^n p_k \, I_{\text{KL}}(q, \theta_k)
\end{align}
\noindent As shown in the appendix, this is identical to:
\begin{align}
& = - \sum_{k=1}^n p_k \log_2 q_k = H(p,q)
\end{align}

\noindent where $H(p,q)$ stands for $\cdots$

In particular, if $p$ is $q$, this further boils down to:
\begin{align}
& = - \sum_{k=1}^n p_k \log_2 p_k = H(p)
\end{align}

\noindent where $H(p)$ stands for $\cdots$ The only missing part now is Gibbs' inequality, according to which $H(p,q) \geq H(p)$, with identity holding only if $p=q$ (we elaborate on why this holds in the appendix as well).
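
A numerical sanity check of the derivation above (a sketch; the grid and the two beta-shaped higher-order distributions are just illustrative):

```r
n     <- 99
theta <- seq(0.01, 0.99, length.out = n)   # discretized chance hypotheses
normalize <- function(w) w / sum(w)

p <- normalize(dbeta(theta, 2, 5))         # the agent's higher-order distribution
q <- normalize(dbeta(theta, 3, 3))         # a rival higher-order distribution

I_KL <- function(q, k) -log2(q[k])         # inaccuracy of q if theta_k is the true chance
EI   <- function(p, q) sum(p * sapply(seq_along(q), function(k) I_KL(q, k)))  # expected inaccuracy
H    <- function(p) -sum(p * log2(p))      # Shannon entropy

all.equal(EI(p, q), -sum(p * log2(q)))     # cross-entropy H(p, q), as in the derivation
all.equal(EI(p, p), H(p))                  # expected inaccuracy of p by its own lights
EI(p, q) >= EI(p, p)                       # Gibbs' inequality; equality only when q = p
```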

rfl-urbaniak commented 5 months ago

\theta_1, \dots, \theta_n

Ind^k(\cdot) (takes \theta_i)

Ind^k_i [float] = Ind^k(\theta_i)

p_1, \dots, p_n

p(\theta_1), \dots, p(\theta_n)

I_KL(p, \theta_k) = D_KL(Ind^k || p)

EI_KL(p,q)

Niklewa commented 5 months ago

I have made some minor revisions to the accuracy section that we talked about, @marcellodibello.

Additionally, I have better described and cleaned up the experiment that I designed. It would be great if @rfl-urbaniak could take a look at it to check if it makes sense. The file is named practicalExperiment.R and can be found in the quarto paper folder.

Niklewa commented 4 months ago

I have completed those small changes to the paper that we agreed upon yesterday @marcellodibello

marcellodibello commented 4 months ago

Continued working on the accuracy section; made substantive changes to organization and structure to improve clarity, notation and exposition. Still working on it, but making good progress.

marcellodibello commented 4 months ago

After working full time on the paper over the past two days, I completed the revisions of the higher order probabilism paper.

Made significant changes to the accuracy section and changes to other sections.

I am generally happy with the current version, except for the section about the Bayesian network, which still feels a bit off, but I am not sure how to improve it.

Please @rfl-urbaniak and @Niklewa take a look at the entire paper.

There are a few notes in the margin that need addressing, but not many.