How to handle errors - Githubissues

ryanvarley commented 11 years ago

The uncertainties quoted in the literature can be in several formats

< 0.7
0.7 +- 0.1
0.7 + 0.1 - 0.05
0.7889 +- 12

The issue is how to handle these

Currently I'm treating them as follows:

< 0.7 is err='0, -0.7'
0.7 +- 0.1 is err='0.1'
0.7 + 0.1 - 0.05 is err='0.1, -0.05'
0.7889 +- 12 is err='0.0012'

The method used should be simple but, most importantly, unambiguous.

In case 2, is this the best way to handle +-, or would '0.1, -0.1' be better?
Should positive values have a + for clarity (possibly having '+0.1, -0.05' for double values and '0.1' for +-)?
Should errors be quoted in the short '12' format, which is easier, or the less ambiguous '0.0012'?

Any discussion is welcome
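
To make the four cases concrete, here is a minimal Python sketch that parses the literature-style strings listed above into a value, a plus error, a minus error and an upper-limit flag. The function names are hypothetical, and the rule that an integer error after a decimal value is shorthand for the last digits (so '12' after '0.7889' means '0.0012') is an assumption, which is exactly the ambiguity raised above.

```python
import re
from decimal import Decimal

NUM = r"(\d+(?:\.\d+)?)"  # a plain non-negative decimal number


def expand_error(value, err):
    """Expand shorthand such as '12' after '0.7889' to 0.0012 (assumed convention)."""
    if "." in err or "." not in value:
        return Decimal(err)
    exponent = Decimal(value).as_tuple().exponent  # -4 for '0.7889'
    return Decimal(err) * Decimal(10) ** exponent


def parse_quoted(text):
    """Return (value, err_plus, err_minus, is_upper_limit) for one quoted value."""
    text = text.strip()
    m = re.fullmatch(r"<\s*" + NUM, text)  # upper limit: '< 0.7'
    if m:
        return Decimal(m.group(1)), None, None, True
    m = re.fullmatch(NUM + r"\s*\+-\s*" + NUM, text)  # symmetric: '0.7 +- 0.1'
    if m:
        err = expand_error(m.group(1), m.group(2))
        return Decimal(m.group(1)), err, err, False
    m = re.fullmatch(NUM + r"\s*\+\s*" + NUM + r"\s*-\s*" + NUM, text)  # asymmetric
    if m:
        value = m.group(1)
        return (Decimal(value), expand_error(value, m.group(2)),
                expand_error(value, m.group(3)), False)
    raise ValueError(f"unrecognised format: {text!r}")
```

Using Decimal rather than float keeps the quoted digits exactly as they appear in the paper.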

hannorein commented 11 years ago

I've been thinking about this quite a bit. I think your solution is actually pretty close to what I've been wanting to do. A couple of remarks:

It's even more complicated if one wants to include different solutions (i.e. from different groups or from different fitting routines). I think ignoring this for now is ok.
Not to mention if one wants to include the results of an MCMC routine. One would ideally want to include the posterior distribution. But that is obviously overkill.
Your first example is problematic if it is a lower limit.
I wonder if it's better to quote the value plus the uncertainty (limits), rather than the uncertainty itself. But see next point.
It might be better to have two attributes for lower and upper limits. Something like: 1.0. If a value is just a lower limit and no upper limit is known, one could just write 1.1. If the upper and lower limits are the same, one could write them as an uncertainty 1.0.
In general, I would write "err" as "error". I don't think saving a few letters is worth the loss of readability. It shouldn't matter with respect to filesize when the XML files are compressed.

ryanvarley commented 11 years ago

I'm trying to use the 'most correct' or accurate version for the catalogue.

Upper and lower limits are good, but they involve more effort for the person entering the data, who has to calculate the values above and below rather than enter them directly from the paper, unless a simple input tool were developed to take this sort of input and output XML.

I agree the first solution is problematic (and it is the main reason for this issue). I've mainly been thinking about the entry from a code point of view (I have a Python package that's nearly ready, which loads the catalogue into classes and includes some more advanced calculations).

Another suggestion is keeping the format as I suggested but using < > to designate the limits in the error, or even an upperlimit='true'.

Should we use uncertainty as the main tag instead of error or err?

hannorein commented 11 years ago

One problem with using something like error='0.1, -0.05' is that it is not trivial to parse with most libraries because it mixes XML with a comma-separated list. I think whatever format we choose, it should allow one to simply query the uncertainty of a value using XML alone.
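
As a rough illustration of the parsing point, the sketch below reads the same uncertainty from both styles with Python's standard xml.etree.ElementTree. The mass and planet tag names and the errorplus/errorminus attribute names are assumptions chosen for the example, not a settled format.

```python
import xml.etree.ElementTree as ET

# Combined attribute: the XML parser only returns a string, so a second,
# hand-written parsing step (split + float) is needed.
combined = ET.fromstring("<mass error='0.1, -0.05'>0.7</mass>")
plus, minus = (float(part) for part in combined.get("error").split(","))

# Separate attributes (hypothetical names): each value is one XML lookup.
separate = ET.fromstring(
    "<planet><mass errorplus='0.1' errorminus='0.05'>0.7</mass></planet>"
)
mass = separate.find("mass")
errorplus = float(mass.get("errorplus"))
errorminus = float(mass.get("errorminus"))

# The attribute can also be queried with an XPath-style predicate alone.
with_plus_error = separate.findall("mass[@errorplus]")
```

With separate attributes, a query such as mass[@errorplus] works in any XPath-capable tool without extra string handling.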

I wouldn't worry too much about how much work it is to enter the data. As you mention, one can just enter the data in the most convenient format and then run a script that converts it to the "standardized" format.

So let's see what the options are:

1

1.0,

1.0 (symmetric errors)

1.0 (only upper/lower limit known) I'm actually not sure if this is valid XML syntax, as > is a special character.

2

1.0

1.0 (symmetric errors)

1.0 (only upper/lower limit known)

3

1.0

1.0 (symmetric errors)

1.0 (only upper/lower limit known)

It's clearly a matter of taste. But I prefer 3. I think it is very easy to read (both by humans and by a machine) and it's unambiguous. However, I agree that number 1 is easier to enter. But you could still enter it that way and just write a 5-line Python script to convert it to format 3. What do you think?

hannorein commented 11 years ago

Or:

4

1.0

1.0 (symmetric errors)

1.0 (only upper/lower limit known)

ryanvarley commented 11 years ago

I like the idea of 3; data entry is harder but it's certainly cleaner. However, I don't like the difference in syntax between symmetric and non-symmetric uncertainties. I also think turning non-symmetric uncertainties into upper and lower limits makes it harder to spot data entry errors, as they are never immediately obvious.

Another possibility:

1.0

1.0 (symmetric errors)

but it wouldn't handle upper limits well.

We could also do 3 in a similar fashion, eliminating another tag:

1.0

1.0 (symmetric errors)

1.0 (only upper/lower limit known)

Whilst I don't like the idea of multiple tags, having both uncertainty and upperlimit does solve the problems of data entry and validation for many values. I still think I'd prefer fewer tags though; I'd much rather pull them in with one tag than with (upperlimit and lowerlimit) or uncertainty.

My previously mentioned code could solve some of these problems on the code end, but I think having more tags makes things less universally accessible.

Overall I'm still unsure.

hannorein commented 11 years ago

With regard to your first comment. I agree, it's not nice to have this additional layer of complexity in the syntax by having three attributes error, errorplus and errorminus. In fact I had just made a test entry of KOI-200 and thought the same. Again, one could simply write a script that takes error="0.1" as an input and outputs errorplus="0.1" errorminus="0.1" to make everything consistent while keeping the entry as simple as possible. Or the other way around, going back to the error="0.1" format if the two error bars are found to be equal. Hm. I'm unsure. Let me ask another colleague of mine for his feelings (I think it's mainly down to feelings rather than anything else now).

I guess if one wants to be really precise, one should distinguish somehow between errors and limits. Because they are not really the same. One is a detection, one is a non-detection. A tag such as 1.0 might imply the mass has been measured. But we really only know the upper limit, thus something like might make more sense. I'm not sure what's better here but I have a tendency towards the second one. And yes, that would imply using two more attributes upperlimit and lowerlimit :-1:.

ryanvarley commented 11 years ago

I like the second one as well; this way we have a new tag, but it distinguishes between errors and limits, which is useful. I'll also pull in a colleague for more input.

mamartinod commented 11 years ago

Hi,

I think 5 is the easiest way to get the error bars in a program. You
always have two variables instead of three or more if you use "uncertainty"
etc. Unfortunately, it is a heavy approach, but whichever way is chosen, there
will always be a difficult or heavy part somewhere in the chain (XML, code, usage).

Cheers,

Marc-Antoine Martinod

hannorein commented 11 years ago

I talked to Dave Spiegel about it. He brought up another issue: different people define error bars in different ways (half-width of a Gaussian distribution, dispersion of the posterior distribution, etc). Ideally, a flag would indicate which definition was used in the paper. But maybe this is going too far for the moment.

The good thing is that we seem to agree upon the basic syntax:

For normal error bars: 1.0. For limits: .
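
A reader for this kind of syntax could look roughly like the sketch below. It assumes a mass element and the attribute names errorminus, errorplus, upperlimit and lowerlimit mentioned in this thread; the exact element layout and the read_quantity helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def _optional_float(text):
    """Convert an attribute string to float, passing None through."""
    return None if text is None else float(text)


def read_quantity(element):
    """Classify an element as a measurement (with error bars) or a limit."""
    if element.text and element.text.strip():
        # A value in the element body is treated as a detection.
        return {
            "kind": "measurement",
            "value": float(element.text),
            "errorplus": _optional_float(element.get("errorplus")),
            "errorminus": _optional_float(element.get("errorminus")),
        }
    # No body: only limits are known (a non-detection).
    limits = {
        name: float(element.get(name))
        for name in ("upperlimit", "lowerlimit")
        if element.get(name) is not None
    }
    if limits:
        return {"kind": "limit", **limits}
    raise ValueError("element has neither a value nor a limit")


measured = read_quantity(ET.fromstring('<mass errorminus="0.1" errorplus="0.2">1.0</mass>'))
limited = read_quantity(ET.fromstring('<mass upperlimit="1.0" />'))
```

Keeping limits out of the element body means code that only reads the value can never mistake a non-detection for a measurement.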

I added 5 lines of code to the simple cleanup script (currently on a separate branch) that allows you to enter the error as error="1.0" if it is symmetric. It's then converted to errorminus and errorplus attributes.
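
The conversion step could be sketched as follows. This is not the actual cleanup script, just a minimal stand-in using the standard library; the function name and the sample planet/mass document are hypothetical.

```python
import xml.etree.ElementTree as ET


def expand_symmetric_errors(root):
    """Replace every error="x" attribute with errorminus="x" errorplus="x"."""
    for element in root.iter():
        error = element.attrib.pop("error", None)  # remove the shorthand if present
        if error is not None:
            element.set("errorminus", error)
            element.set("errorplus", error)
    return root


# Hypothetical input as it might be typed in by hand.
root = ET.fromstring('<planet><mass error="1.0">2.0</mass></planet>')
expand_symmetric_errors(root)
converted = ET.tostring(root, encoding="unicode")
```

Elements without an error attribute are left untouched, so the function can be run repeatedly on the same file.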

Let me know if you agree and if you think that's it or if there's more to talk about!

ryanvarley commented 11 years ago

Yes, I think for now the way people define errors isn't for the catalogue to judge, and it should just report the best values with the errors given in that paper.

I'll clean up this branch soon with our new standard and edit the wiki.

Should we keep working on this in this repo or on your branch? We are also adding transittime and logg to planets in this branch.

hannorein commented 11 years ago

I'm very interested in having the transittime and logg data in my repository too. So, yes, please continue sending me pull requests!

Two questions:

You put the transit time underneath the tag. But shouldn't it be under the tag? In other words, different planets in the same system will have different transit times.
What's your motivation for including the logg parameter? Given mass and radius, can I not calculate it?
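
For reference, the calculation alluded to in the second question is g = G M / R^2, with logg conventionally quoted as log10 of g in cgs units (cm/s^2). A minimal sketch, assuming mass and radius are given in Jupiter units and using approximate constants; the function name is hypothetical:

```python
import math

G_CGS = 6.674e-8     # gravitational constant in cm^3 g^-1 s^-2
M_JUP_G = 1.898e30   # Jupiter mass in grams (approximate)
R_JUP_CM = 7.1492e9  # Jupiter equatorial radius in cm (approximate)


def logg_from_mass_radius(mass_mjup, radius_rjup):
    """Return log10 of the surface gravity in cgs units."""
    mass_g = mass_mjup * M_JUP_G
    radius_cm = radius_rjup * R_JUP_CM
    surface_gravity = G_CGS * mass_g / radius_cm ** 2  # cm/s^2
    return math.log10(surface_gravity)


jupiter_logg = logg_from_mass_radius(1.0, 1.0)  # roughly 3.39
```

A derived value like this inherits the uncertainties of the mass and radius, whereas a logg quoted in a paper may come from an independent fit.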

ryanvarley commented 11 years ago

That was an unintentional error affecting most (but not all) of the update targets!
You are of course correct - I'll remove this as well.

And this issue is closed :-)

ryanvarley / open_exoplanet_catalogue

How to handle errors #2