poseidon-framework / poseidon-schema

An archaeogenetic genotype data organisation file format

Country could be specified more specifically #40

Closed xrotwang closed 1 year ago

xrotwang commented 3 years ago

The definition for Country still seems somewhat unspecific (there are many ways to write down a "Country"). ISO 3166-1 is probably the most widely used standard - and its codes can easily be translated to human-readable names using packages like pycountry.

nevrome commented 3 years ago

This is already intended: https://poseidon-framework.github.io/#/janno_details?id=spatial-position

The mismatch between these short definitions and the long explanations is an issue - maybe we should get rid of the short definitions here and reorganize the long ones instead :thinking:

Thanks for going through this, @xrotwang!

xrotwang commented 3 years ago

Ah, ok. I'd still recommend alpha-2 codes rather than "short name", because short names contain somewhat unexpected things like "Bahamas (the)" - so they aren't something easily produced either.

Btw. let me know if this is not a good time to look through this.

nevrome commented 3 years ago

Ah - didn't know that. That's a good point! So maybe we should switch to that. So far we're not validating this, but we really should at some point.

It's a perfect time to look through this. But our team is small and it will take some time to actually pick it up and implement your suggestions.

stschiff commented 3 years ago

Yes, I like the idea of switching to alpha-2 encoding. I definitely support switching to this in our central repository, but I'm a bit reluctant to enforce it at the package format level, because it would make it harder for people to set up a quick private package. The moment they put "Germany" into the Janno file, the validator would immediately bug them... perhaps we can find a way to downgrade that to a recommendation, issuing a warning, rather than a full-blown parsing error...

xrotwang commented 3 years ago

Country metadata may not be crucial, so convenience may be better than accuracy here. Also there's only about 200 of these - so if it gets messy, it will be a small mess :)

Letting people refer to languages by name for decades - OTOH - led to quite a mess ... So yes, it's a trade-off.

stschiff commented 3 years ago

Yes, I hear you. Definitely something we need to make a decision on.

nevrome commented 1 year ago

As this issue was raised again recently, I think we should move forward and switch to ISO-alpha2 or -alpha3 codes as defined here.

As @stschiff suggested, we should not validate this too strictly in trident, not least because countries change every now and then and new valid entities can arise overnight. A warning should be enough.
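
To make the "warning rather than error" idea concrete, here is a minimal Python sketch. It is not how trident implements validation; the function name `check_country_code` and the tiny code list are assumptions for illustration only, and a real check would need the complete ISO 3166-1 alpha-2 list.

```python
import warnings

# Illustrative excerpt only (assumption): a real check would load the
# complete ISO 3166-1 alpha-2 list instead of this handful of codes.
KNOWN_ALPHA2 = {"DE", "FR", "GB", "IT", "ES"}


def check_country_code(value):
    """Return True for a known alpha-2 code; otherwise warn and return False.

    Unknown values only trigger a warning, so a quick private package with
    "Germany" in this column would still be readable.
    """
    if value in KNOWN_ALPHA2:
        return True
    warnings.warn(
        f"Country value {value!r} is not a known ISO 3166-1 alpha-2 code",
        UserWarning,
    )
    return False
```

Because the check never raises, a newly created country code that is missing from the list would produce a warning at worst, not a failed parse.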

stschiff commented 1 year ago

Hmm, should we perhaps rather introduce a new field that is then validated strictly against ISO?

nevrome commented 1 year ago

Ok - that's a good idea. Strict validation is imho not possible, though, for the aforementioned reason. Or did you just mean: Print a warning?

stschiff commented 1 year ago

Yes, OK. Print a warning. What could be the field's name? How about Country_ISO? And could we allow both alpha2 and alpha3?

nevrome commented 1 year ago

What's the advantage of allowing both? It makes summary statistics more difficult, because you first have to unify the 2- and 3-letter codes that refer to the same country.

xrotwang commented 1 year ago

At least they are easy to detect and distinguish :)
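
A rough Python sketch of what "detect and distinguish" could look like: the two code types differ in length, so unifying them before computing summary statistics is a lookup. The helper name `to_alpha2` and the three-entry mapping are assumptions for illustration; a real mapping would cover the full ISO 3166-1 table.

```python
# Illustrative excerpt only (assumption): a real mapping would cover
# every ISO 3166-1 entry.
ALPHA3_TO_ALPHA2 = {"DEU": "DE", "FRA": "FR", "GBR": "GB"}


def to_alpha2(code):
    """Normalise a 2- or 3-letter country code to alpha-2.

    Two-letter input is passed through (upper-cased, not validated);
    three-letter input is looked up; anything else yields None.
    """
    code = code.strip().upper()
    if len(code) == 2:
        return code
    if len(code) == 3:
        return ALPHA3_TO_ALPHA2.get(code)  # None if not in the excerpt
    return None
```

With this, a column containing both "DE" and "DEU" would count as one country after normalisation, which is the extra step that allowing both formats would force on every summary.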

xrotwang commented 1 year ago

If choosing one standard over the other, I'd recommend alpha2. From my experience, alpha3 codes are more prone to being confused with (ISO 639-3) language codes.

stschiff commented 1 year ago

No, you're right, there is no good reason to allow both. Let's go with alpha2 then.

stschiff commented 1 year ago

Closed now with new introductions in #57 (schema release v2.7.0)