serpro69 / kotlin-faker

Port of a popular ruby faker gem written in kotlin. Generate realistically looking fake data such as names, addresses, banking details, and many more, that can be used for testing and data anonymization purposes.
https://serpro69.github.io/kotlin-faker/
MIT License
468 stars 41 forks source link

Random country code returned for address with locale `en-US` #231

Closed kukido closed 5 months ago

kukido commented 6 months ago

Looks like address generation got broken in version 1.14.0

@Test
fun address() {
    // 1.14.0 - broken
    // 1.13.0 - works
    val faker = Faker(
        fakerConfig { locale = "en-US" }
    )
    val address = faker.address
    assertThat(address.countryCode()).isEqualTo("US")
}
expected:<"[US]"> but was:<"[AF]">
Expected :"[US]"
Actual   :"[AF]"
kukido commented 6 months ago

https://github.com/serpro69/kotlin-faker/pull/176/files#diff-9b50c5ec0823d84e72cc58a7138c151623b0d88634a07042cbf31099ef3ad0ddL34

serpro69 commented 6 months ago

Hi @kukido , Thank you for opening this issue!

As of current master version , the en-US.yml file does not override country_code defined in en/address.yml, and the latter returns a list of codes of various countries, so the behavior you're seeing is to be expected.

This has likely changed between versions you mention because the yml files aren't managed by kotlin-faker, and I have a rule of using them as is (with a very few exceptions) to avoid maintenance overhead. So likely during one of the upstream pulls, the data in the address.yml has changed and country_code started "returning" a list of codes instead of a single code for a given country. And the current version of yml files in ruby-faker ( https://github.com/faker-ruby/faker/blob/e15e606d7afcac499f08e0b88fab3f494239174f/lib/locales/en/address.yml#L770 ) also returns a list of codes, so there's not much to do here that I can see.

May I ask what is your use-case for returning "US" as a single value from faker? From my point-of-view, current implementation also makes sense logically and covers a broader range of uses. I don't see many use-cases for returning just a single value, i.e. "US" string`, since in most cases it could just as well be hardcoded. One thing I can think of is you're testing multiple locales with Faker and want to return a country code based on the locale.

kukido commented 6 months ago

Thank you for quick response! Intuitively I would expect en locale to return any country code and if the country code is specified, like en-US, en-NZ, en-GB, etc. to return that specific country code. Otherwise what is the point of specifying the country code if random country code is returned, I have en for that 😅

Our use-case is simple, we create US-based addresses in most cases, so your proposal to hardcode US works, and that's what we have in the code right now.

It was just odd to see the behavior to change between versions as it was not mentioned in the changelog.

serpro69 commented 6 months ago

Hmm, yeah, I can see your point and agree that such behavior would be natural. But most of the current localized files don't override the address.country_code field unfortunately, hence the default (random from a list in en/address) is returned. To fix this we'd need to submit a PR to ruby-faker with the changes to localized yml files and then pull them here. Fixing them here would be the last resort and I'm not sure this particular issue justifies it. Doing so will make maintenance of the yml files much more harder since the manual changes in kotlin-faker yml files would be overwritten again in next "pull" from ruby-faker.

I can also see how it can be confusing when the behavior changes between versions. The problem is, I usually pull yml files from ruby-faker maybe once per minor version update, so it's not very frequent and hence done "in bulk" so to speak. This results in quite big changes and makes it very hard to do a proper review of, so I just make sure nothing gets broken from code perspective, but don't really verify that the data that's returned is "the same" (with a few exceptions, mostly when something gets broken in code). I guess one way to mitigate this would be to mention in the changelog all the data providers that were changed ?(Currently I just mention the addition of new functionality and a few other cases when I know something has changed significantly from previous version.)

I'll mark it as "needs fixing upstream for now", but feel free to propose an alternative solution if you can think of one :)

kukido commented 6 months ago

Looks like there's default_country_code in faker-ruby: https://github.com/faker-ruby/faker/commit/4d52bc1e148d5290da10cb2775e244e0023778b0 And for en-US it is US (source)

kukido commented 6 months ago

So ideally, the library should take defaultCountryCode for the locale, and then fallback to countryCode if does not exist. What do you think about that approach?

kukido commented 6 months ago

Looked into it a bit further, exactly the same bug was submitted with faker-ruby, after which they've added default_country_code:

Describe the bug
if locale is set, then Address.country_code does not consider it, but e.g. street/city/... do.

To Reproduce
irb(main):006:0> I18n.locale = 'da-DK'
=> "da-DK"
irb(main):009:0> Faker::Address.country_code
=> "CR"
irb(main):010:0> Faker::Address.city
=> "Jydelund"

Expected behavior
that the country_code act the same way?!
serpro69 commented 6 months ago

Yeah, I saw there's a "default_country_code" in the localized dictionaries and I've thought about using that, but kotlin-faker currently does not support non-default localized keys (e.g. a key that is present in en-US.yml but not present in en/*.yml ) I think this could be a good use-case to finally implement that :) Would you like to submit a PR for this change? Else I can take a look into this myself when I have some extra time for it.

serpro69 commented 6 months ago

I guess one simple way to implement this while avoiding having to deal with the whole "non-default keys" thing, would be to check the locale value, and then either fetch country_code if locales is en, or fetch default_country_code otherwise. On the downside, this would require all localized dictionaries to have the default_country_code present (which is not that easy to guarantee either), since there won't be any defaults to fall-back to in en/address.yml

serpro69 commented 5 months ago

This is now implemented and will be available in next version. Similar cases should also be quite simple to fix since this was done via supporting "alternative keys" in the yml.

Thanks for your inputs and suggestions, @kukido

kukido commented 5 months ago

Thank you for the fix @serpro69 ! As I understand, it will be available in 2.0, correct?

serpro69 commented 5 months ago

Yes, but you can also use the next 2.0 release-candidate if you don't want to wait until the stable release version is out ;) It should be available in snapshots also (once I fix publishing of snapshots...), but I wouldn't recommend those with the next major version since they build from master.