scrapinghub / price-parser

Extract price amount and currency symbol from a raw text string
BSD 3-Clause "New" or "Revised" License

Support for ISO 4217 currency codes #28

Open rthaenert opened 4 years ago

rthaenert commented 4 years ago

First of all: Nice library, thanks for creating it.

For converting between major currencies it would be nice to have the ISO 4217 code of the parsed price (EUR, USD, AUD, ...) as this is easier for handling exchange rates.

Is there any plan to support that?

lopuhin commented 4 years ago

This would be a great feature to have. Note that it's a bit tricky to implement, as the relation between currency symbols and ISO currency codes is not one-to-one (a single symbol can map to several codes), so we'll need to use another attribute, such as country, to determine whether $ means USD, AUD, HKD, SGD or something else.

rthaenert commented 4 years ago

Yes, there are many cases in which this mapping would result in more than one currency code.

Maybe it's a good idea to provide all matching currency codes, with the first element always being a major currency code (the most used ones, basically the ones outlined in https://en.wikipedia.org/wiki/Currency_pair)?

Like this:

  € => [ "EUR" ]
  $ => [ "USD", "AUD", "CAD", ...]
AU$ => [ "AUD" ]
[...]

To decide between the different $'s, the existing currency hint could be reused to get a precise mapping. For all cases in which it's unclear, the list with all possible values should be good enough.
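A minimal sketch of this idea, assuming a small hand-written table. `SYMBOL_TO_CODES` and `candidate_codes` are hypothetical names, not part of price-parser, and the hint is treated here simply as another symbol or an ISO code:

```python
from typing import List, Optional

# Hypothetical, deliberately incomplete table: symbol -> candidate ISO 4217
# codes, with the major currency listed first.
SYMBOL_TO_CODES = {
    "€": ["EUR"],
    "$": ["USD", "AUD", "CAD"],
    "AU$": ["AUD"],
}


def candidate_codes(symbol: str, hint: Optional[str] = None) -> List[str]:
    """Return candidate ISO codes for a symbol, narrowed by an optional hint."""
    candidates = SYMBOL_TO_CODES.get(symbol, [])
    if hint is not None:
        # Assumption: the hint is either a known symbol or an ISO code.
        hinted = SYMBOL_TO_CODES.get(hint, [hint.upper()])
        narrowed = [code for code in candidates if code in hinted]
        if narrowed:
            return narrowed
    # No hint, or a hint that matches nothing: return every candidate.
    return list(candidates)
```

A caller that does not care about the ambiguity could take the first element; a caller with more context could pass a hint or filter the list itself.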

What do you think?

lopuhin commented 4 years ago

That's an interesting option which I didn't consider before. That would mean that the caller which has more info regarding the context would be able to select the best variant. And the caller which does not care much could take all or first. So it seems that this approach can work well. :+1: Also this looks quite future-proof to me.

Gallaecio commented 4 years ago

An alternative/complementary approach would be Dateparser’s, where users pass a locale to the parser, and the parser returns a value based on the specified locale.
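A rough sketch of what a locale-driven lookup might look like. The table and the `code_for_locale` function are assumptions made for illustration; neither price-parser nor Dateparser is known to expose them:

```python
from typing import Optional

# Hypothetical table: (symbol, territory) -> ISO 4217 code.
LOCAL_MEANING = {
    ("$", "US"): "USD",
    ("$", "CA"): "CAD",
    ("$", "AU"): "AUD",
    ("kr", "SE"): "SEK",
    ("kr", "NO"): "NOK",
}


def code_for_locale(symbol: str, locale: str) -> Optional[str]:
    """Resolve a symbol using the territory part of a locale such as 'en_CA'."""
    # Accept both 'en_CA' and 'en-CA' and keep only the territory.
    territory = locale.replace("-", "_").split("_")[-1].upper()
    return LOCAL_MEANING.get((symbol, territory))
```

With this approach the ambiguity is resolved by the caller's locale rather than by returning a list, so the two ideas could complement each other.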

rpalsaxena commented 4 years ago

@Gallaecio @lopuhin Is there any update regarding this feature? If it's in development, would love to contribute. :)

Gallaecio commented 4 years ago

I don’t think there is anyone working on it at the moment.

Akay7 commented 4 years ago

As @Gallaecio suggests, it would be nicer if every locale were able to redefine currency symbols.

If no one is working on this, I can start work on this issue.

Gallaecio commented 4 years ago

There’s no pull request open so far, so feel free to go ahead.

ivanprado commented 3 years ago

FWIW List of circulating currencies: https://en.wikipedia.org/wiki/List_of_circulating_currencies and the support of currencies and locales in Babel: http://babel.pocoo.org/en/latest/api/numbers.html

ivsanro1 commented 2 years ago

There's a current implementation of this that I could add via PR.

This implementation works as follows:

  1. Given an input currency string (e.g. $, US$), it makes a fuzzy search (using python-Levenshtein) to select the best matching currency in a database.
  2. The currency codes of top matching currency(ies) are selected as "candidates" (they're in the database too). For example, for $ we'd have ['USD', 'CAD', 'AUD', ...], but for US$ we'd only have ['USD'] as candidates.
  3. We run a series of "disambiguation methods" to reduce the candidates list as much as possible. These disambiguation methods require additional external information like the plain text of the html of the webpage, the url, etc. This can greatly vary depending on the user's context.
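Steps 1 and 2 can be sketched roughly as follows. This is not the implementation described above: it swaps python-Levenshtein for the standard library's `difflib.SequenceMatcher` so that it runs without extra dependencies, and `CURRENCY_DB` is a hypothetical miniature database:

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical miniature "database": known currency strings -> ISO codes.
CURRENCY_DB: Dict[str, List[str]] = {
    "$": ["USD", "CAD", "AUD"],
    "US$": ["USD"],
    "AU$": ["AUD"],
    "€": ["EUR"],
    "SFr": ["CHF"],
}


def fuzzy_candidates(raw: str) -> List[str]:
    """Return the ISO codes of the best-matching database entry."""
    def score(known: str) -> float:
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, raw.casefold(), known.casefold()).ratio()

    # Step 1: pick the most similar known currency string.
    best = max(CURRENCY_DB, key=score)
    # Step 2: its codes become the candidate list.
    return CURRENCY_DB[best]
```

Because the sketch applies no similarity threshold, a completely unrelated string still returns the codes of some entry, which is the kind of false positive fuzzy matching can produce.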

Steps 1 and 2 could be added to price-parser and would not require further input from the user, i.e. they would not change the API:

>>> Price.fromstring('1200 $')
Price(amount=Decimal('1200'), currency='$', currency_codes=['USD', 'CAD', 'AUD', ...])

>>> Price.fromstring('1200 US$')
Price(amount=Decimal('1200'), currency='US$', currency_codes=['USD'])

Step 3 is a little trickier, as it would require more input from the user.

Some examples of how the API could be:

>>> # `hint_text` would be intended mainly for use with plain HTML
>>> Price.fromstring('1200 $', hint_text='<html><body>... currency="USD"...</body></html>')
Price(amount=Decimal('1200'), currency='$', currency_codes=['USD'])

>>> Price.fromstring('1200 $', hint_url='www.example.ca')
Price(amount=Decimal('1200'), currency='$', currency_codes=['CAD'])

However, in my opinion, this is beyond the scope of price-parser. I'd go for integrating only 1 and 2, and users would have their own way of selecting from the candidates list, as @lopuhin pointed out, since they'd have more context about their problem.
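As an illustration of what such caller-side selection might look like, here is a minimal sketch that narrows a candidate list using the top-level domain of a URL. `TLD_TO_CODE` and `narrow_by_url` are hypothetical and not part of price-parser:

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical table: country-code top-level domain -> ISO 4217 code.
TLD_TO_CODE = {"ca": "CAD", "au": "AUD", "us": "USD", "sg": "SGD"}


def narrow_by_url(candidates: List[str], url: str) -> List[str]:
    """Keep only the candidate implied by the URL's TLD, if there is one."""
    # urlparse only fills in the hostname when a '//' separator is present.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    code = TLD_TO_CODE.get(host.rsplit(".", 1)[-1])
    # Fall back to the full list when the TLD gives no usable signal.
    return [code] if code in candidates else list(candidates)
```

A generic domain such as `.com` leaves the list untouched, so the caller can still apply other signals afterwards.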

Additionally, I wanted to point out that price-parser sometimes does not find the currency, especially when it's not "standard". Here are some examples:

>>> Price.fromstring('1200 SFr')  # SFr is Swiss Franc. Currency code: CHF
Price(amount=Decimal('1200'), currency=None)

>>> Price.fromstring('1200 kz')  # "kz" is Angolan Kwanza. Currency code: AOA
Price(amount=Decimal('1200'), currency=None)

>>> Price.fromstring('دينار 1000')  # "دينار" is Bahraini dinar. Currency code: BHD
Price(amount=Decimal('1000'), currency=None)

>>> Price.fromstring('1000 BTC')  # "BTC" is Bitcoin. Currency code: BTC (not part of ISO 4217, but widely adopted)
Price(amount=Decimal('1000'), currency=None)

So, unfortunately, the fuzzy search won't be that useful here: it is meant for less standard currency strings and for finding currencies more robustly, but price-parser does not extract those strings in the first place. Its drawback is obvious: it can return wrong matches, especially because we don't use a similarity threshold to reject "far matches" that should not be used.

We have three options here:

lopuhin commented 2 years ago

Thank you @ivsanro1, an early comment on one point of your proposal:

"However, in my opinion, this is beyond the scope of price-parser, I'd go for integrating only 1 and 2, and the user would have their own way of selecting from the candidates list"

To me, disambiguation also looks useful, as price-parser is probably often used in a web data extraction context, where these hints make sense. In terms of the API, it could be the same, but the list of currencies could be smaller.

Also regarding the API, if we add the currency_codes attribute to Price, it also makes sense to add a currency_code property which would be non-empty in case this list has one element, to simplify the usage.
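A minimal sketch of such a property, using a hypothetical stand-in class rather than the real `Price` class, whose internals are not shown in this thread:

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class PriceSketch:
    """Hypothetical stand-in for Price; not the real price-parser class."""
    amount: Optional[Decimal]
    currency: Optional[str]
    currency_codes: List[str] = field(default_factory=list)

    @property
    def currency_code(self) -> Optional[str]:
        """The ISO code when it is unambiguous, otherwise None."""
        if len(self.currency_codes) == 1:
            return self.currency_codes[0]
        return None
```

Callers that only handle unambiguous prices could then check `currency_code` and ignore the list entirely.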

lopuhin commented 2 years ago

@ivsanro1 regarding your last question,

"Keep the feature and make price-parser find less typical currencies."

This looks best to me, but it can also be a different issue and a different PR. Even in its current state, the fuzzy matching looks useful, as we can pass the currency_hint to Price.fromstring.

kmike commented 2 years ago

Hey! Could you please elaborate, why is fuzzy search needed here? I wonder if it'd be better to hardcode more currency variations. Or is it problematic for some reason?

ivsanro1 commented 2 years ago

Fuzzy search is only needed if we want to allow for non-exact matches. However, hardcoding the variations is also a perfectly valid approach and we would not have to worry about false positives (or at least as many as we could potentially have with fuzzy search).
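For comparison, the hardcoded approach could be sketched as an exact lookup over normalized spellings. The `VARIATIONS` table is a hypothetical, incomplete example and not data shipped with price-parser:

```python
from typing import Dict, Optional

# Hypothetical, incomplete table of hardcoded spellings -> ISO 4217 code.
VARIATIONS: Dict[str, str] = {
    "sfr": "CHF",
    "sfr.": "CHF",
    "chf": "CHF",
    "kz": "AOA",
    "us$": "USD",
}


def exact_code(raw: str) -> Optional[str]:
    """Look up a currency string exactly, ignoring case and outer whitespace."""
    return VARIATIONS.get(raw.strip().casefold())
```

An unknown spelling returns `None` instead of a guess, which trades recall for the absence of false positives.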

In any case, this is only loosely related to currency_code (sorry for that). I mentioned it because it's related to the implementation I was describing.

umrashrf commented 9 months ago

My use case is using price-parser with Stripe amounts and currencies. Stripe requires a three-letter ISO currency code instead of a symbol such as $. https://docs.stripe.com/currencies?presentment-currency=MX

Right now I have to use this code.

# Currency symbols mapped to their three-letter ISO 4217 codes.
# TODO: Use this gist https://gist.github.com/jylopez/ba16be2ae55282d5cff07de65128de83
CURRENCY_CODES = {"MX$": "MXN", "C$": "CAD", "$": "USD"}


def fix_currency(currency):
    # Unknown symbols are returned unchanged.
    return CURRENCY_CODES.get(currency, currency)