scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License
104 stars 23 forks source link

Integrate number-parser into dateparser and price-parser #61

Open Manish-210 opened 3 years ago

Manish-210 commented 3 years ago

One of the major use after building the number-parser might be to integrate it with price-parser and dateparser as suggested by @noviluni . There are many features that are similar in these libraries, consider:

4 in number-parser and #1 in price-parser. In both of these, the basic idea is to return the string of numbers mixed with words to return it as a number.

Example:

parse("1 million")                                                              
>>> '1000000'

parse("2k in USD")                                                              
>>> '2000 in USD'

This feature seems to be more favourable for number-parser and then integrate it with price-parser.

There are many similar features that are related to each other in many ways, hence integrating might be a good option, but I have a few questions related to it.

using both libraries in application or implementing this in backend

price1 = parse('115 millions de dollars (estimation)') # parse of number parser parse_price(price1) # parse_price of price parser Price(amount=Decimal('115000000'), currency='dollars')

I was thinking to take this as a part of the GSoC 2021 also, would like to hear your views.

Gallaecio commented 3 years ago

From the beginning, the goal has been for price-parser and Dateparser to make use of number-parser, in a way that is transparent to users. So, to your first question, I would go for the example 1, where users simply use price-parser, which uses number-parser underneath. When using price-parser, users do not even need to know that number-parser exists.

As for features that could be added to number-parser to benefit Dateparser:

Mind that the prior Google Summer of Code project that created the library already started with the integration, it simply did not finish: https://github.com/scrapinghub/dateparser/pull/711

arnavkapoor commented 3 years ago

Hi, @Manish-210; glad for the interest in the project. As @Gallaecio has mentioned, one of the objectives was to use number-parser in both price and date parser. The primary bottleneck in merging is not the final integration of the number-parser with either price-parser / date-parser (This could be just a single call to the number-parser library like this scrapinghub/dateparser#711 ). The more pertinent issue is that it becomes a dependency to two libraries with a large user base. Ideally, it should be a bit more mature (support more languages, more features/edge cases), better tests, and higher coverage.

Feel free to tag me anytime you need help with the codebase/ideas will try my best to respond. Keeping the spirit of open-source and open-discussion let's talk here instead of direct message.

Manish-210 commented 3 years ago

Thanks, @Gallaecio and @arnavkapoor I have looked at scrapinghub/dateparser#711 . I think I need to familiarize myself more with the code of dateparser and number-parser before I ask more related queries.

Just a general question for integrating dateparser and number-parser should I have to commit in scrapinghub/dateparser#711 or I have to start with a new PR.

and there does not exist any pull request for the integration of number-parser and price-parser, right ?

Gallaecio commented 3 years ago

should I have to commit in scrapinghub/dateparser#711 or I have to start with a new PR.

That pull request corresponds to the number_parser_incorporation branch of @arnavkapoor’s fork. To work on top of his changes, you could add his fork as a Git remote to your local clone of the Dateparser repository, create a new branch starting from it, push that new branch to your own fork, and create a new pull request from there.

there does not exist any pull request for the integration of number-parser and price-parser, right ?

None exists, no. The closest things is https://github.com/scrapinghub/price-parser/pull/11, but we should probably close that in favor of an approach that uses number-parser.

Manish-210 commented 3 years ago

Hello, @Gallaecio I have prepared a draft of my proposal on Google Docs: https://docs.google.com/document/d/16TLsg3NUljkC-FWJcpO3ZJbP68WoxDjBvP9mQWQfsZg/edit?usp=sharing

I'd be very happy about feedback.

Gallaecio commented 3 years ago

@Manish-210 It looks like a great start.

A few thoughts:

AmPhIbIaN26 commented 3 years ago

@Gallaecio @Manish-210 Is this issue solved or can i work on it??

Gallaecio commented 3 years ago

@AmPhIbIaN26 If you mean to participate in Google Summer of Code 2021 as a student and work on this, feel free to work on a proposal. See https://gsoc2021.zyte.com/participate for details.

If you mean as a regular contributor, since this is a big endeavor and already a target for GSoC proposals, it would be best not to work on this now, and let GSoC candidate students work on their proposals.

AmPhIbIaN26 commented 3 years ago

I am thinking of writing a proposal on it for Google Summer of Code 2021, thanks I'll look it up

AmPhIbIaN26 commented 3 years ago

@Gallaecio I wanted to ask, do I have to make number parser a part of both price and date parser?

Gallaecio commented 3 years ago

@AmPhIbIaN26 No, you don’t. You could focus on one of them, or you could only work on number-parser itself if you came come up with a set of features to implement that fits a GSoC project timeline.

AmPhIbIaN26 commented 3 years ago

@Gallaecio I looked up this issue would you recommend this for my proposal for GSoC 2021?

Gallaecio commented 3 years ago

@AmPhIbIaN26 If there are enough presently-used numerical system to fit the length of a GSoC project, yes, I think that would be a good choice.

AmPhIbIaN26 commented 3 years ago

Thanks I'll look it up.

Manish-210 commented 3 years ago

Thanks for the great feedback @Gallaecio I have come up with the following points according to your feedback,

for bullet three: week 9 stretch goal and week 10

I agree with your phrase, so I will eradicate these extra stretch goals and devote weeks 9 and 10 as extras weeks for integration.

for bullet two: merge all the new features of number parser into dateparser

Firstly, sorry for that absurd language. I was actually saying like there might be the requirement of supporting new languages for dateparser or any other feature that might be necessary while integrating. Also, if there are not any extra requirements then I was thinking to work on #848 or #239 (mentioned previously). I cannot think of implementing it in code because we have to first agree for the interface.

for bullet one: a more detailed action plan that shows that you are already familiar enough with the Dateparser and price-parser code base.

for the implementation of other features, I was thinking to rely on the issues generated while integrating. (point out if there is any misunderstanding ).

Sorry, if I missed any of your points. let me know any other points.