microsoft / Recognizers-Text

Microsoft.Recognizers.Text provides recognition and resolution of numbers, units, date/time, etc. in multiple languages (ZH, EN, FR, ES, PT, DE, IT, TR, HI, NL. Partial support for JA, KO, AR, SV). Packages available at: https://www.nuget.org/profiles/Recognizers.Text, https://www.npmjs.com/~recognizers.text
MIT License

[EN Units] False positives for common/informal units #2977

Closed satya77 closed 1 year ago

satya77 commented 2 years ago
  1. The package makes a lot of mistakes with the unit Picometer, and most of its extractions are not relevant. The main problem is that it confuses the time suffix "pm" with picometer, for example:

Put out the bad news on New Year's Eve, or a Friday at 6pm, too late for the TV news. -> ['6pm']

However, takeaway deliveries are permitted only until 11pm under the government restrictions. -> ['11pm']

  2. Another problem is with Inch: almost any time it sees the word "in" after a number, it assumes it to be an inch:

It has also doubled its work force to 50,000 in 2020, becoming South Korea's third-largest private-sector employer. -> '50,000 in'

Last year alone, it said, it invested $443 million in the automation of its warehouses and increased its warehouse work force by 78 percent, to 28,400, to make its workers more efficient and lessen the workload. -> '443 million in'

  3. The next problem is with C:

Operating profit and funds from operations for 2020 were $765 million or 14.7C/ per security. -> '14.7c'

The March 2021 C.P.I. forecast is the median estimate in a Bloomberg survey of economists, as of the morning of April 12. -> '2021 c'

Qs cash profits for the first half of fiscal 2021, announced on Thursday, rose 9 per cent to $165 million, led by a turnaround in its retail banking arm, and its interim dividend by 11c to 17c a share. -> ['11c', '17c']

  4. The same applies to F:

Ford routinely sells more than 800,000 F-Series pickups -> '800,000 f'

Direct Relief, which has long worked with FedEx, also partnered with the shipping giant to charter a Boeing 777F to transport supplies to India free of cost -> '777f'

The 13F filing provides one of the first examples of how a hedge fund attempted to capitalize on the distressed remains of Archegos. -> '13f'

tellarin commented 1 year ago

Yeah... Unfortunately, false positives are common in rule-based systems. I've added some mitigation for all these cases and a couple more, except for the "million in" case. I'll open a bug for that one.
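
For anyone stuck on a version without the fix, here is a rough caller-side sketch of how the reported patterns could be filtered out after extraction. It is not the mitigation that went into the library. The function names `is_probable_false_positive` and `drop_false_positives` and the heuristics themselves are assumptions made up for illustration. It takes the original text plus the (start, end) character offsets of each extracted span and uses only the Python standard library.

```python
import re

# Hypothetical heuristics for illustration only; NOT the library's actual fix.
# A span is expected to look like "<number><optional space><optional °><unit>".
_SPAN = re.compile(
    r"^(?P<number>\d[\d.,]*(?:\s+(?:thousand|million|billion))?)"
    r"(?P<gap>\s*)(?P<degree>°?)(?P<unit>[A-Za-z]+)$"
)


def is_probable_false_positive(text: str, start: int, end: int) -> bool:
    """Guess whether text[start:end] is one of the false positives from this issue."""
    match = _SPAN.match(text[start:end])
    if match is None:
        return False
    number = match.group("number")
    unit = match.group("unit").lower()
    following = text[end:]

    if unit == "pm":
        # "6pm", "11pm": a whole number from 1 to 12 reads as a clock time.
        return number.isdigit() and 1 <= int(number) <= 12
    if unit == "in":
        # "50,000 in 2020": a space before "in" and another word after it
        # suggest a preposition rather than inches.
        return bool(match.group("gap")) and re.match(r"\s+\w", following) is not None
    if unit in ("c", "f"):
        # "13F filing", "800,000 F-Series", "11c a share": without a degree
        # sign, a lone C or F is treated as too ambiguous to keep.
        return match.group("degree") == ""
    return False


def drop_false_positives(text: str, spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep only the spans that do not match the heuristics above."""
    return [(s, e) for (s, e) in spans if not is_probable_false_positive(text, s, e)]


if __name__ == "__main__":
    sentence = "Ford routinely sells more than 800,000 F-Series pickups"
    begin = sentence.index("800,000")
    candidate = (begin, begin + len("800,000 F"))
    print(drop_false_positives(sentence, [candidate]))
```

These rules are deliberately conservative and will also reject some genuine measurements (for example "12 in wide" or "77F" written without a degree sign), so they trade recall for precision.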