miha42-github / company_dns

An open source micro-service that provides company data from EDGAR and Wikipedia, plus SIC lookup.
https://miha42-github.github.io/company_dns/
Apache License 2.0

Add ticker data from Wikipedia #7

Closed miha42-github closed 1 year ago

miha42-github commented 1 year ago

Accessing the ticker data from Wikipedia is problematic. We've researched a Python regex (see below) that should capture the ticker data for many, if not most, companies. Work remains to implement and test it, and then to find alternative ways to capture the rest.

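import re
# fomoco is assumed to hold the ticker wikitext for Ford (see the Ford Motors example below)
pattern = r'\{\{.+?\|.+?\}\}'
re.findall(pattern, fomoco)[-1]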

The above might match too much, but we can start there.
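To make the regex above easier to try, here is a minimal sketch that wraps it in a helper and runs it on two of the sample strings from this issue. The helper name `last_template` is hypothetical and not part of company_dns, and returning `None` when nothing matches is an assumption rather than a decided behavior.

```python
import re

# Pattern quoted from this issue, written as a raw string.
PATTERN = r'\{\{.+?\|.+?\}\}'


def last_template(wikitext):
    """Return the last {{...|...}} template found in wikitext, or None.

    Hypothetical helper for illustration only.
    """
    matches = re.findall(PATTERN, wikitext)
    return matches[-1] if matches else None


# Sample strings copied from the examples in this issue.
ford = ("{{unbulleted list|nyse|F|[[S&P 100|S&P 100 Component]]|"
        "[[S&P 500|S&P 500 Component]]}} {{nyse|F}}")
sap = "{{FWB|SAP}} <br />[[DAX|DAX Component]]"

print(last_template(ford))  # {{nyse|F}}
print(last_template(sap))   # {{FWB|SAP}}
```

Note that `.` does not match newlines by default, so on the multi-line samples the pattern can only match templates that sit entirely on one line.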

Some examples to work with are below:

Hitachi: 
{{plainlist|
*|TYO|6501|
*|NAG|6501|
*[[Nikkei 225]] component (TYO)
*[[TOPIX]] Core30 component (TYO)}} {{TYO|6501}} * {{NAG|6501}} *[[Nikkei 225]] component (TYO)
*[[TOPIX]] Core30 component (TYO)

Aramco:
{{Saudi Stock Exchange|2222}}

IBM:
{{ubl|NYSE|IBM|[[DJIA]] component|[[S&P 100]] component|[[S&P 500]] component}} {{NYSE|IBM}}

Fujitsu:
{{Unbulleted list|tyo|6702|NAG|6702|[[Nikkei 225]] component (TYO)|[[TOPIX]] Large70 component (TYO)}} {{tyo|6702}} {{NAG|6702}}

Ford Motors:
{{unbulleted list|nyse|F|[[S&P 100|S&P 100 Component]]|[[S&P 500|S&P 500 Component]]}} {{nyse|F}}

HSBC:
{{plainlist|
*|LSE|HSBA|
*|SEHK|5|
*|NYSE|HSBC|
*|bsx|id|=|1077223879|HSBC.BH|
*[[FTSE 100 Index|FTSE 100]] component (HSBA)
*[[Hang Seng Index|Hang Seng]] component (5)}} {{LSE|HSBA}} * {{SEHK|5}} * {{NYSE|HSBC}} * {{bsx|id|=|1077223879|HSBC.BH}} *[[FTSE 100 Index|FTSE 100]] component (HSBA)
*[[Hang Seng Index|Hang Seng]] component (5)

SAP:
{{FWB|SAP}} <br />[[DAX|DAX Component]]

Tesla Inc:
{{Unbulleted list
   | |NASDAQ|TSLA|
   | [[Nasdaq-100]] component
   | [[S&P 100]] component
   | [[S&P 500]] component}} {{NASDAQ|TSLA}}

Sony:
{{plainlist|
* |Tyo|6758|
* |Nyse|SONY|
* [[Nikkei 225]] component (6758)
* [[TOPIX]] Core30 component (6758)}} {{Tyo|6758}} * {{Nyse|SONY}} * [[Nikkei 225]] component (6758)
* [[TOPIX]] Core30 component (6758)

Teradata:
{{Unbulleted list|nyse|TDC|[[S&P 400]] component}} {{nyse|TDC}}

An idea to consider is to create some if/then/else logic that checks for plainlist, unbulleted list, no wrapper at all, etc. This would let different regexes act on the strings.
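The if/then/else idea could be sketched roughly as below. This is only a starting point tuned to the samples in this issue, not a tested implementation: the names `wrapper_kind`, `unbulleted_pairs`, and `extract_tickers` are hypothetical, the patterns are assumptions, and the unbulleted-list branch uses a plain split on `|` instead of a regex because the wikilinks in those strings also contain pipes.

```python
import re

# Hypothetical per-wrapper rules, tuned only to the samples in this issue.
# Bare template with exactly two fields, e.g. {{FWB|SAP}}.
BARE = re.compile(r'\{\{[ \t]*([^{}|\n]+?)[ \t]*\|[ \t]*([^{}|\n]+?)[ \t]*\}\}')
# Plainlist row with exactly two fields, e.g. "*|TYO|6501|".
PLAINLIST_ROW = re.compile(
    r'^\*[ \t]*\|[ \t]*([^|\n\[\]]+?)[ \t]*\|[ \t]*([^|\n\[\]]+?)[ \t]*\|[ \t]*$',
    re.MULTILINE,
)


def wrapper_kind(wikitext):
    """Classify the string by the template that opens it."""
    head = wikitext.lstrip().lower()
    if head.startswith('{{plainlist'):
        return 'plainlist'
    if head.startswith(('{{unbulleted list', '{{ubl')):
        return 'unbulleted'
    return 'bare'


def unbulleted_pairs(wikitext):
    """Pair up the fields that precede the first wikilink in the wrapper."""
    body = wikitext.split('}}', 1)[0]          # keep only the wrapper template
    values = []
    for field in body.split('|')[1:]:          # drop the wrapper name
        field = field.strip()
        if not field:
            continue                           # skip empty fields
        if field.startswith('[['):
            break                              # index components start here
        values.append(field)
    return list(zip(values[0::2], values[1::2]))


def extract_tickers(wikitext):
    """Return (EXCHANGE, ticker) pairs using a rule chosen per wrapper kind."""
    kind = wrapper_kind(wikitext)
    if kind == 'plainlist':
        pairs = PLAINLIST_ROW.findall(wikitext)
    elif kind == 'unbulleted':
        pairs = unbulleted_pairs(wikitext)
    else:
        pairs = BARE.findall(wikitext)
    return [(exchange.upper(), ticker) for exchange, ticker in pairs]


# Sample strings copied from the examples in this issue.
sony = ("{{plainlist|\n* |Tyo|6758|\n* |Nyse|SONY|\n"
        "* [[Nikkei 225]] component (6758)\n"
        "* [[TOPIX]] Core30 component (6758)}}")
fujitsu = ("{{Unbulleted list|tyo|6702|NAG|6702|[[Nikkei 225]] component (TYO)|"
           "[[TOPIX]] Large70 component (TYO)}} {{tyo|6702}} {{NAG|6702}}")

print(extract_tickers(sony))     # [('TYO', '6758'), ('NYSE', 'SONY')]
print(extract_tickers(fujitsu))  # [('TYO', '6702'), ('NAG', '6702')]
```

With these rules the HSBC `bsx` row is skipped because it has more than two fields, so it would still need a rule of its own.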