miha42-github / company_dns

An open source microservice that provides company data from EDGAR and Wikipedia, plus SIC lookup.
https://miha42-github.github.io/company_dns/
Apache License 2.0

Data Lineage endpoint and features #25

Closed miha42-github closed 1 year ago

miha42-github commented 1 year ago

New Feature Proposal: Data Lineage

Facts are not always reliable and data sourcing can be suspect, so it is important to show where data originated. This feature set is intended to provide a digital map of the data source(s), both for the entire cached data set in general and for each specific endpoint.

What is Data Lineage?

According to Wikipedia, "Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process."

General Case: /lineage endpoint

A lineage endpoint is needed both for data stored in the cache and for endpoints that gather data dynamically. The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes. Each step should likewise be described in a JSON structure, and important included libraries can be referenced as the method used to capture the data. The report should also state when the cache was last updated locally and what was used to create it, and should call out both static and dynamic sources. Additional details will be added to this issue as the feature is designed in the sections below.

Ideas

  1. The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes
  2. All steps should be described, again in a JSON structure
  3. Important included libraries can be referenced as the method used to capture the data
  4. The report should include when the cache was last updated locally and what was used to create it
  5. Both static and dynamic sources should be called out
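
The ideas above could translate into a report shaped roughly like the following. This is only a sketch: the `build_lineage_report` helper and every field name (`generatedAt`, `cache`, `sources`, `steps`) are hypothetical and not the project's actual schema, and the timestamp is a placeholder.

```python
import json
from datetime import datetime, timezone


def build_lineage_report(cache_updated_at, static_sources, dynamic_sources, steps):
    """Assemble a lineage report as a plain dict (hypothetical schema)."""
    return {
        # When this report was produced, in UTC.
        "generatedAt": datetime.now(timezone.utc).isoformat(),
        # When the local cache was last rebuilt.
        "cache": {"lastUpdated": cache_updated_at},
        # Static sources feed the cache; dynamic ones are queried per request.
        "sources": {"static": static_sources, "dynamic": dynamic_sources},
        # Ordered description of how the data was captured.
        "steps": steps,
    }


report = build_lineage_report(
    cache_updated_at="2023-01-01T00:00:00+00:00",  # placeholder value
    static_sources={"sicData": "https://github.com/miha42-github/sic4-list"},
    dynamic_sources={"wikiData": "https://www.wikidata.org/wiki/Wikidata:Data_access"},
    steps=[
        {"order": 1, "action": "load SIC data into the cache", "library": None},
        {
            "order": 2,
            "action": "query Wikipedia at request time",
            "library": "https://pypi.org/project/wptools/",
        },
    ],
)

print(json.dumps(report, indent=2))
```

Keeping the report a plain dict means it serializes directly to JSON, so a client can trace sources without any custom parsing.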

Specific Case: lineage for each query endpoint

When a query is run, each endpoint should report the data source(s) including those inside the system via the cache. Ideally, these sources should be linkable, if they are digital, so those interested can follow the trail. The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field. If locally cached data is used in the result, that should be referred to as well. Note that there can be references to the general lineage endpoint.

Ideas

  1. The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field
  2. If locally cached data is used in the result, that should be referenced as well. Note that there can be references to the general lineage endpoint.
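
As a rough illustration of these two ideas, a query result could be wrapped with an extra field before it is returned. The `with_lineage` helper and the `lineage`, `sources`, `usedCache`, and `cacheLineage` field names are assumptions made for this sketch, not the service's real API.

```python
def with_lineage(result, source_urls, used_cache=False, lineage_endpoint="/lineage"):
    """Return a copy of `result` with a hypothetical 'lineage' field added."""
    enriched = dict(result)  # shallow copy so the caller's dict is untouched
    lineage = {"sources": dict(source_urls), "usedCache": used_cache}
    if used_cache:
        # Point back at the general lineage endpoint for cache details.
        lineage["cacheLineage"] = lineage_endpoint
    enriched["lineage"] = lineage
    return enriched


result = {"name": "INTERNATIONAL BUSINESS MACHINES CORP", "cik": "51143"}
enriched = with_lineage(
    result,
    source_urls={
        "wikipedia": "https://en.wikipedia.org/wiki/IBM",
        "edgar": "https://data.sec.gov/submissions/CIK0000051143.json",
    },
    used_cache=True,
)

print(enriched["lineage"])
```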

miha42-github commented 1 year ago

This has been implemented more simply than documented above. Data lineage is now a part of every endpoint's operation. For each endpoint, users of the service can see where the data came from and, ideally, which key Python modules are used to report on it.

Here is an example; the keys to note are "data" and "dependencies":

{
  "code": 200,
  "message": "Wikipedia data and EDGAR has been detected and merged for the company [International Business Machines Corporation].",
  "module": "Query-> merge_data",
  "data": {
    "name": "INTERNATIONAL BUSINESS MACHINES CORP",
    "cik": "51143",
    "sic": "3570",
    "sicDescription": "Computer & office Equipment",
    "tickers": [
      "NYSE",
      "IBM"
    ],
    "exchanges": [
      "NYSE"
    ],
    "ein": "130871985",
    "description": "The International Business Machines Corporation ( IBM ), nicknamed Big Blue , is an American multinational technology corporation headquartered in Armonk, New York and present in over 175 countries. It specializes in computer hardware, middleware, and software, and provides hosting and consulting services in areas ranging from mainframe computers to nanotechnology. IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, and has held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.  IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. For the next several decades, IBM would become an industry leader in several emerging technologies, including electric typewriters, electromechanical calculators, and personal computers. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the dominant computing platform, and the company produced 80 percent of computers in the U.S. and 70 percent of computers worldwide.  After pioneering the multipurpose microcomputer in the 1980s, which set the standard for personal computers, IBM began losing its market dominance to emerging competitors. Beginning in the 1990s, the company began downsizing its operations and divesting from commodity production, most notably selling its personal computer division to the Lenovo Group in 2005. IBM has since concentrated on computer services, software, supercomputers, and scientific research. Since 2000, its supercomputers have consistently ranked among the most powerful in the world, and in 2001 it became the first company to generate more than 3,000 patents in one year, beating this record in 2008 with over 4,000 patents. As of 2022, the company held 150,000 patents.  As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming language, and the UPC barcode.The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure. IBM employees or alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.  IBM is a publicly traded company and one of 30 companies in the Dow Jones Industrial Average. It is among the world's largest employers, with over 297,900 employees worldwide in 2022. Despite its relative decline within the technology sector, IBM is the seventh largest technology company by revenue, and 49th largest overall, according to _Fortune._ It is also consistently ranked among the world's most recognizable, valuable, and admired brands, with devoted following among many tech enthusiasts and consumers.",
    "website": [
      "https://www.ibm.com/",
      "https://www.ibm.com/uk-en",
      "https://www.ibm.com/us-en/",
      "https://www.ibm.com/de-de/"
    ],
    "category": "Large accelerated filer",
    "fiscalYearEnd": "1231",
    "stateOfIncorporation": "NY",
    "phone": "9144991900",
    "entityType": "operating",
    "companyFactsURL": "https://data.sec.gov/api/xbrl/companyfacts/CIK0000051143.json",
    "firmographicsURL": "https://data.sec.gov/submissions/CIK0000051143.json",
    "filingsURL": "https://www.sec.gov/cgi-bin/browse-edgar?CIK=51143&action=getcompany",
    "transactionsByIssuer": "https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=51143",
    "transactionsByOwner": "https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=51143",
    "city": "ARMONK",
    "stateProvince": "NY",
    "zipPostal": "10504",
    "address": "1 NEW ORCHARD RD",
    "industryGroup": "357",
    "industryGroupDescription": "Computer And Office Equipment",
    "majorGroup": "35",
    "majorGroupDescription": "Industrial And Commercial Machinery And Computer Equipment",
    "division": "D",
    "divisionDescription": "Manufacturing",
    "forms": {
      "2018-2-27-000104746918001117": {
        "filingIndex": "https://www.sec.gov/Archives/edgar/data/51143/000104746918001117/0001047469-18-001117-index.html",
        "formType": "10-K"
      }
    },
    "wikipediaURL": "https://en.wikipedia.org/wiki/IBM",
    "type": "Public company",
    "industry": [
      "software industry",
      "computer hardware",
      "IT service management",
      "information technology consulting",
      "information technology"
    ],
    "country": "United States of America",
    "isin": "US4592001014",
    "longitude": -73.7203574803287,
    "latitude": 41.113410466869084,
    "googleMaps": "https://www.google.com/maps/place/1%20New%20Orchard%20Rd%2C%20Armonk%2C%20New%20York%2C%2010504",
    "googleNews": "https://news.google.com/search?q=INTERNATIONAL%20BUSINESS%20MACHINES%20CORP",
    "googlePatents": "https://patents.google.com/?assignee=INTERNATIONAL%20BUSINESS%20MACHINES%20CORP",
    "googleFinance": "https://www.google.com/finance/quote/IBM:NYSE"
  },
  "dependencies": {
    "modules": {
      "edgar": "https://github.com/miha42-github/company_dns",
      "wikipedia": "https://github.com/miha42-github/company_dns",
      "wptools": "https://pypi.org/project/wptools/",
      "geopy": "https://pypi.org/project/geopy/"
    },
    "data": {
      "sicData": "https://github.com/miha42-github/sic4-list",
      "oshaSICQuery": "https://www.osha.gov/data/sic-search",
      "wikiData": "https://www.wikidata.org/wiki/Wikidata:Data_access"
    }
  }
}
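
A client could walk the "dependencies" key of a response like the one above to list where the data came from. The sketch below is an assumption about how a consumer might do that: `list_sources` is a hypothetical helper, and the sample payload is a trimmed copy of the example rather than a live response.

```python
import json


def list_sources(response):
    """Flatten the 'dependencies' block into (kind, name, url) tuples."""
    deps = response.get("dependencies", {})
    rows = []
    for kind in ("modules", "data"):
        # Sort by name so the listing is stable across runs.
        for name, url in sorted(deps.get(kind, {}).items()):
            rows.append((kind, name, url))
    return rows


# Trimmed copy of the example response, keeping only a few keys.
sample = json.loads("""
{
  "code": 200,
  "module": "Query-> merge_data",
  "data": {"name": "INTERNATIONAL BUSINESS MACHINES CORP", "cik": "51143"},
  "dependencies": {
    "modules": {
      "wptools": "https://pypi.org/project/wptools/",
      "geopy": "https://pypi.org/project/geopy/"
    },
    "data": {
      "sicData": "https://github.com/miha42-github/sic4-list"
    }
  }
}
""")

for kind, name, url in list_sources(sample):
    print(f"{kind:8} {name:10} {url}")
```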