leftmove / wallstreetlocal

Free and open-source stock tracking website for America's biggest money managers.
https://wallstreetlocal.com
MIT License

Query tables from plain text files #8

Open leftmove opened 4 months ago

leftmove commented 4 months ago

The most recent addition to wallstreetlocal was the ability to query XML files along with HTML files. The only remaining format to support is plain text (TXT).

The SEC's XML and HTML filings were barely structured enough to be queried accurately, but TXT poses an even harder challenge. The problem is inconsistency. While TXT tables can be read fairly easily by human eyes, they are too dissimilar from one another to query effectively.

Here are some minified examples.

<TABLE>            <C>                                              <C>
                                            FO     RM 13F IFORMATIONTABLE
                                            VALUE  SHARES/ SH/ PUT/ INVSTMT  OTHER  VOT    ING AUTRITY
NAME OF ISSUER     TITLE OF CLASS CUSIP     (X1000)PRN AMT PRN CALL DSCRETN  MANAGERSOLE   SHARED NONE
------------------------------------------- ------------------ ---- -------  -----------------------------
AFLAC INC          COMMON STOCK   001055102       3      71SH       DEFINED       71      0      0
AGL RESOURCES INC  COMMON STOCK   001204106     123    3025SH       DEFINED     3025      0      0
ABBOTT LABS COM    COMMON STOCK   002824100    1606   30519SH       DEFINED    27798   2721      0
ABERCROMBIE & FITCHCOMMON STOCK   002896207       0       2SH       DEFINED        2      0      0
AIR PRODUCTS & CHEMCOMMON STOCK   009158106   16728  175017SH       DEFINED   140030   2282  32705
AIRGAS INC         COMMON STOCK   009363102       4      52SH       DEFINED       52      0      0
</TABLE>
<TABLE>                   <C>   <C>        <C>       <C>                <C>   <C>   <C>
                                             VALUE                       INV.  OTH   vtng
      NAME OF ISSUER      CLASS    CUSIP    (x$1000)       SHARES        disc  MGRS  AUTH

Albertson College of Idaho Large Growth
ADC TELECOMMUNICATIO      COMM  000886101         $18         337.00     Sole  N/A   Sole
AFLAC INC                 COMM  001055102         $14         298.00     Sole  N/A   Sole
AES CORP                  COMM  00130H105         $18         232.00     Sole  N/A   Sole
AXA FINL INC              COMM  002451102         $18         504.00     Sole  N/A   Sole
ABBOTT LABS               COMM  002824100         $61       1,724.00     Sole  N/A   Sole
ABERCROMBIE & FITCH       COMM  002896207          $2         114.00     Sole  N/A   Sole
</TABLE>
<TABLE>
                                                             VALUE    SHARES/ SH/ PUT/ INVSTMT            -----VOTING AUTHORITY-----
  NAME OF ISSUER                 -TITLE OF CLASS- --CUSIP-- (X$1000)  PRN AMT PRN CALL DSCRETN -MANAGERS-     SOLE   SHARED     NONE
                                 <C>                                              <C>
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      112 SH       DEFINED 05              112        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       31      748 SH       DEFINED 05              748        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      130 SH       DEFINED 06              130        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123      246     5894 SH       DEFINED 14             3089        0     2805
D DAIMLERCHRYSLER AG             ORD              D1668R123      118     2832 SH       DEFINED 14             2104      604      124
D DAIMLERCHRYSLER AG             ORD              D1668R123        8      200 SH       DEFINED 29              200        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       63     1510 SH       DEFINED 41                0        0     1510
</TABLE>

The column sizes, names, and overall formatting of each table change too often for any meaningful code to be written. Without writing a gargantuan amount of code, or using AI (which is expensive), there doesn't seem to be a practical way to query stocks like this.
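
To illustrate, even naive whitespace splitting breaks down on rows from the same table, because issuer names contain spaces and adjacent columns sometimes run together:

# Whitespace splitting on two rows of the first example table above.
# The name/class boundary disappears entirely in the second row.
row_a = "AFLAC INC          COMMON STOCK   001055102       3      71SH"
row_b = "ABERCROMBIE & FITCHCOMMON STOCK   002896207       0       2SH"

print(row_a.split())  # ['AFLAC', 'INC', 'COMMON', 'STOCK', '001055102', '3', '71SH']
print(row_b.split())  # ['ABERCROMBIE', '&', 'FITCHCOMMON', 'STOCK', '002896207', '0', '2SH']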

There should be a better, more effective method of taking these TXT tables and turning them into usable, structured data.

ahiddenproxy commented 4 months ago

I'd be happy to help you take this on. I've faced similar data inconsistencies when building tooling for real estate projects.

leftmove commented 3 months ago

@ahiddenproxy I have assigned you; please let me know if you need any additional context!

Thank you!

(Sorry for the late reply)

leftmove commented 3 months ago

I should also mention that you don't need to worry about the pedantic details of the source code if you don't wish to.

I can point you to the function where text input is given as a table, and you can work from there on creating a function that returns structured data.

As long as some sort of querying function is created to turn the table into structured data, I can integrate said function into the code so it works with everything else properly.
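
For reference, each parsed row just needs to look roughly like the dicts the other scrapers yield. Illustrative values only:

# Illustrative only: the shape of one parsed row, mirroring the dicts
# yielded by the existing HTML and XML scrapers (shown in the next comment).
{
    "name": "AFLAC INC",
    "ticker": "NA",
    "class": "COMMON STOCK",
    "market_value": 3.0,            # value column as reported in the filing
    "shares_held": 71.0,
    "cusip": "001055102",
    "date": "<report date>",        # taken from the filing metadata
    "access_number": "<accession number>",
}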

leftmove commented 3 months ago

Here is the relevant code.

def scrape_txt(cik, filing, directory):
    pass

def scrape_html(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    stock_soup = BeautifulSoup(data, "lxml")
    stock_table = stock_soup.find_all("table")[3]
    stock_fields = stock_table.find_all("tr")[1:3]
    stock_rows = stock_table.find_all("tr")[3:]

    (
        nameColumn,
        classColumn,
        cusipColumn,
        valueColumn,
        shrsColumn,
        multiplier,
    ) = sort_rows(stock_fields[0], stock_fields[1])

    row_stocks = {}
    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for row in stock_rows:
        columns = row.find_all("td")

        if empty:
            yield None

        stock_cusip = columns[cusipColumn].text
        stock_name = columns[nameColumn].text
        stock_value = float(columns[valueColumn].text.replace(",", "")) * multiplier
        stock_shrs_amt = float(columns[shrsColumn].text.replace(",", ""))
        stock_class = columns[classColumn].text

        row_stock = row_stocks.get(stock_cusip)

        if row_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = row_stock
            new_stock["market_value"] = row_stock["market_value"] + stock_value
            new_stock["shares_held"] = row_stock["shares_held"] + stock_shrs_amt

        row_stocks[stock_cusip] = new_stock
        yield new_stock

def scrape_xml(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    data_str = data.decode(json.detect_encoding(data))
    tree = ElementTree.fromstring(data_str)

    info_table = {}
    namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for info in tree.findall("ns:infoTable", namespace):
        if empty:
            yield None

        stock_cusip = info.find("ns:cusip", namespace).text
        stock_name = info.find("ns:nameOfIssuer", namespace).text
        stock_value = float(info.find("ns:value", namespace).text.replace(",", ""))
        stock_shrs_amt = float(
            info.find("ns:shrsOrPrnAmt", namespace)
            .find("ns:sshPrnamt", namespace)
            .text.replace(",", "")
        )
        stock_class = info.find("ns:titleOfClass", namespace).text

        info_stock = info_table.get(stock_cusip)  # look up an already-seen row for this CUSIP

        if info_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = info_stock
            new_stock["market_value"] = info_stock["market_value"] + stock_value
            new_stock["shares_held"] = info_stock["shares_held"] + stock_shrs_amt

        info_table[stock_cusip] = new_stock
        yield new_stock

info_table_key = ["INFORMATION TABLE", "Complete submission text file"]

def scrape_stocks(cik, data, filing, empty=False):
    index_soup = BeautifulSoup(data, "lxml")
    rows = index_soup.find_all("tr")
    directory = {"link": None, "type": None}
    for row in rows:
        items = list(map(lambda b: b.text.strip(), row))
        if any(item in items for item in info_table_key):
            link = row.find("a")
            href = link["href"]

            is_xml = href.endswith(".xml")
            is_html = "xslForm" in href
            is_txt = False

            directory_type = directory["type"]
            if is_xml and not is_html:
                directory["type"] = "xml"
                directory["link"] = href
            elif is_xml and is_html and directory_type != "xml":
                directory["type"] = "html"
                directory["link"] = href
            elif is_txt and directory_type != "xml" and directory_type != "html":
                directory["type"] = "txt"
                directory["link"] = href

    link = directory["link"]
    form = directory["type"]
    if not link:
        filing_stocks = {}
        return filing_stocks

    if form == "html":
        scrape_document = scrape_html
    elif form == "xml":
        scrape_document = scrape_xml
    # elif form == "txt":
    #     scrape_document = scrape_txt

    if empty:
        for i, _ in enumerate(scrape_document(cik, filing, link, empty)):
            row_count = i
        return row_count

    update_list = [new_stock for new_stock in scrape_document(cik, filing, link)]
    updated_stocks = process_names(update_list, cik)

    filing_stocks = {}
    for new_stock in update_list:
        stock_cusip = new_stock["cusip"]
        updated_stock = updated_stocks[stock_cusip]

        updated_stock.pop("_id", None)
        new_stock.update(updated_stocks[stock_cusip])

        filing_stocks[stock_cusip] = new_stock

    return filing_stocks

The code at the link may be slightly different from the code provided above, as some reliability edits have been made.

leftmove commented 2 months ago

It's been a month since I assigned @ahiddenproxy, and there has not yet been any code committed. Therefore, I will be unassigning them.

This issue is up for grabs.

@ahiddenproxy Let me know if you still wish to work on this - I'll reassign you as soon as I can.

chickenleaf commented 1 month ago

@leftmove Hello!! I'd like to try my luck on this!! Has anybody tried this with regular expressions? What's the most accurate regex anyone has gotten for these tables so far?

parthmshah1302 commented 1 month ago

@leftmove Can I be assigned to this?

jass024 commented 1 month ago

Hi @leftmove, I've also faced a quite similar problem while processing text data. Since I'm new to open-source contribution, I'd like to try solving this one. Waiting for a response from your side.

leftmove commented 1 month ago

@chickenleaf @parthmshah1302 @jass024

Sorry for the late response, guys.

I'll assign all three of you for now, and I encourage you all to collaborate and discuss the issue here, or anywhere else you wish.

I suspect three people is too many, but I'll let you guys decide: either give this issue your best shot, or let me know if you wish to be unassigned. If it's the latter, thanks anyway for trying.

If you do decide to work on this issue though, make sure to add a comment describing what you will attempt, before you attempt it. This is so that others do not waste time and effort trying to do what you've already accomplished.

Thanks.

leftmove commented 1 month ago

@chickenleaf

I tried regex myself, but the data seems far too inconsistent. It might be worth a shot though, since I am a regex amateur.

The closest I got was using Pandas, as that has a text table feature, but again, the data was too inconsistent.
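
For reference, a minimal sketch of that approach, assuming the feature in question is the fixed-width reader pandas.read_fwf; its column-boundary inference is exactly what breaks on inconsistent filings:

# A minimal sketch of the Pandas route using read_fwf, which infers
# fixed-width column boundaries; the inference only holds when the
# table layout is consistent, which these filings are not.
import io
import pandas as pd

raw_table = """\
NAME OF ISSUER     TITLE OF CLASS CUSIP     VALUE   SHARES
AFLAC INC          COMMON STOCK   001055102       3      71
AGL RESOURCES INC  COMMON STOCK   001204106     123    3025
"""

frame = pd.read_fwf(io.StringIO(raw_table))  # colspecs="infer" by default
print(frame.to_dict(orient="records"))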

jass024 commented 1 month ago

Hi @leftmove, thanks for assigning me. I'll share my insights and approach on how I think it can be solved, but as of now I'm tied up with other work that requires my time. I'll be back within 3 days. Happy issue solving :)

leftmove commented 1 month ago

I have just realized that the relevant code I linked above is no longer valid.

Here is the new link and examples of current querying methods.

# What this issue is about
def scrape_txt(cik, filing, directory):
    pass

# HTML Querying
def scrape_html(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    stock_soup = BeautifulSoup(data, "lxml")
    stock_table = stock_soup.find_all("table")[3]
    stock_fields = stock_table.find_all("tr")[1:3]
    stock_rows = stock_table.find_all("tr")[3:]

    (
        nameColumn,
        classColumn,
        cusipColumn,
        valueColumn,
        shrsColumn,
        multiplier,
    ) = sort_rows(stock_fields[0], stock_fields[1])

    row_stocks = {}
    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for row in stock_rows:
        columns = row.find_all("td")

        if empty:
            yield None

        stock_cusip = columns[cusipColumn].text
        stock_name = columns[nameColumn].text
        stock_value = float(columns[valueColumn].text.replace(",", "")) * multiplier
        stock_shrs_amt = float(columns[shrsColumn].text.replace(",", ""))
        stock_class = columns[classColumn].text

        row_stock = row_stocks.get(stock_cusip)

        if row_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = row_stock
            new_stock["market_value"] = row_stock["market_value"] + stock_value
            new_stock["shares_held"] = row_stock["shares_held"] + stock_shrs_amt

        row_stocks[stock_cusip] = new_stock
        yield new_stock

# XML Querying
def scrape_xml(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    data_str = data.decode(json.detect_encoding(data))
    tree = ElementTree.fromstring(data_str)

    info_table = {}
    namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for info in tree.findall("ns:infoTable", namespace):
        if empty:
            yield None

        stock_cusip = info.find("ns:cusip", namespace).text
        stock_name = info.find("ns:nameOfIssuer", namespace).text
        stock_value = float(info.find("ns:value", namespace).text.replace(",", ""))
        stock_shrs_amt = float(
            info.find("ns:shrsOrPrnAmt", namespace)
            .find("ns:sshPrnamt", namespace)
            .text.replace(",", "")
        )
        stock_class = info.find("ns:titleOfClass", namespace).text

        info_stock = info_table.get(stock_cusip)  # look up an already-seen row for this CUSIP

        if info_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = info_stock
            new_stock["market_value"] = info_stock["market_value"] + stock_value
            new_stock["shares_held"] = info_stock["shares_held"] + stock_shrs_amt

        info_table[stock_cusip] = new_stock
        yield new_stock

info_table_key = ["INFORMATION TABLE", "Complete submission text file"]

The objective is not necessarily to complete the scrape_txt function, but to create a function that returns data structured in the same way as scrape_html and scrape_xml.
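
To make that concrete, here is a rough, untested sketch of one possible heuristic (the function name and structure are placeholders, not from the codebase): derive column spans from the dashed ruler line that some TXT tables include, then slice each data row at those spans.

# Rough sketch only: split a TXT information table into cells by using
# the '------' ruler line as a column guide. Tables without a ruler
# (like the second example above) would need a different heuristic.
import re

def split_on_ruler(table_text):
    """Yield one list of cell strings per data row, sliced at the ruler's dash runs."""
    lines = table_text.splitlines()
    ruler_index = next(
        (i for i, line in enumerate(lines) if "-" in line and set(line.strip()) <= {"-", " "}),
        None,
    )
    if ruler_index is None:
        return  # no ruler line to anchor on
    # Each run of dashes marks the horizontal extent of one column.
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", lines[ruler_index])]
    for line in lines[ruler_index + 1:]:
        if line.strip():
            yield [line[start:end].strip() for start, end in spans]

Each yielded row would still need to be mapped onto the same keys scrape_html and scrape_xml produce (name, class, cusip, market_value, shares_held, and so on), and tables without a ruler line need a different strategy entirely.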

chickenleaf commented 1 month ago

@leftmove Hello, could you share some more details on what you've tried with Pandas? Regex is not getting it done for me; I'm thinking of making an attempt at NLP.

leftmove commented 1 month ago

@chickenleaf Sorry to disappoint, but I didn't get far.

Pandas needs a consistent layout; if you can achieve one, it's fairly easy to get structured data (see this StackOverflow question, among others if you do some Googling).

Getting a consistent layout, though, is obviously easier said than done.

NLP sounds like it could work, but from what I know (I'm a novice), it will definitely take some effort. If you need training data, just let me know and I will gather as much as I can.

Otherwise, I don't really know how to help. I commend you for taking on such a brave task, and hopefully it does not take too much of your time. I will try to help in any way I can.

jass024 commented 1 week ago

Hi @leftmove, I googled around and came to the conclusion that ML would take a lot of work for this, so I think regex is the way to go. I'm trying to solve it using regex; let's see where it leads. If the issue gets solved, congrats to us; otherwise, there's still learning in it.
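
For the regex route, one possible starting point (a sketch only, tested against nothing beyond the example rows above) is to anchor each data row on its 9-character CUSIP and capture the fields around it:

# Sketch: anchor on the CUSIP (8 alphanumerics plus a numeric check digit),
# which appears in every layout above, and capture the surrounding fields.
import re

ROW = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<cls>COMMON STOCK|COM\w*|ORD)\s+"      # a few observed class spellings
    r"(?P<cusip>[0-9A-Z]{8}[0-9])\s+"
    r"\$?(?P<value>[\d,.]+)\s+"
    r"(?P<shares>[\d,.]+)"
)

line = "AFLAC INC          COMMON STOCK   001055102       3      71SH       DEFINED       71      0      0"
match = ROW.match(line)
if match:
    print(match.groupdict())
    # {'name': 'AFLAC INC', 'cls': 'COMMON STOCK', 'cusip': '001055102', 'value': '3', 'shares': '71'}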