john-friedman / SEC-Parsers

MIT License

Getting the subtitle of an Item in 10K #3

Closed msharifbd closed 1 month ago

msharifbd commented 1 month ago

Hi, is there any way to get the subtitles within the text of an Item? For example, I am using your module to parse Item 1C. The text comes through fine, but in many 10-Ks Item 1C has two subsections, each with a heading, and I need those subsection headings (titles). For example:

https://www.sec.gov/Archives/edgar/data/1785173/000095017024024008/etnb-20231231.htm

The 10-K linked above has an Item 1C with two subsection titles: Risk Management and Strategy, and Governance. When I parse it with sec-parsers, I easily get the text of Item 1C, but those two titles are not there. Is there any way to get the text along with the two subsection titles?

Thanks for the nice module.

john-friedman commented 1 month ago

Hi Sharif, subtitles are supported. To see the full tree use print(filing.get_title_tree())

The following code should work:

filing.get_node_text(filing.find_nodes_by_title('Risk Management and Strategy')[0])
filing.get_node_text(filing.find_nodes_by_title('Governance')[0])

Hope this helps!

msharifbd commented 1 month ago

Hi, this title varies a little across 10-Ks, e.g. Risk Management, Cybersecurity Risk Management, and so on. I am trying to download a large number of Item 1Cs filed since December 15, 2023. When I parse by title, e.g. filing.find_nodes_by_title('item 1c'), is there any way to have those titles included in the text? Also, please note that some Item 1Cs do not have such titles.

Also, is there any way to look up the subtitles in Item 1C by index, such as [0]?
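Editor's note: because the heading wording varies between filings, one way to recognise the variants named in this comment is a tolerant pattern match on the title string. The sketch below is independent of sec-parsers; the pattern and the helper name is_risk_management_title are illustrative assumptions, and real filings may use headings it does not cover.

```python
import re

# Hypothetical pattern covering only the heading variants mentioned in this thread.
RISK_TITLE = re.compile(
    r"^(cybersecurity\s+)?risk\s+management(\s+and\s+strategy)?$",
    re.IGNORECASE,
)

def is_risk_management_title(title: str) -> bool:
    """Return True if a subsection title looks like a risk-management heading."""
    return RISK_TITLE.match(title.strip()) is not None

titles = [
    "Risk Management and Strategy",
    "Cybersecurity Risk Management",
    "Risk management",
    "Governance",
]
matches = [t for t in titles if is_risk_management_title(t)]
```

Filings with no such heading simply produce an empty list, which covers the case of Item 1Cs without subtitles.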

john-friedman commented 1 month ago

Oh this is great! I haven't spent much time building quality-of-life features for users yet, so this conversation is very useful.

Let me make sure I understand what you want:

  1. A feature to extract the title of a section along with its text
  2. A way to access the subheadings of a section, e.g. for the item 1C above it would return risk management and strategy as well as governance

Is that correct?

msharifbd commented 1 month ago

Yes, you got it. Right now, when I extract Item 1C using ‘sec-parsers’, all of the item's text comes through except the subtitles, if there are any. What I want is for the extracted Item 1C text to include the subtitles as well. Thanks

john-friedman commented 1 month ago

Great! I'll answer your question, and then I have a follow-up for you:

SEC Parsers parses filings into xml, which you can work with by selecting filing.xml

  1. To get the title of a section: filing.xml.attrib['title']
  2. To get subheadings, first select your section, and apply .getchildren()

This probably isn't a satisfying answer, so I'll add a built-in method in future updates. I'm currently working on a major update, so it might be a week or so.

Do you have any more feature requests?
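Editor's note: the two steps in this comment (read the title attribute, then list the children) can be tried on a small hand-written tree. The sketch below uses the standard library's xml.etree.ElementTree rather than sec-parsers, and the element names and nesting are made up for illustration; the tree that sec-parsers actually produces may look different.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a parsed filing; element names are hypothetical.
doc = ET.fromstring(
    '<document>'
    '<part title="Part I">'
    '<item title="Item 1C. Cybersecurity">'
    '<section title="Risk Management and Strategy">We assess risks...</section>'
    '<section title="Governance">The board oversees...</section>'
    '</item>'
    '</part>'
    '</document>'
)

# Select the section whose title attribute starts with "Item 1C".
item1c = next(
    el for el in doc.iter()
    if el.attrib.get("title", "").startswith("Item 1C")
)

# Step 1: the section title lives in the 'title' attribute.
section_title = item1c.attrib["title"]

# Step 2: iterating an element yields its direct children, the same
# elements that .getchildren() returns on an lxml element.
subheadings = [child.attrib["title"] for child in item1c]
```

Here section_title holds the Item 1C heading and subheadings holds the two subsection titles in document order.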

msharifbd commented 1 month ago

Thank you very much. Right now, I do not have any more feature requests. Once you update, can you please give me a knock? Thanks again for your help and for your nice package.

john-friedman commented 1 month ago

Will do. Glad you're enjoying the package. I'll mark this issue complete after the update.

john-friedman commented 1 month ago

Hi Sharif, I just updated the package.

  1. To get the title of a section along with text, use filing.get_text_from_section(section,include_title=True)
  2. To get subsections, use filing.get_subsections_from_section(section)

Let me know if this works for you, and I'll close the issue. Btw, I have added you to the contributors.md.

msharifbd commented 1 month ago

Hi John: Thanks for your update. I run the following code -

url = 'https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)

I am trying to collect Item 1C. If you go to the URL above, you can see that Item 1C has two SUBSECTIONS: Risk Management and Strategy, and Governance. I want ALL of the text under Item 1C, including the two SUBSECTION titles. However, the final output of my code, item1c_text, includes nothing from the second SUBSECTION (Governance), although the first subsection (Risk Management and Strategy) is fine.

I am not sure how to use your code - filing.get_subsections_from_section(section).

Thanks again for your help.

john-friedman commented 1 month ago

Good catch! In this last update I modified the xml tree construction to be faster. In the code changeover, I masked the parsing_type variable. Fixing now.

john-friedman commented 1 month ago

Should be fixed with the newest version of sec-parsers v0.540.

from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('John Smith','johnsmoth@example.com')
url = 'https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)

subsections = sec_filing.get_subsections_from_section(item1c)
print([item.attrib['title'] for item in subsections])

Output

['Risk management and strategy', 'Governance']

msharifbd commented 1 month ago

Thanks for the response. I tried the following code -

from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('My Name','myemail@outlook.com')
url = 'https://www.sec.gov/Archives/edgar/data/1009759/000155837024009109/tmb-20230331x10k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)

subsections = sec_filing.get_subsections_from_section(item1c)
print([item.attrib['title'] for item in subsections])

and it shows the following error -

AttributeError: 'NoneType' object has no attribute 'attrib'
Cell In[98], line 1
----> 1 item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
Show Traceback

The problem is that there is no Item 1C section in this document. In this case, I want the output to be None. When I put this code in a for loop, item1c_text is fine because I get None, but for subsections I get the following error message:

AttributeError: 'NoneType' object has no attribute 'getchildren'
Cell In[99], line 1
----> 1 subsections = sec_filing.get_subsections_from_section(item1c)
Show Traceback

In the for loop, I got the same error for subsections.

Second Issue -

url = 'https://www.sec.gov/Archives/edgar/data/350868/000035086824000016/iti-20240331.htm'
url = 'https://www.sec.gov/Archives/edgar/data/88948/000143774924020198/senea20240331_10k.htm'

When I use either of the above URLs in the code, item1c_text is fine, but subsections is the problem. It shows ONLY one title; the Governance one is missing, even though Item 1C has two subtitles.

john-friedman commented 1 month ago

I'm a bit confused by your question:

  1. The URL listed has no Item 1C, so find_section_from_title returns None. Calling get_text_from_section or get_subsections_from_section on None raises an error. What behavior do you want here for None objects?

  2. This is an interesting problem, but out of the scope of this issue. I've opened a new issue addressing the problem. In the meantime you can use get_nested_subsections_from_section to get all descendant subsections. The name is a WIP, so if you have a more descriptive name, let me know.

item1c = sec_filing.find_section_from_title('Item 1C')
nested_subsections = sec_filing.get_nested_subsections_from_section(item1c)
print([item.attrib['title'] for item in nested_subsections])
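Editor's note: one way a direct-children lookup can return a single title while a descendant lookup returns both is when the second heading ends up nested under the first in the parsed tree. Whether that is what happened for the two filings above is not confirmed in this thread. The sketch below shows the difference on a made-up tree using the standard library's xml.etree.ElementTree, not sec-parsers itself.

```python
import xml.etree.ElementTree as ET

# Made-up tree in which "Governance" is nested under the first subsection.
item1c = ET.fromstring(
    '<item title="Item 1C">'
    '<section title="Risk Management and Strategy">'
    '<section title="Governance"/>'
    '</section>'
    '</item>'
)

# Direct children only: the nested heading is not seen.
direct = [child.attrib["title"] for child in item1c]

# iter() yields the element itself first and then every descendant,
# so the root is skipped to keep only the subsections.
nested = [el.attrib["title"] for el in item1c.iter() if el is not item1c]
```

In this toy tree, direct contains only the first heading, while nested contains both.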

msharifbd commented 1 month ago

Hi, Thanks for your response.

  1. It is OK. I misunderstood the issue.
  2. It is still going on. The name could be something like get_subsection_title_from_section.

john-friedman commented 1 month ago

  2. Can you clarify? I'm unsure what issue is ongoing.

msharifbd commented 1 month ago

Sorry for the confusion. Assume I have these url from which I would like to collect the data -

https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm
https://www.sec.gov/Archives/edgar/data/1009759/000155837024009109/tmb-20230331x10k.htm
https://www.sec.gov/Archives/edgar/data/1126741/000155837024009139/gsit-20240331x10k.htm
https://www.sec.gov/Archives/edgar/data/350868/000035086824000016/iti-20240331.htm
https://www.sec.gov/Archives/edgar/data/1616262/000095017024072997/rmcf-20240229.htm
https://www.sec.gov/Archives/edgar/data/764630/000149315224023724/form10-k.htm
https://www.sec.gov/Archives/edgar/data/88948/000143774924020198/senea20240331_10k.htm

Now when I run the following for loop function, it shows error -

# Create a list to store the Item 1c text
item1c_texts = []

# Iterate over each filing
for n, filing in enumerate(filings):
    url = filing['url']

    # Download and parse the filing
    html = download_sec_filing(url)
    sec_filing = Filing(html)
    sec_filing.parse()

    # Extract the text for Item 1c
    item1c = sec_filing.find_section_from_title('Item 1C')
    subsections = sec_filing.get_nested_subsections_from_section(item1c)

    # Bypass None
    try:
        item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
        subsections_title = print([item.attrib['title'] for item in subsections])
    except:
        item1c_text = None
        subsections_title = None

    item1c_texts.append({

        'Item 1c Text': item1c_text,
        'subsections_title':subsections_title,
        'url': url 

    })

# Create a DataFrame from the Item 1c text data
item1c_df = pd.DataFrame(item1c_texts)

It shows the following error -

AttributeError: 'NoneType' object has no attribute 'iterdescendants'
Cell In[94], line 13
     11 # Extract the text for Item 1c
     12 item1c = sec_filing.find_section_from_title('Item 1C')
---> 13 subsections = sec_filing.get_nested_subsections_from_section(item1c)
     14 # Bypass None
     15 try:
Show Traceback

I think my loop function is not correct as I am not very good at python.

john-friedman commented 1 month ago

No worries, we all start somewhere. The issue with your code is nonetype handling. This might help:

if item1c is not None:
    ...  # process the section
else:
    pass  # skip filings that have no Item 1C
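Editor's note: applied to the loop posted above, the None check could look like the sketch below. The helper name extract_item1c is hypothetical; the three Filing methods it calls are the ones used earlier in this thread. Two details of the posted loop are worth noting: get_nested_subsections_from_section was called before the try block, so the error was never caught, and the value returned by print() is always None, so the titles were never stored.

```python
def extract_item1c(sec_filing, url):
    """Build one result row; the Item 1C fields are None when the section is absent."""
    item1c = sec_filing.find_section_from_title('Item 1C')
    if item1c is None:
        # No Item 1C in this filing: record None instead of raising.
        return {'Item 1c Text': None, 'subsections_title': None, 'url': url}

    text = sec_filing.get_text_from_section(item1c, include_title=True)
    subsections = sec_filing.get_nested_subsections_from_section(item1c)
    # Keep the list of titles itself rather than the result of print().
    titles = [node.attrib['title'] for node in subsections]
    return {'Item 1c Text': text, 'subsections_title': titles, 'url': url}

# Usage inside the loop (needs sec-parsers and network access):
# rows = []
# for url in urls:
#     sec_filing = Filing(download_sec_filing(url))
#     sec_filing.parse()
#     rows.append(extract_item1c(sec_filing, url))
# item1c_df = pd.DataFrame(rows)
```

Checking for None explicitly, rather than wrapping everything in a bare except, keeps unrelated errors visible.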

I'm going to close the issue now, but before I do, I'm curious what your project is. Could you tell me a little about it?

msharifbd commented 1 month ago

I am trying to collect different sections of 10K and use it for my research purpose.

john-friedman commented 1 month ago

Very secretive! Well, good luck with it. I'm closing the issue now, as it's been resolved.

msharifbd commented 1 month ago

Thanks for your help. With your loop function it works. Actually, I am collecting the items for my academic research.

john-friedman commented 1 month ago

That's great! One of the goals of sec-parsers is to enable academic research. Happy to feature it in the readme when it's done. Best, John

msharifbd commented 1 month ago

Thanks for all your help. Can I contact you again if I face any new issue?

Thanks

Sharif

john-friedman commented 1 month ago

Sure. Feel free to submit feature requests as well. (e.g. this function would be useful, or can you support x filing)

msharifbd commented 1 month ago

Thank you very much. I appreciate it. I will definitely let you know.

Warm Regards,

Sharif

msharifbd commented 1 month ago

Hi, when I try to parse a large number of 10-Ks, it breaks on the following URL.

url = 'https://www.sec.gov/Archives/edgar/data/1793659/000179365923000010/rsi-20221231.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()

The error from the above code is -

AttributeError: 'NoneType' object has no attribute 'iterative_parse'
Cell In[6], line 1
----> 1 sec_filing.parse()
Show Traceback

Please note that this is not an issue with the for loop, as I ran the above code on its own.

Do you have any idea how it can be fixed? Thanks

john-friedman commented 1 month ago

Oh that's an interesting bug! The metadata for the SEC filing has a mistake: '10' instead of '10-K'. You can fix it by manually setting filing_type:

from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('My Name','myemail@outlook.com')
url = 'https://www.sec.gov/Archives/edgar/data/1793659/000179365923000010/rsi-20221231.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.set_filing_type('10-K')
sec_filing.parse()
print(sec_filing.get_title_tree())

Btw, update your version of the package. (The code for setting filing_type had a bug that I've now fixed).

P.S. Next time, can you open a new issue to post the bug? Helps with organization.
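Editor's note: when parsing many filings in a batch, the manual fix above can be applied only to the filings that fail. This is a sketch under stated assumptions: the helper name parse_with_fallback and its make_filing argument are hypothetical, and treating an AttributeError from parse() as "filing type not recognised" is inferred from the traceback in this thread, not documented behavior. It can also hide unrelated AttributeErrors, and forcing '10-K' only makes sense when every filing in the batch is known to be a 10-K.

```python
def parse_with_fallback(make_filing, fallback_type='10-K'):
    """Parse a filing, retrying once with an explicit filing type on failure.

    make_filing is a zero-argument callable that returns a fresh, unparsed
    filing object, e.g. lambda: Filing(html).
    """
    filing = make_filing()
    try:
        filing.parse()
    except AttributeError:
        # Assumption: the failure means the filing type was not recognised.
        # Start again from a fresh object in case parse() left partial state.
        filing = make_filing()
        filing.set_filing_type(fallback_type)
        filing.parse()
    return filing

# Usage (needs sec-parsers and network access):
# html = download_sec_filing(url)
# sec_filing = parse_with_fallback(lambda: Filing(html))
```

Filings that parse cleanly on the first attempt keep whatever filing type was detected for them.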

msharifbd commented 1 month ago

Thanks for your response. I will definitely open a new issue next time if there are similar kinds of issues. By the way, one last question: using sec-parsers, is there any way to get information from the 10-K such as cik, url, filing_date, reporting_date, and company name while parsing the document? To get that information I am using other tools, but I have found your module to be very fast.

john-friedman commented 1 month ago

Not yet, but that's a good idea. Moving this thread to https://github.com/john-friedman/SEC-Parsers/issues/7

john-friedman commented 14 hours ago

btw @msharifbd, it's probably no longer useful to you, but bulk downloading is now available using the datamule package https://github.com/john-friedman/datamule-python