Closed msharifbd closed 1 month ago
Hi Sharif, subtitles are supported. To see the full tree use print(filing.get_title_tree())
The following code should work:
filing.get_node_text(filing.find_nodes_by_title('Risk Management and Strategy')[0])
filing.get_node_text(filing.find_nodes_by_title('Governance')[0])
Hope this helps!
Hi,
This title has some small variations across different 10-Ks, such as Risk Management, Cybersecurity Risk Management, and so on. I am trying to download a large number of Item 1Cs filed since December 15, 2023. When I parse by title, e.g. filing.find_nodes_by_title('item 1c'), is there any way to have those subtitles come through in the text? Also, please note that some Item 1Cs do not have such titles.
Also, is there any way to search the subtitles in Item 1C by index, such as [0]?
Oh this is great! I haven't spent much time building quality-of-life features for users yet, so this conversation is very useful.
Let me make sure I understand what you want:
risk management and strategy
as well as governance
Is that correct?
Yes, you got it. Right now, when I extract Item 1C using 'sec-parsers', all the text in that item comes through except the subtitles, if any. But when I extract the Item 1C text, I want it to include the subtitles as well. Thanks
Great! I'll answer your question, and then I have a follow-up for you:
SEC Parsers parses filings into XML, which you can work with by selecting filing.xml and using .getchildren()
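A minimal sketch of what walking the tree directly might look like. It assumes that section nodes carry a `title` attribute (as later examples in this thread suggest) and hold their text in the element body. The toy XML, the `section` tag name, and the helper `text_with_subtitles` are made up for illustration, and the real tree produced by sec-parsers may be laid out differently. The sketch uses the standard-library `xml.etree.ElementTree`, where iterating an element yields its children, rather than `.getchildren()`.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a parsed filing tree (hypothetical layout).
TOY_XML = """
<section title="Item 1C. Cybersecurity">
  <section title="Risk Management and Strategy">We assess risks annually.</section>
  <section title="Governance">The board oversees cybersecurity.</section>
</section>
""".strip()


def text_with_subtitles(node):
    """Return the node's text with every title (including subtitles) inlined."""
    parts = []
    # iter() walks the node itself and all descendants in document order.
    for element in node.iter():
        title = element.attrib.get("title")
        if title:
            parts.append(title)
        if element.text and element.text.strip():
            parts.append(element.text.strip())
    return "\n".join(parts)


item1c = ET.fromstring(TOY_XML)
print(text_with_subtitles(item1c))
```

The same idea should carry over to any node whose descendants expose `attrib` and `text`, but that is an assumption about the parsed tree, not something the package documents here.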
This probably isn't a satisfying answer, so I'll add a built-in method in future updates. I'm currently working on a major update, so it might be a week or so.
Do you have any more feature requests?
Thank you very much. Right now, I do not have any more feature requests. Once you update, can you please give me a knock? Thanks again for your help and for your nice package.
Will do. Glad you're enjoying the package. I'll mark this issue complete after the update.
Hi Sharif, I just updated the package.
Let me know if this works for you, and I'll close the issue. Btw, I have added you to the contributors.md.
Hi John: Thanks for your update. I run the following code -
from sec_parsers import Filing, download_sec_filing
url = 'https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
I am trying to collect Item 1C. If you go to the URL above, you can see Item 1C has two SUBSECTIONS - Risk Management and Strategy and Governance. I want ALL TEXT under Item 1C, including the two SUBSECTION titles, but you will see in my code that the final output item1c_text does not include anything from the second SUBSECTION, called Governance; the first subsection, Risk Management and Strategy, is OK.
I am not sure how to use your code - filing.get_subsections_from_section(section).
Thanks again for your help.
Good catch! In this last update I modified the xml tree construction to be faster. In the code changeover, I masked the parsing_type variable. Fixing now.
Should be fixed with the newest version of sec-parsers v0.540.
from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('John Smith','johnsmoth@example.com')
url = 'https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
subsections = sec_filing.get_subsections_from_section(item1c)
print([item.attrib['title'] for item in subsections])
Output
['Risk management and strategy', 'Governance']
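Since subsection titles vary between filings (for example "Risk Management" versus "Cybersecurity Risk Management", as noted earlier in the thread), one option is to match the returned titles against keywords case-insensitively. The sketch below is plain Python operating on a list of title strings. The helper name `match_titles` and the default keywords are illustrative assumptions, not part of sec-parsers.

```python
def match_titles(titles, keywords=("risk management", "governance")):
    """Map each keyword to the titles that contain it, ignoring case."""
    return {
        keyword: [title for title in titles if keyword in title.lower()]
        for keyword in keywords
    }


titles = ["Risk management and strategy", "Governance"]
print(match_titles(titles))
```

A keyword that maps to an empty list means that filing's Item 1C has no subsection with a matching title.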
Thanks for the response. I tried the following code -
from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('My Name','myemail@outlook.com')
url = 'https://www.sec.gov/Archives/edgar/data/1009759/000155837024009109/tmb-20230331x10k.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
item1c = sec_filing.find_section_from_title('Item 1C')
item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
subsections = sec_filing.get_subsections_from_section(item1c)
print([item.attrib['title'] for item in subsections])
and it shows the following error -
AttributeError: 'NoneType' object has no attribute 'attrib'
Cell In[98], line 1
----> 1 item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
The problem is that there is no section Item 1C in this document. In this case, I want the output None. When I put this code in a for loop, it is OK for item1c_text, as I get None, but for subsections I get the following error message -
AttributeError: 'NoneType' object has no attribute 'getchildren'
Cell In[99], line 1
----> 1 subsections = sec_filing.get_subsections_from_section(item1c)
In the for loop, I got the same error for subsections.
Second Issue -
url = 'https://www.sec.gov/Archives/edgar/data/350868/000035086824000016/iti-20240331.htm'
url = 'https://www.sec.gov/Archives/edgar/data/88948/000143774924020198/senea20240331_10k.htm'
When I use either of the above URLs in the code, item1c_text is fine, but the problem is subsections. It shows ONLY one title; the Governance one is missing, even though Item 1C has two subtitles.
I'm a bit confused by your question:
The URL listed has no Item 1C, so find_section_from_title returns None. Calling get_text_from_section or get_subsections_from_section on a None object raises an error. What behavior do you want here for None objects?
This is an interesting problem, but out of the scope of this issue. I've opened a new issue addressing the problem. In the meantime you can use get_nested_subsections_from_section to get all descendant subsections. The name is a WIP, so if you have a more descriptive name, let me know.
item1c = sec_filing.find_section_from_title('Item 1C')
nested_subsections = sec_filing.get_nested_subsections_from_section(item1c)
print([item.attrib['title'] for item in nested_subsections])
Hi, Thanks for your response.
get_subsection_title_from_section. Sorry for the confusion. Assume I have these URLs from which I would like to collect the data -
https://www.sec.gov/Archives/edgar/data/1015383/000149315224023731/form10-k.htm
https://www.sec.gov/Archives/edgar/data/1009759/000155837024009109/tmb-20230331x10k.htm
https://www.sec.gov/Archives/edgar/data/1126741/000155837024009139/gsit-20240331x10k.htm
https://www.sec.gov/Archives/edgar/data/350868/000035086824000016/iti-20240331.htm
https://www.sec.gov/Archives/edgar/data/1616262/000095017024072997/rmcf-20240229.htm
https://www.sec.gov/Archives/edgar/data/764630/000149315224023724/form10-k.htm
https://www.sec.gov/Archives/edgar/data/88948/000143774924020198/senea20240331_10k.htm
Now when I run the following for loop, it shows an error -
# Create a list to store the Item 1c text
item1c_texts = []
# Iterate over each filing
for n, filing in enumerate(filings):
    url = filing['url']
    # Download and parse the filing
    html = download_sec_filing(url)
    sec_filing = Filing(html)
    sec_filing.parse()
    # Extract the text for Item 1c
    item1c = sec_filing.find_section_from_title('Item 1C')
    subsections = sec_filing.get_nested_subsections_from_section(item1c)
    # Bypass None
    try:
        item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
        subsections_title = print([item.attrib['title'] for item in subsections])
    except:
        item1c_text = None
        subsections_title = None
    item1c_texts.append({
        'Item 1c Text': item1c_text,
        'subsections_title':subsections_title,
        'url': url
    })
# Create a DataFrame from the Item 1c text data
item1c_df = pd.DataFrame(item1c_texts)
It shows the following error -
AttributeError: 'NoneType' object has no attribute 'iterdescendants'
Cell In[94], line 13
11 # Extract the text for Item 1c
12 item1c = sec_filing.find_section_from_title('Item 1C')
---> 13 subsections = sec_filing.get_nested_subsections_from_section(item1c)
14 # Bypass None
15 try:
I think my loop is not correct, as I am not very good at Python.
No worries, we all start somewhere. The issue with your code is nonetype handling. This might help:
if item1c is not None:
    pass  # process the section here
else:
    pass  # ignore: this filing has no Item 1C
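A slightly fuller sketch of that None check, written as a helper that takes an already parsed filing object. It uses only the methods that appear earlier in this thread (`find_section_from_title`, `get_text_from_section`, `get_nested_subsections_from_section`). The helper name `extract_item1c` and the returned dictionary keys are illustrative choices. Note that it stores the list of titles directly instead of assigning the result of `print(...)`, which is always None.

```python
def extract_item1c(sec_filing):
    """Return Item 1C text and subsection titles, or None values if absent."""
    item1c = sec_filing.find_section_from_title('Item 1C')
    if item1c is None:
        # No Item 1C in this filing: skip the calls that fail on None.
        return {'Item 1c Text': None, 'subsections_title': None}

    item1c_text = sec_filing.get_text_from_section(item1c, include_title=True)
    subsections = sec_filing.get_nested_subsections_from_section(item1c)
    titles = [item.attrib['title'] for item in subsections]
    return {'Item 1c Text': item1c_text, 'subsections_title': titles}


# Usage inside the loop (requires sec_parsers and network access):
# for filing in filings:
#     sec_filing = Filing(download_sec_filing(filing['url']))
#     sec_filing.parse()
#     item1c_texts.append({**extract_item1c(sec_filing), 'url': filing['url']})
```

Checking for None before calling the section methods avoids the bare `except:` in the original loop, which would also hide unrelated errors.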
I'm going to close the issue now, but before I do, I'm curious what your project is. Could you tell me a little about it?
I am trying to collect different sections of 10K and use it for my research purpose.
Very secretive! Well, good luck with it. I'm closing the issue now, as it's been resolved.
Thanks for your help. With your loop function it works. Actually, I am collecting the items for my academic research.
That's great! One of the goals of sec-parsers is to enable academic research. Happy to feature it in the readme when it's done. Best, John
Thanks for all your help. Can I contact you again if I face any new issue?
Thanks
Sharif
Sure. Feel free to submit feature requests as well. (e.g. this function would be useful, or can you support x filing)
Thank you very much. I appreciate it. I will definitely let you know.
Warm Regards,
Sharif
Hi, when I try to parse a large number of 10-Ks, it breaks specifically on the following URL.
url = 'https://www.sec.gov/Archives/edgar/data/1793659/000179365923000010/rsi-20221231.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.parse()
The error from the above code is -
AttributeError: 'NoneType' object has no attribute 'iterative_parse'
Cell In[6], line 1
----> 1 sec_filing.parse()
Please note that this is not an issue with the for loop, as I ran the above code on its own.
Do you have any idea how it can be fixed? Thanks
Oh that's an interesting bug! The metadata for the SEC filing has a mistake: '10' instead of '10-K'. You can fix it by manually setting filing_type:
from sec_parsers import Filing,download_sec_filing,set_headers
set_headers('My Name','myemail@outlook.com')
url = 'https://www.sec.gov/Archives/edgar/data/1793659/000179365923000010/rsi-20221231.htm'
html = download_sec_filing(url)
sec_filing = Filing(html)
sec_filing.set_filing_type('10-K')
sec_filing.parse()
print(sec_filing.get_title_tree())
Btw, update your version of the package. (The code for setting filing_type had a bug that I've now fixed).
P.S. Next time, can you open a new issue to post the bug? Helps with organization.
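For bulk runs where every document is known to be a 10-K, the manual fix above could be wrapped in a fallback. This is only a sketch: it uses `parse()` and `set_filing_type()` as shown in this thread, but whether a Filing can be safely re-parsed after a failed first attempt is an assumption, and the helper name `parse_with_fallback` is hypothetical.

```python
def parse_with_fallback(sec_filing, fallback_type='10-K'):
    """Parse a filing; if the detected filing type breaks parsing, set it manually and retry."""
    try:
        sec_filing.parse()
    except AttributeError:
        # Seen in this thread when the filing metadata says '10' instead of '10-K'.
        # Assumption: setting the type and calling parse() again is safe.
        sec_filing.set_filing_type(fallback_type)
        sec_filing.parse()
    return sec_filing
```

Catching AttributeError is broad, so this fallback is only reasonable when the filing type of every document in the batch is already known.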
Thanks for your response. I will definitely open a new issue next time if there are similar kinds of issues. By the way, one last question - using sec-parsers, is there any way to get information from a 10-K such as cik, url, filing_date, reporting_date, and company name while parsing the document? To get that information I am using other tools, but I realize your module is very fast.
Not yet, but that's a good idea. Moving this thread to https://github.com/john-friedman/SEC-Parsers/issues/7
btw @msharifbd, it's probably no longer useful to you, but bulk downloading is now available using the datamule package https://github.com/john-friedman/datamule-python
Hi, is there any way to get the subtitles in the text of an Item? For example, using your module I am trying to parse Item 1C. I get everything OK, but the issue is that many Item 1Cs in 10-Ks have two subsections, each with a heading. I actually need those subsection headings (titles). For example -
https://www.sec.gov/Archives/edgar/data/1785173/000095017024024008/etnb-20231231.htm
The 10-K at the above link has Item 1C, and Item 1C has two subsection titles - Risk Management and Strategy, and Governance. When I parse using your sec-parsers, I easily get the text of Item 1C, but those two titles are not there. Is there any way to get the text along with the two subsection titles? Thanks for the nice module.