leoliu0 / cik-cusip-mapping

Provide CIK-to-CUSIP links using 13G and 13D filings

Error #6

Closed vicn1222 closed 2 years ago

vicn1222 commented 2 years ago

Hi,

Great work! Thanks

I downloaded your code to test it, but it fails.

I modified dl_idx.py so it only downloads 2004, like this:

for year in range(2004, 2005):

It produced more than 500,000 lines in full_index.csv. I deleted 450,000 lines so it only has about 50,000 filings for testing.

I then run "python dl.py 13G 13G", it download all html file in 13G folder, such as 1007853_2004-03-23.html. It doesn't download the form 15G filing. When I run "python parse_cusip.py 13G", it product 13G.csv like below:

13G/2004_02/1000045_2004-02-26.html,,
13G/2004_02/1000097_2004-02-03.html,,
13G/2004_02/1000180_2004-02-13.html,,
13G/2004_02/1000180_2004-02-11.html,,
13G/2004_02/1000209_2004-02-04.html,,
13G/2004_02/1000209_2004-02-12.html,,
13G/2004_02/1000209_2004-02-13.html,,
13G/2004_02/1000227_2004-02-12.html,,
13G/2004_02/1000227_2004-02-13.html,,

There are no CIKs, CUSIPs, etc.

What am I doing wrong?

Thank you!

vicn1222 commented 2 years ago

Ok,

I deleted all the html files in the 13G folder and manually downloaded one filing from https://www.sec.gov/Archives/edgar/data/1759008/000139594221000083/0001395942-21-000083-index.html.

The filing file I downloaded is https://www.sec.gov/Archives/edgar/data/1395942/000139594221000083/0001395942-21-000083.txt

I then run "python parse_cusip.py 13G".

The CSV output is 13G/2004_01/0001395942-21-000083.txt,0001759008,8096c3403

The CUSIP is totally wrong. It should be 1142552108.

I think your regexp for the CUSIP is not correct...

Also, the 9th digit (the check digit) should be validated to ensure the CUSIP is valid.
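For reference, the standard CUSIP check-digit computation looks roughly like this (a minimal sketch; the function name is mine, not part of this repo):

def cusip_checksum_ok(cusip: str) -> bool:
    """Return True if a 9-character CUSIP passes the standard check-digit test."""
    cusip = cusip.strip().upper()
    if len(cusip) != 9 or not cusip[8].isdigit():
        return False
    total = 0
    for i, ch in enumerate(cusip[:8]):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch) - ord('A') + 10
        elif ch == '*':
            v = 36
        elif ch == '@':
            v = 37
        elif ch == '#':
            v = 38
        else:
            return False
        if i % 2 == 1:              # double every second character (2nd, 4th, ...)
            v *= 2
        total += v // 10 + v % 10   # add the digits of v
    return (10 - total % 10) % 10 == int(cusip[8])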

leoliu0 commented 2 years ago

Thanks for the issue. It seems the example you gave has a 10-digit CUSIP, which is invalid.

The script can only take a best guess at the CUSIP, so it does not guarantee 100% accuracy. One should check whether each CUSIP exists in another database and discard the ones that don't.
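For example, a minimal sketch of that cross-check with pandas (the file names are hypothetical; 13G.csv is assumed to have the three columns filename, cik, cusip with no header, as in the output above, and valid_cusips.csv is whatever reference list of known-good CUSIPs you have on hand):

import pandas as pd

# Hypothetical file names: 13G.csv is the parser output shown above
# (filename, cik, cusip, no header); valid_cusips.csv is a reference list
# of known-good CUSIPs, e.g. exported from another security database.
parsed = pd.read_csv('13G.csv', names=['filename', 'cik', 'cusip'], dtype=str)
reference = pd.read_csv('valid_cusips.csv', dtype=str)   # one column: cusip

# Keep only rows whose parsed CUSIP appears in the reference list.
checked = parsed.merge(reference[['cusip']].drop_duplicates(), on='cusip')
checked.to_csv('13G_checked.csv', index=False)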

I will add the 9-digit checksum. It only helps a little because not many CUSIPs here are 9 digits.

I will check the download script.

vicn1222 commented 2 years ago

This one, which has 9 digits, didn't parse either: https://www.sec.gov/Archives/edgar/data/1337013/000125110921000004/0001251109-21-000004-index.html

Is it because the format is HTML?

leoliu0 commented 2 years ago

Yeah, it is because of the HTML. There is just too much junk in the HTML tags, and it messes up the parsing. However, the program should be able to handle HTML, since all the txt files are essentially HTML after 2011. I wrote parse_cusip_html.py to put more effort into cleaning up the HTML. Try that (at least it parses your example file correctly). Currently, I don't have a gauge on the impact of this fix on other files, so it is a separate script. When I have time, I will run it over the entire 13G/F universe to see whether it works properly. If so, it should replace parse_cusip.py.
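The core idea is roughly the following sketch (illustrative only, NOT the actual parse_cusip_html.py; the function name and regex are just for demonstration): strip the tags first, then search the remaining plain text for a CUSIP-shaped token next to the word "CUSIP".

import re
from bs4 import BeautifulSoup

# Illustrative sketch only -- not the repo's parse_cusip_html.py.
# Strip the HTML tags, then look for a CUSIP-shaped token near "CUSIP".
CUSIP_RE = re.compile(
    r'CUSIP(?:\s*(?:NO\.?|NUMBER))?[\s:#]*([0-9A-Z]{6}\s?[0-9A-Z]{2}\s?[0-9]?)',
    re.IGNORECASE,
)

def guess_cusip_from_html(path):
    with open(path, errors='ignore') as fh:
        text = BeautifulSoup(fh.read(), 'html.parser').get_text(' ')
    m = CUSIP_RE.search(text)
    return m.group(1).replace(' ', '') if m else None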

vicn1222 commented 2 years ago

Thanks! Excellent work!

vicn1222 commented 2 years ago

I don't want you to waste your time. Those 13D/G filings are really in random formats...

leoliu0 commented 2 years ago

It’s fine. Please keep feeding me examples to improve the code if you want. It can only get better. Thank you for helping me make it better.