Add find_entry_file for filling dir

mrx23dot commented 2 years ago

Usage:

zip_url = 'https://www.sec.gov/Archives/edgar/data/0001723128/000172312821000032/0001723128-21-000032-xbrl.zip'
filling_dir = cache.cache_edgar_enclosure(zip_url)
entryFile = cache.find_entry_file(filling_dir)
print(entryFile)

manusimidt commented 2 years ago

Hey, thanks for your suggestions! I am always happy when someone shares contributions to py-xbrl. 😄

As I understand it, you want to use the find_entry_file() method to find the instance document inside a zip enclosure of SEC Edgar, right?

I am a little bit surprised that the array of valid file extensions contains .xsd. Please be aware that while the Instance Document can have many file extensions, it will never have .xsd. .xsd is the one and only file extension used for XML Schema documents. In the case of XBRL these documents are taxonomy schemas.

Here is an example: SEC Edgar (USA) has XBRL and iXBRL documents that have either the .xml or the .htm extension. Companies House (UK) mainly has iXBRL documents that have the extension .html. DART (South Korea) mainly has XBRL filings that have the extension .xbrl. But all share the same extension for taxonomy schemas: .xsd.
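To make the extension point concrete, here is a minimal sketch that sorts file names into taxonomy schemas, possible instance documents, and everything else. The function name `classify_by_extension` and the constant `INSTANCE_EXTENSIONS` are hypothetical and not part of py-xbrl; the extension set only covers the regulators mentioned above and is an assumption, not a complete list.

```python
from pathlib import Path

# Extensions seen on instance documents in the examples above (assumption:
# other regulators may use further extensions not listed here).
INSTANCE_EXTENSIONS = {".xml", ".htm", ".html", ".xbrl"}


def classify_by_extension(filename: str) -> str:
    """Rough classification based on the file extension only."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xsd":
        # .xsd is reserved for XML Schema documents, i.e. taxonomy schemas.
        return "taxonomy schema"
    if suffix in INSTANCE_EXTENSIONS:
        return "possible instance document"
    return "other"


for name in ("report.xsd", "report.htm", "report.xbrl", "logo.jpg"):
    print(name, "->", classify_by_extension(name))
```

The extension can only rule a file out. A file with a matching extension might still be something other than an instance document, so this is at best a first filter.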

Additionally, the structure of the folders that contain the Instance Documents varies greatly because there are thousands of different institutions and regulators that use the XBRL standard. The goal of py-xbrl is to provide a library that can handle all kinds of XBRL documents. Therefore I have deliberately not included any code that only refers to SEC Edgar.

I am aware that SEC Edgar is a very popular source for XBRL documents, but py-xbrl is fortunately also used for XBRL documents from other sources, for example to parse non-public submissions from the Dutch tax authority or public submissions from the UK Companies House.

Therefore I would always suggest adding only code to this library that is generally applicable to all XBRL files. If we want to implement some type of function that checks whether a given file is a valid Instance Document, we have to look into the file and check whether it contains the necessary elements of an Instance Document.
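A content-based check along these lines could look roughly like the sketch below. It is not part of py-xbrl, and the function name `looks_like_instance_document` is hypothetical. It assumes that a plain XBRL instance has an `xbrl` root element in the XBRL 2.1 instance namespace, and that an iXBRL document declares one of the Inline XBRL namespaces somewhere in the file. It also assumes the file is well-formed XML/XHTML; anything else is rejected.

```python
import xml.etree.ElementTree as ET

# Root element of a plain XBRL 2.1 instance document.
XBRL_ROOT_TAG = "{http://www.xbrl.org/2003/instance}xbrl"
# Namespaces of Inline XBRL 1.0 and 1.1.
INLINE_XBRL_NAMESPACES = {
    "http://www.xbrl.org/2008/inlineXBRL",
    "http://www.xbrl.org/2013/inlineXBRL",
}


def looks_like_instance_document(path: str) -> bool:
    """Heuristic: inspect the content of a file instead of its extension."""
    root_seen = False
    try:
        with open(path, "rb") as handle:
            events = ET.iterparse(handle, events=("start", "start-ns"))
            for event, payload in events:
                if event == "start-ns":
                    # payload is a (prefix, uri) tuple
                    if payload[1] in INLINE_XBRL_NAMESPACES:
                        return True  # iXBRL namespace declared
                elif not root_seen:
                    # first "start" event is the root element
                    root_seen = True
                    if payload.tag == XBRL_ROOT_TAG:
                        return True  # plain XBRL instance
    except (ET.ParseError, OSError):
        # Not well-formed XML or not readable: treat as "not an instance".
        return False
    return False
```

A taxonomy schema usually declares the instance namespace as a prefix too, but its root element is a schema element, so the root check above does not match it. This only confirms that a file looks like an instance document; it does not validate it.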

mrx23dot commented 2 years ago

Yeah, the goal was to find the entry file in an unzipped (SEC) filing.

The filter was needed because there are other file types that are not included in other files. I could include the other types as well and choose the biggest file; otherwise it's a no-go. Maybe it could return a list of files in decreasing probability.
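The "list of files in decreasing probability" idea could be sketched as follows, using file size as the only ranking signal. This is an illustration of the heuristic described in this comment, not the code from the pull request; the function name `rank_entry_candidates` and the default extension tuple are assumptions.

```python
from pathlib import Path


def rank_entry_candidates(filing_dir, extensions=(".xml", ".htm", ".html", ".xbrl")):
    """Return candidate files in filing_dir, largest first.

    File size is only a rough proxy for "most likely entry file".
    """
    candidates = [
        path
        for path in Path(filing_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    ]
    # Sort by size in bytes, biggest file first.
    return sorted(candidates, key=lambda path: path.stat().st_size, reverse=True)
```

Returning the whole ranked list rather than a single path lets the caller fall back to the next candidate when the first one turns out not to be an instance document.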

It's funny that it's a very complex data structure, which doesn't describe an entry point :D

manusimidt / py-xbrl

Add find_entry_file for filling dir #61