Closed jamiehannaford closed 3 years ago
Thank you for your issue!
No, the HTTP cache is not optional at this time. Even if you have downloaded the instance file and/or the files from the extension taxonomy, the parser must also download all the taxonomies (and their files) that are imported by the XBRL instance file.
For submissions from the SEC this includes, for example, the US-GAAP, DEI, and SRT taxonomies.
These standard taxonomies can be pretty huge (e.g., US-GAAP 2020 is about 18 MB of XML files), so caching is required when parsing multiple submissions: you don't want to download the same standard taxonomy again and again for each of your 1000 submissions.
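The download-once behaviour described above can be sketched roughly like this. This is a simplified illustration, not the library's actual `HttpCache` implementation, and the cache layout (mirroring the URL path under the cache directory) is an assumption:

```python
import os
import urllib.request

def cache_file(cache_dir: str, url: str) -> str:
    """Return a local path for url, downloading only on first access."""
    # Mirror the URL path under the cache directory (assumed layout).
    relative = url.replace('https://', '').replace('http://', '')
    local_path = os.path.join(cache_dir, relative)
    if not os.path.exists(local_path):
        # Only hit the network if the file was never fetched before.
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

With this scheme, 1000 submissions that all import US-GAAP 2020 trigger exactly one download of that taxonomy.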
I got your example running with the following code:
from xbrl_parser.instance import parse_xbrl
from xbrl_parser.cache import HttpCache
import logging

logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache/')

# parse from path
instance_path = './data/TSLA/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, cache, 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/')
Currently you have to pass the base URL of the submission because the taxonomy schema is imported with a relative path in the instance file, e.g.:
<link:schemaRef xlink:href="./tsla-20201231.xsd" xlink:type="simple"/>
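This is standard URL resolution: the relative `xlink:href` is joined onto the submission's base URL, which the stdlib `urljoin` does directly:

```python
from urllib.parse import urljoin

# Base URL of the SEC submission directory (must end with a slash).
base_url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/'
# Relative schemaRef taken from the instance file.
schema_ref = './tsla-20201231.xsd'

resolved = urljoin(base_url, schema_ref)
# resolved == 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-20201231.xsd'
```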
But you are correct, this is very inconvenient if you have already downloaded the files of the extension taxonomy. The parser should at least try to find the schema file in the current directory or in the directory of the instance file you want to parse.
I will implement this in the next few days.
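The planned local-first fallback could look roughly like this. The function name and signature are hypothetical, not the library's actual API:

```python
import os
from typing import Optional
from urllib.parse import urljoin

def locate_schema(instance_path: str, schema_ref: str,
                  base_url: Optional[str] = None) -> str:
    """Resolve a schemaRef, preferring a local file next to the instance."""
    # 1) Try the schema relative to the instance file's directory.
    candidate = os.path.normpath(
        os.path.join(os.path.dirname(instance_path), schema_ref))
    if os.path.isfile(candidate):
        return candidate
    # 2) Otherwise fall back to resolving against the submission base URL.
    if base_url is None:
        raise FileNotFoundError(
            f'{schema_ref} not found locally and no base URL was given')
    return urljoin(base_url, schema_ref)
```

With this kind of fallback, passing the base URL would only be needed when the extension taxonomy files are not on disk.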
> The parser should at least try to find the schema file in the current directory or the instance file you want to parse.
Awesome, thank you. It'd be great if the parser could find the schema locally. Great job on the project, I'm finding it super helpful!
Will do some further testing and documentation and then upload a new package version to PyPI in the next 2-3 days.
It should now work with the new package version 1.2.0. I used the following code to get your example running:
from xbrl_parser.instance import parse_xbrl
from xbrl_parser.cache import HttpCache
import logging
logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./../cache/')
# cache.set_headers({'From': '', 'User-Agent': 'py-xbrl/1.1.4'})
# parse from path
instance_path = './data/TSLA/10-k/20201231/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, cache)
print(inst1)
I also tested it on ~100 other SEC EDGAR submissions, both XBRL and iXBRL, and it worked pretty reliably. Nevertheless, I'd appreciate feedback on whether it works for you.
Thanks @manusimidt. I'll try using this weekend and reopen if I have any issues.
I have the following files:
But when I try to load one, it fails:
If I leave out `instance_url`, I get this error instead:

It seems that if you use the HTTP cache, it tries to load everything from there and raises a fatal error if files aren't found. My understanding was that it'd be optional.