manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

Doesn't seem to work with local XSD files #8

Closed: jamiehannaford closed this issue 3 years ago

jamiehannaford commented 3 years ago

I have the following files:

$ ls data/TSLA/10-k/20201231/
tsla-10k_20201231_htm.xml tsla-20201231_cal.xml     tsla-20201231_lab.xml
tsla-20201231.xsd         tsla-20201231_def.xml     tsla-20201231_pre.xml

But when I try to load one, it fails:

from xbrl_parser.instance import parse_xbrl, parse_xbrl_url, XbrlInstance
from xbrl_parser.cache import HttpCache
import logging
logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache/')

# parse from path
instance_path = './data/TSLA/10-k/20201231/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, cache, './data/TSLA/10-k/20201231')
Traceback (most recent call last):
  File "./test.py", line 10, in <module>
    inst1 = parse_xbrl(instance_path, cache, './data/TSLA/10-k/20201231')
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/instance.py", line 281, in parse_xbrl
    taxonomy: TaxonomySchema = parse_taxonomy(cache, schema_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/taxonomy.py", line 202, in parse_taxonomy
    schema_path: str = cache.cache_file(schema_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/cache.py", line 75, in cache_file
    query_response = requests.get(file_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
    p.prepare(
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/models.py", line 390, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL './data/TSLA/10-k/20201231/tsla-20201231.xsd': No schema supplied. Perhaps you meant http://./data/TSLA/10-k/20201231/tsla-20201231.xsd?
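The MissingSchema error comes from requests itself: it only accepts absolute URLs that carry a scheme such as http:// or https://, and a plain filesystem path has none, so the URL is rejected before any request is made. A minimal sketch of that distinction using only the standard library (the helper name and the example URL are hypothetical):

```python
from urllib.parse import urlparse

def is_remote_url(ref: str) -> bool:
    # requests needs an absolute URL with a scheme; a filesystem path
    # like './data/.../tsla-20201231.xsd' has none and is rejected
    # with MissingSchema before any network request happens.
    return urlparse(ref).scheme in ('http', 'https')

print(is_remote_url('https://www.sec.gov/example/tsla-20201231.xsd'))  # True
print(is_remote_url('./data/TSLA/10-k/20201231/tsla-20201231.xsd'))   # False
```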

If I leave out instance_url I get this error instead:

Traceback (most recent call last):
  File "./test.py", line 10, in <module>
    inst1 = parse_xbrl(instance_path, cache)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/instance.py", line 276, in parse_xbrl
    schema_url = resolve_uri(instance_url, schema_uri)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/helper/uri_resolver.py", line 23, in resolve_uri
    if '.' in dir_uri.split('/')[-1]:
AttributeError: 'NoneType' object has no attribute 'split'

It seems that if you use the HTTP cache, the parser tries to load every referenced file through it and raises a fatal error when a file can't be fetched. My understanding was that the cache would be optional.

manusimidt commented 3 years ago

Thank you for your issue!

No, the HTTP cache is not optional at this time. Even if you have already downloaded the instance file and/or the files of the extension taxonomy, the parser must still download all taxonomies (and their files) that are imported by the XBRL instance file.

For submissions from the SEC this includes, for example, the US-GAAP taxonomy, the DEI taxonomy, and the SRT taxonomy.

These standard taxonomies can be pretty huge (e.g. US-GAAP 2020 consists of about 18 MB of XML files), so caching is required when parsing multiple taxonomies: you don't want to download the same standard taxonomy again and again for each of your 1000 submissions.

I got your example running with the following code:

logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache/')

# parse from path
instance_path = './data/TSLA/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, cache, 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/')

Currently you have to provide the base URL of the submission because the taxonomy schema is referenced with a relative path in the instance file, e.g.:

<link:schemaRef xlink:href="./tsla-20201231.xsd" xlink:type="simple"/>
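The fallback for such a relative href can be sketched as: if the href carries no URL scheme, resolve it against the directory of the instance document instead of handing it to the HTTP cache. A hypothetical helper illustrating the idea (not py-xbrl's actual code; the function name is an assumption):

```python
import os
from urllib.parse import urlparse

def resolve_schema_ref(instance_path: str, schema_href: str) -> str:
    """Resolve a schemaRef href against the instance file's location."""
    if urlparse(schema_href).scheme in ('http', 'https'):
        return schema_href  # absolute URL: let the HTTP cache fetch it
    # relative href like './tsla-20201231.xsd': look next to the instance file
    return os.path.normpath(
        os.path.join(os.path.dirname(instance_path), schema_href))

print(resolve_schema_ref(
    './data/TSLA/10-k/20201231/tsla-10k_20201231_htm.xml',
    './tsla-20201231.xsd'))
```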

But you are correct, this is very inconvenient if you have already downloaded the files of the extension taxonomy. The parser should at least try to find the schema file in the directory of the instance file you want to parse.

I will implement this in the next few days.

jamiehannaford commented 3 years ago

The parser should at least try to find the schema file in the directory of the instance file you want to parse.

Awesome, thank you. It'd be great if the parser could find the schema locally. Great job on the project, I'm finding it super helpful!

manusimidt commented 3 years ago

Will do some further testing and documentation and then upload a new package version to PyPI in the next 2-3 days.

manusimidt commented 3 years ago

It should now work with the new package version 1.2.0. I used the following code to get your example running:


from xbrl_parser.instance import parse_xbrl
from xbrl_parser.cache import HttpCache
import logging
logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./../cache/')
# cache.set_headers({'From': '', 'User-Agent': 'py-xbrl/1.1.4'})

# parse from path
instance_path = './data/TSLA/10-k/20201231/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, cache)
print(inst1)

I also tested on ~100 other SEC EDGAR submissions, both XBRL and iXBRL, and it worked pretty reliably. Nevertheless, I would be happy to get your feedback on whether it works for you.


jamiehannaford commented 3 years ago

Thanks @manusimidt. I'll try using it this weekend and reopen if I have any issues.