uktrade / stream-read-xbrl

Python package to parse Companies House accounts data in a streaming way
https://stream-read-xbrl.docs.trade.gov.uk/
MIT License
17 stars, 2 forks

Redirect response '301 Moved Permanently' #173

Closed (hinas-source closed 3 months ago)

hinas-source commented 3 months ago

When I run the below code:

import httpx
from stream_read_xbrl import stream_read_xbrl_zip

# A URL taken from http://download.companieshouse.gov.uk/en_accountsdata.html
if __name__ == '__main__':
    url = 'http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2023-03-02.zip'
    with \
            httpx.stream('GET', url) as r, \
            stream_read_xbrl_zip(r.iter_bytes(chunk_size=65536)) as (columns, rows):
        r.raise_for_status()
        for row in rows:
            print(row)

I am getting the below error

HTTPStatusError: Redirect response '301 Moved Permanently' for url 'http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2023-03-02.zip'
Redirect location: 'https://download.companieshouse.gov.uk/Accounts_Bulk_Data-2023-03-02.zip'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301

michalc commented 3 months ago

Hi @hinas-source,

It looks like you're fetching from HTTP, but the server is returning a redirect to HTTPS, and httpx does not follow redirects by default. So you have 2 choices.

You could fetch from HTTPS in the first place, as in this:

import httpx
from stream_read_xbrl import stream_read_xbrl_zip

if __name__ == '__main__':
    url = 'https://download.companieshouse.gov.uk/Accounts_Bulk_Data-2024-03-26.zip'
    with httpx.stream('GET', url) as r:
        r.raise_for_status()
        with stream_read_xbrl_zip(r.iter_bytes(chunk_size=65536)) as (columns, rows):
            for row in rows:
                print(row)

Or, you can configure httpx to follow the redirect, as in this:

import httpx
from stream_read_xbrl import stream_read_xbrl_zip

if __name__ == '__main__':
    url = 'http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2024-03-26.zip'
    with httpx.stream('GET', url, follow_redirects=True) as r:
        r.raise_for_status()
        with stream_read_xbrl_zip(r.iter_bytes(chunk_size=65536)) as (columns, rows):
            for row in rows:
                print(row)

My recommendation is the first of these: from a security point of view it's better to always use HTTPS if you can, and it also avoids the redirect, which is a (very small) time saving.

(Neither of these uses the exact same URL as in your question, as I think it no longer exists: it returns a 404.)
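If the URLs come from a list that may still contain `http://` links, the "use HTTPS in the first place" option can be applied before the request is made. The following is only a minimal sketch using the standard library; `force_https` is a hypothetical helper name, not part of httpx or stream-read-xbrl, and it assumes the server serves the same path over HTTPS.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with an http scheme swapped for https.

    URLs with any other scheme are returned unchanged. An explicit port
    in the URL is left as-is, so this only suits default-port URLs.
    """
    parts = urlsplit(url)
    if parts.scheme != 'http':
        return url
    return urlunsplit(parts._replace(scheme='https'))


if __name__ == '__main__':
    print(force_https('http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2024-03-26.zip'))
```

The result could then be passed as `url` to `httpx.stream('GET', url)`, so no redirect needs to be followed.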

hinas-source commented 3 months ago

Thank you for your help @michalc

michalc commented 3 months ago

No problem!

hinas-source commented 3 months ago

---------------------------------------------------------------------------
UnexpectedSignatureError                  Traceback (most recent call last)
Cell In[5], line 9
      5 url = f"https://download.companieshouse.gov.uk/Accounts_Bulk_Data-2024-01-20.zip"
      6 with \
      7         httpx.stream('GET', url) as r, \
      8         stream_read_xbrl_zip(r.iter_bytes(chunk_size=65536)) as (columns, rows):
----> 9     df = pd.DataFrame(rows, columns=columns)
     10     if isinstance(df, pd.DataFrame):
     11         df1 = df

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\frame.py:832, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    830     data = np.asarray(data)
    831 else:
--> 832     data = list(data)
    833 if len(data) > 0:
    834     if is_dataclass(data[0]):

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\stream_read_xbrl.py:556, in <genexpr>(.0)
    553             yield queue.popleft().result()
    555 with ProcessPoolExecutor(max_workers=num_workers) as executor:
--> 556     yield _COLUMNS, (
    557         row + (zip_url,)
    558         for results in imap(executor, _xbrl_to_rows, ((name.decode(), b''.join(chunks)) for name, _, chunks in stream_unzip(zip_bytes_iter)))
    559         for row in results
    560     )

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\stream_read_xbrl.py:546, in stream_read_xbrl_zip.<locals>.imap(executor, func, param_iterables)
    545 def imap(executor, func, param_iterables):
--> 546     for params in param_iterables:
    547         if len(queue) == num_workers:
    548             yield queue.popleft().result()

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\stream_read_xbrl.py:558, in <genexpr>(.0)
    553             yield queue.popleft().result()
    555 with ProcessPoolExecutor(max_workers=num_workers) as executor:
    556     yield _COLUMNS, (
    557         row + (zip_url,)
--> 558         for results in imap(executor, _xbrl_to_rows, ((name.decode(), b''.join(chunks)) for name, _, chunks in stream_unzip(zip_bytes_iter)))
    559         for row in results
    560     )

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\stream_unzip.py:460, in stream_unzip(zipfile_chunks, password, chunk_size, allow_zip64)
    457     else:
    458         raise UnexpectedSignatureError(signature)
--> 460 for file_name, file_size, unzipped_chunks in all():
    461     yield file_name, file_size, unzipped_chunks
    462     for _ in unzipped_chunks:

File c:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\stream_unzip.py:458, in stream_unzip.<locals>.all()
    456         break
    457 else:
--> 458     raise UnexpectedSignatureError(signature)

UnexpectedSignatureError: b'<htm'

I am getting this error

michalc commented 3 months ago

@hinas-source This seems like a different issue - can you raise a new issue at https://github.com/uktrade/stream-read-xbrl/issues?
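The cause of the second error is not diagnosed in this thread. One possible reading of the signature `b'<htm'` is that the server returned an HTML page rather than a ZIP file, and the lines shown in the traceback pass the response straight to `stream_read_xbrl_zip` without `r.raise_for_status()`. As a sketch only, under that assumption, the byte stream could be checked before it reaches the parser. `ensure_zip` is a hypothetical helper written for illustration, not part of httpx, stream-unzip or stream-read-xbrl.

```python
# Signature of a ZIP local file header, which starts a non-empty ZIP archive
ZIP_LOCAL_FILE_HEADER = b'PK\x03\x04'


def ensure_zip(chunks):
    """Yield the bytes from chunks unchanged, failing early on a non-ZIP start."""
    chunks = iter(chunks)
    head = b''
    # Accumulate until there are enough bytes to compare against the signature
    for chunk in chunks:
        head += chunk
        if len(head) >= len(ZIP_LOCAL_FILE_HEADER):
            break
    if not head.startswith(ZIP_LOCAL_FILE_HEADER):
        raise ValueError(f'Response does not look like a ZIP file, starts with {head[:16]!r}')
    # Pass on what was buffered, then the rest of the stream
    yield head
    yield from chunks
```

In the examples above this would wrap the iterable, as in `stream_read_xbrl_zip(ensure_zip(r.iter_bytes(chunk_size=65536)))`, with `r.raise_for_status()` still called first so that HTTP errors are reported as such.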