oasis-open / cti-taxii-client

OASIS TC Open Repository: TAXII 2 Client Library Written in Python
https://taxii2client.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

When returning an empty JSON, '{}' turns into a Chinese character #111

Open Ni-Knight opened 1 year ago

Ni-Knight commented 1 year ago

When receiving an empty response from a server, the string '{}' is somewhere translated to Unicode, so:

"{" = U+007B
"}" = U+007D

Those are somewhere concatenated to return: "筽" = U+7B7D
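The concatenation described above can be checked directly in Python. This is only a minimal sketch: it assumes, without the report establishing it, that the two bytes are being read as a single big-endian 16-bit code unit. The variable names are illustrative.

```python
# The two bytes of an empty JSON object.
raw = b"{}"
assert raw == bytes([0x7B, 0x7D])

# Decoded as UTF-8 (or ASCII), each byte is its own character.
as_utf8 = raw.decode("utf-8")

# Decoded as UTF-16 big-endian, the two bytes form one code unit: 0x7B7D.
as_utf16_be = raw.decode("utf-16-be")

print(repr(as_utf8), repr(as_utf16_be), hex(ord(as_utf16_be)))
```

If this assumption holds, the character is not produced by any translation step. It is the same two bytes interpreted under a different encoding.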

To reproduce, just send a query that returns an empty response from a TAXII server: curl and Postman return '{}', but taxii-client returns '筽'.

For example:

2022-08-15T11:21:48.17891632Z info: (TAXII 2 Feed test_instance_1_TAXII 2 Feed test_taxii2-get-indicators) python logging: DEBUG [urllib3.connectionpool] - https://ais2.cisa.dhs.gov:443 "GET /public/collections/---/objects/?limit=25&match%5Btype%5D=campaign HTTP/1.1" 200 2
2022-08-15T11:21:48.180585842Z debug: (TAXII 2 Feed test_instance_1_TAXII 2 Feed test_taxii2-get-indicators) GOT RESPONSE resp.content=b'{}' resp.text='筽' resp.status_code=200 resp.headers={'x-transaction-id': '124a663c-e7c5-48c0-a4ba-6fff95cab122', 'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains', 'Date': 'Mon, 15 Aug 2022 11:21:47 GMT', 'Keep-Alive': 'timeout=60', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'X-Frame-Options': 'DENY', 'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Connection': 'keep-alive'}

chisholm commented 1 year ago

I am not familiar with 'pack', but if the problem seems specific to that tool, perhaps that tool is misunderstanding the textual encoding of the response? If that were the case though, it seems like it ought to misunderstand the encoding regardless of response content. It wouldn't be specific to "empty" responses.

Ni-Knight commented 1 year ago

@chisholm, by "pack" I meant this package, i.e., taxii-client.

chisholm commented 1 year ago

I'm not sure what you mean by taxii-client "returning" something. It's a library with classes and methods, and some methods do return things. It's not clear where the line you quoted came from (looks like a line of logging?). Can you provide a small code sample to reproduce the error?

I tried my own experiment which would produce an empty result, where I enabled a simple logging config to see what logging would get printed out, to compare to your output. It was run against the Medallion server:

import logging
import taxii2client

logging.basicConfig(level="DEBUG")

coll = taxii2client.Collection(
    "http://127.0.0.1:5000/trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/",
    user="(user)", password="(password)"
)

envelope = coll.get_objects(
    type="foo"
)

print(envelope)

Notice I had to add my own print statement. The library has some error logging, but doesn't automatically log all of the HTTP responses.

I got:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1:5000
DEBUG:urllib3.connectionpool:http://127.0.0.1:5000 "GET /trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/ HTTP/1.1" 200 254
DEBUG:urllib3.connectionpool:Resetting dropped connection: 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:5000 "GET /trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/objects/?match%5Btype%5D=foo HTTP/1.1" 200 2
{}

The first four lines are logging output; the last line is my print statement. It shows the use of the taxii-client API to send a request to a TAXII server. There is no Chinese in the output.

Ni-Knight commented 1 year ago

Thank you for the reply. I'll try to recreate it and update (I can't use the server where we first saw it, as we don't have credentials for it). Maybe it's something about the CISA server that causes the weird character.

BEAdi commented 1 year ago

Hi @chisholm, I am working with @Ni-Knight and wanted to share what we did.

> It's not clear where the line you quoted came from (looks like a line of logging?). Can you provide a small code sample to reproduce the error?

When we use code equivalent to the code you posted, we get an InvalidJSONError. To see the content of the response, I changed our code to call the following method instead of the collection's get_objects method. As you can see, it works much like get_objects and uses the library's own methods.

def v21_get_objects(self, accept="application/taxii+json;version=2.1", **filter_kwargs):
    collection = self.collection_to_fetch
    collection._verify_can_read()
    query_params = _filter_kwargs_to_query_params(filter_kwargs)
    merged_headers = collection._conn._merge_headers({"Accept": accept, "Content-Type": "application/taxii+json"})

    # Send the request through the library's own session so we can inspect the raw response.
    resp = collection._conn.session.get(collection.objects_url, headers=merged_headers, params=query_params)
    print(f'GOT RESPONSE {resp.content=} {resp.text=} {resp.status_code=} {resp.headers=}')
    if len(resp.text) <= len('{}'):  # a body no longer than '{}' cannot hold any objects; treat it as empty
        return {}

    return _to_json(resp)

We tried to reproduce it on another server, but that server returns {}\n rather than just {}. Maybe that is also the case you checked, and without the trailing \n the problem would reproduce for you?

chisholm commented 1 year ago

Using your code, slightly modified as:

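import taxii2client
# Private helpers used below; assumed to be importable from taxii2client.common.
from taxii2client.common import _filter_kwargs_to_query_params, _to_json

def v21_get_objects(collection, accept="application/taxii+json;version=2.1", **filter_kwargs):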
    collection._verify_can_read()
    query_params = _filter_kwargs_to_query_params(filter_kwargs)
    merged_headers = collection._conn._merge_headers({"Accept": accept, "Content-Type": "application/taxii+json"})

    resp = collection._conn.session.get(collection.objects_url, headers=merged_headers, params=query_params)
    print(f'GOT RESPONSE {resp.content=} {resp.text=} {resp.status_code=} {resp.headers=} {resp.encoding=}')
    if len(resp.text) <= len('{}'):  # in case it is not a json that has to have {}
        return {}

    return _to_json(resp)

coll = taxii2client.Collection(
    "http://127.0.0.1:5000/trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/",
    user="(user)", password="(password)"
)

v21_get_objects(coll, type="foo")

Run against the Medallion server, I get as output:

GOT RESPONSE resp.content=b'{}' resp.text='{}' resp.status_code=200 resp.headers={'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Server': 'Werkzeug/2.0.2 Python/3.9.13', 'Date': 'Thu, 08 Dec 2022 01:15:49 GMT'} resp.encoding=None

Again, you can see there is no Chinese.

The (Chinese) text you see comes from the resp.text code fragment. That is invoking the requests library's decoding logic, including figuring out encodings. As documented, it makes "educated guesses" at the encoding. Maybe in your case, it guessed wrong? The linked docs say you can show the encoding it is using via resp.encoding, and I added that to the code to see what it would show me. It just shows None for me, so maybe not very informative. I wonder if it would show you something else?

The TAXII 2.1 spec looks to require implementers to use UTF-8.
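If the body is UTF-8, as the spec seems to require, a caller can decode it explicitly instead of relying on a guessed encoding. The following is a minimal sketch only: `parse_taxii_body` is a hypothetical helper, not part of taxii2client or requests, and it operates on raw bytes such as `resp.content`.

```python
import json


def parse_taxii_body(content: bytes) -> dict:
    """Decode a response body as UTF-8 and parse it as JSON.

    Hypothetical helper for illustration; assumes the server sends UTF-8.
    """
    return json.loads(content.decode("utf-8"))


# The two-byte body from the report parses to an empty envelope.
envelope = parse_taxii_body(b"{}")
print(envelope)
```

With requests, assigning `resp.encoding = "utf-8"` before reading `resp.text` should have a similar effect, since an explicitly set encoding is used in place of detection.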

BEAdi commented 1 year ago

We added the encoding, and it also shows None.

2022-12-15T07:39:23.616361458Z info: (DHS Feed v2_instance_1_DHS Feed v2_dhs-get-indicators) python logging: DEBUG [urllib3.connectionpool] - https://ais2.cisa.dhs.gov:443 "GET /public/collections/a6313101-fa6c-4276-bb96-7e826f0b248a/objects/?limit=10&added_after=2022-12-14T07%3A39%3A23.038316Z HTTP/1.1" 200 2
2022-12-15T07:39:23.618157698Z debug: (DHS Feed v2_instance_1_DHS Feed v2_dhs-get-indicators) resp.content=b'{}' resp.text='筽' resp.status_code=200 resp.headers={'x-transaction-id': '05608c51-ddf4-4f9c-851f-38f5d3c9b546', 'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains', 'Date': 'Thu, 15 Dec 2022 07:39:23 GMT', 'Keep-Alive': 'timeout=60', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'X-Frame-Options': 'DENY', 'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Connection': 'keep-alive'} resp.encoding=None

Weird that in your case it guesses right, and in ours it guesses Chinese.

chisholm commented 1 year ago

Checking the requests implementation, looks like if resp.encoding is None, it falls back to resp.apparent_encoding. I think the latter is what triggers the encoding "guess". If I add a print out of that, I get:

GOT RESPONSE resp.content=b'{}' resp.text='{}' resp.status_code=200 resp.headers={'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Server': 'Werkzeug/2.0.2 Python/3.9.13', 'Date': 'Sat, 17 Dec 2022 02:57:08 GMT'} resp.encoding=None resp.apparent_encoding='ascii'

And that shows "ascii" for me. Maybe that will show a Chinese encoding for you.
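The fallback described above can be modeled with the standard library alone. This is a simplified sketch, not the requests implementation: `declared_charset` and `decode_like_text` are hypothetical names, and the header parsing here uses `email.message.Message` rather than requests' own logic.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_like_text(content: bytes, declared: Optional[str], guessed: str) -> str:
    """Use the declared encoding if there is one, otherwise fall back to the guess."""
    encoding = declared if declared is not None else guessed
    return content.decode(encoding, errors="replace")


# The header from the logs carries a version parameter but no charset.
header = "application/taxii+json;version=2.1"
print(declared_charset(header))

# With nothing declared, the result depends entirely on the guess.
print(repr(decode_like_text(b"{}", declared_charset(header), "ascii")))
print(repr(decode_like_text(b"{}", declared_charset(header), "utf_16_be")))
```

Under this model, the same two bytes yield either the expected text or a single CJK character, depending only on which encoding the detector proposes.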

BEAdi commented 1 year ago

When we also print resp.apparent_encoding, we get utf_16_be. Is there anything else you can think of? We have also seen the Chinese character returned in another case when using the library.

JasonKeirstead commented 1 year ago

The bug is in the TAXII server, if it is not setting the response encoding to UTF-8.

chisholm commented 1 year ago

Well, utf_16_be might be incorrect. This has gone beyond being a cti-taxii-client issue. This library relies on the requests library as mentioned above, to handle the lower-level HTTP request/response details. If the server does not tell the client what encoding it uses (JasonKeirstead's point above), the client must guess, and it is possible to guess wrong. If you don't have control over the server, I guess there's not much you can do about that.

Looks like by default, requests uses charset_normalizer to detect encodings. It calls a detect() method, but that is a legacy wrapper around from_bytes(). The latter has an interesting explain argument, which may or may not be useful. It is easy to run a test just from the python REPL:

>>> import charset_normalizer
>>> charset_normalizer.from_bytes(b'{}', explain=True)
2023-01-18 03:55:22,198 | WARNING | override steps (5) and chunk_size (512) as content does not fit (2 byte(s) given) parameters.
2023-01-18 03:55:22,202 | WARNING | Trying to detect encoding from a tiny portion of (2) byte(s).
2023-01-18 03:55:22,204 | INFO | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-01-18 03:55:22,205 | INFO | ascii should target any language(s) of ['Latin Based']
2023-01-18 03:55:22,205 | INFO | ascii is most likely the one. Stopping the process.
<charset_normalizer.models.CharsetMatches object at 0x000001ECF224B4F0>

Ni-Knight commented 1 year ago

What an odd bug :) I think we can try asking them which server they spun up. However, you are right: this is definitely not an issue with the client itself. It also seems that chardet does guess the encoding correctly, as @chisholm stated (and I've also tested it).